pseudoyu


IPFS Distributed Storage Protocol Analysis and Reflection

Introduction#

Recently, I have been working on a Case Study project for my school, which is a music copyright management project based on the Ethereum platform. In this project, we used IPFS distributed file storage technology for uploading music works and copyright proof documents, mainly to detect copyright infringement through its deduplication feature. I became interested in the IPFS system and read the IPFS series articles on the QTech platform (https://tech.hyperchain.cn/tag/ipfs/), as well as some related materials. In this article, I will summarize my findings. If there are any mistakes or omissions, please feel free to correct me.

Overview#

In our daily use of cloud storage and similar services, we mostly access files on specific servers (IP addresses), requesting and downloading them to our local devices over HTTP. This is essentially location addressing: we follow a URL, layer by layer, to a particular file. Convenient as it is, it has several problems. Files depend on specific servers, so once a centralized server crashes or a file is deleted, the content is permanently lost. If the server is far away, or many people request the same file at once, access is slow. The same file may also be stored redundantly on different servers, wasting resources. And there are serious security risks: DDoS, XSS, CSRF, and similar attacks can all threaten the files' availability and integrity.

Is there a better solution?

Imagine if we store files in a distributed network where each node can store files, and users can request files from the nearest nodes through a directory-like indexing method. This is the solution proposed by IPFS (InterPlanetary File System), which is a peer-to-peer hypermedia file storage, indexing, and exchange protocol initiated by Juan Benet in May 2014.

Features#

IPFS aims to connect all computing devices running the same file system into a single distributed network, replacing the traditional centralized server model. Every node can store files, and users locate and retrieve files through a Distributed Hash Table (DHT), giving faster and more secure access.

Files stored through IPFS are addressed by the hash of their content: each block is hashed, and that hash becomes the block's address. This is decentralized content addressing rather than location addressing. Because the address is a cryptographic hash of the data itself, even a tiny change produces a completely different hash, so it is easy to detect whether content has been tampered with without inspecting the file itself.
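The effect is easy to demonstrate with any cryptographic hash. A minimal Python sketch, using SHA-256 (the default hash function in IPFS):

```python
import hashlib

# Hash two payloads that differ by a single byte; the digests are
# completely different, so the address itself proves integrity.
original = b"IPFS addresses content, not locations."
tampered = b"IPFS addresses content, not locations!"

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(tampered).hexdigest()

print(h1)
print(h2)
print(h1 != h2)  # True: a one-byte change yields an unrelated digest
```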

Unlike the traditional server model, IPFS is one unified network, so identical content is not stored redundantly (duplicates are detected by hash), greatly saving network-wide resources and improving efficiency. Moreover, in theory, once the network reaches a certain scale, files are preserved indefinitely, and the same file can be downloaded from multiple (and nearer) nodes, improving transfer efficiency.

In addition, because storage is distributed across many nodes, IPFS naturally mitigates DDoS and other attacks that target a single server.

Functionality#

In addition to file storage, IPFS also has functions such as DHT networking, Bitswap file exchange, etc., which will be explained in separate blog posts.

Working Principles#

As a file storage system, the two most basic operations are uploading and downloading files. Let's discuss the principles of each operation.

IPFS add Command#

How does the upload operation work in the IPFS system?

In the IPFS file storage system, whenever a new file is uploaded, the system splits it into blocks of up to 256KB. Each block is hashed, and the block hashes are collected into an array. The system then hashes that array to obtain the file's root hash, and combines the root hash with the array of block hashes into an index object. Each block, and the index object itself, is identified by a CID (Content Identifier) derived from its hash. Finally, the blocks and the index object are written to the local IPFS node and made available to the IPFS network.
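The steps above can be sketched in Python. This is a simplified model for illustration only — real IPFS builds a dag-pb MerkleDAG node rather than hashing a flat string of hex digests — and `add_file`, `BLOCK_SIZE`, and the blockstore dict are illustrative names, not IPFS APIs:

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # IPFS's default chunk size: 256 KB

def add_file(data: bytes):
    """Toy version of `ipfs add`: split the file into 256 KB blocks,
    hash each block, then hash the concatenated block hashes to get
    a root hash. Real IPFS builds a dag-pb MerkleDAG node instead."""
    blockstore = {}
    block_hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        blockstore[h] = block            # identical blocks collapse to one entry
        block_hashes.append(h)
    index = "".join(block_hashes).encode()
    root = hashlib.sha256(index).hexdigest()
    blockstore[root] = index             # the index object is stored as a block too
    return root, blockstore

data = b"x" * (256 * 1024) + b"y" * (256 * 1024) + b"z" * 1024
root, store = add_file(data)
print(root)
print(len(store))  # 4: three distinct blocks plus the index object
```

Because blocks are keyed by their own hash, the blockstore deduplicates for free: adding the same content twice yields the same root and stores nothing new.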

There are two notable cases during file uploads: 1. If a file is very small (less than 1KB), it does not occupy a full block; it is stored directly along with its hash. 2. If a file is very large — say a 1GB video was uploaded earlier and a few KB of subtitles are later appended — the unchanged 1GB portion is not allocated new space; only the appended subtitle portion gets new blocks and a new hash.

Therefore, even the same part of different files will only be stored once. The indexes of many files will point to the same block, forming a MerkleDAG data structure.
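The append scenario above can be illustrated with the same chunk-and-hash idea. Here `block_hashes` is a hypothetical helper, and the real chunker may split data differently; the point is only that unchanged blocks keep their hashes and are reused:

```python
import hashlib

BLOCK = 256 * 1024

def block_hashes(data: bytes) -> list[str]:
    """Hash each 256 KB chunk of data (toy stand-in for the IPFS chunker)."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

video = b"\x00" * (1024 * 1024)             # stand-in for a previously added 1 MB file
video_plus_subs = video + b"subtitle data"  # same file with a small appendix

old = block_hashes(video)
new = block_hashes(video_plus_subs)

# The first four 256 KB blocks are byte-identical, so their hashes (and
# the stored blocks they point to) are reused; only the last block is new.
print(new[:4] == old[:4])        # True
print(len(new) - len(old))       # 1: a single new block for the appendix
```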

It is worth noting that when a node executes the add command, the data is stored in the local blockstore but is not pushed to the rest of the IPFS network until another node actually requests the blocks. IPFS is therefore not an automatically replicating distributed database; this design conserves network bandwidth and avoids unnecessary transfers.

Another detail: when a node executes the add command, it broadcasts its block information and also maintains a list of all block requests it has received from peers. Once newly added data matches an entry on that list, the node proactively sends the data to the requesting peer and updates the list.

IPFS get Command#

After uploading a file, how do we access and retrieve it?

This is related to the IPFS indexing structure called the Distributed Hash Table (DHT). By accessing the DHT, we can quickly retrieve the data.

In the IPFS system, all nodes connected to the current node form its swarm. When a node issues a file request (i.e., get), it first checks its local blockstore for the requested data. If the data is not there, it sends a request to the swarm, using DHT routing to locate a node that has it.

How does the network know which node(s) in the network have the requested file?

As mentioned earlier in the add command, when a node joins the IPFS network, it informs other nodes about the content it stores (through broadcasting DHT). So whenever a user wants to retrieve content that happens to be stored on a particular node, other nodes will tell the user to retrieve the content from that node.

Once a node holding the data is found, it sends the requested blocks back to the requesting node, which caches them in its own blockstore. The network now holds one more copy of the data, so subsequent requests are served more easily. This is also why data is hard to lose: as long as a single node still stores it, the whole network can reach it.
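A toy model of this lookup-and-cache behavior, with `Node`, its `blockstore` dict, and the CID string all invented for illustration (real IPFS routes requests through the DHT rather than scanning its peer list):

```python
class Node:
    """Toy IPFS node: a local blockstore plus a set of peers (the swarm)."""

    def __init__(self, name: str):
        self.name = name
        self.blockstore = {}
        self.peers = []

    def get(self, cid: str) -> bytes:
        # 1. Check the local blockstore first.
        if cid in self.blockstore:
            return self.blockstore[cid]
        # 2. Otherwise ask the swarm for the block.
        for peer in self.peers:
            if cid in peer.blockstore:
                data = peer.blockstore[cid]
                self.blockstore[cid] = data  # cache: the network gains a copy
                return data
        raise KeyError(f"{cid} not found in the network")

a, b, c = Node("a"), Node("b"), Node("c")
a.peers, c.peers = [b, c], [b]
b.blockstore["Qm-example"] = b"hello"    # hypothetical CID, for illustration

print(a.get("Qm-example"))               # fetched from b, then cached on a
print("Qm-example" in a.blockstore)      # True: a can now serve this block too
```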

In the project, uploaded files can be directly accessed through the ipfs.io gateway, similar to a website address like https://ipfs.io/ipfs/Qm...... How does this work?

The ipfs.io gateway is itself an IPFS node. When we open the link above, we are sending an HTTP request to that node. The gateway locates the blocks via DHT routing in the swarm and requests them from whichever node has the data (if the file was only just added locally with the add command, this request is what actually pulls it into the wider network). The gateway then caches a copy and returns the data over HTTP, which lets us view the file directly in the browser!

When any other machine accesses this link through a browser, because the ipfs.io gateway has already cached the file, it does not need to request the data from the original node again. It can directly return the data from the cache to the browser.

Content Identifier (CID)#

Now let's consider another issue. Common image file formats include .jpg and .png, while common video formats include .mp4. These formats can be determined directly from the file extension. Files uploaded through IPFS can be of various types and contain a lot of information. How do we distinguish them?

In the early days, IPFS mainly used base58btc-encoded multihashes as addresses. However, during the development of IPLD (the layer that defines and models IPFS data), many format-related problems surfaced, so a self-describing file addressing format called CID was introduced to manage data in different formats. The official definition is:

CID is a self-describing content-addressed identifier that must use a cryptographic hash function to obtain the address of the content.

In simple terms, CID describes the content of a file through certain mechanisms, including version information and format.

CID Structure#

Currently, there are two versions of CID: v0 and v1. v1 version CID is generated by V1Builder.

<cidv1> ::= <mb><version><mcp><mh>
# or, expanded:
<cidv1> ::= <multibase-prefix><cid-version><multicodec-packed-content-type><multihash-content-address>

As the grammar above shows, CID is built on the multiformats family of specifications: multibase-prefix indicates how the CID is encoded as a string; cid-version indicates the CID version; multicodec-packed-content-type indicates the type and format of the content (similar to a file extension, but part of the identifier itself, with a fixed set of supported formats that users cannot freely change); and multihash-content-address is the content hash (allowing CIDs to use different hash functions).

Currently, CID supports multicodec-packed encoding formats such as native protobuf, IPLD CBOR, git, Bitcoin, and Ethereum objects, and is gradually developing support for more formats.

CID code explanation:

type Cid struct {
	str string
}

type V0Builder struct{}

type V1Builder struct {
	Codec    uint64
	MhType   uint64
	MhLength int // Default: -1
}

Codec represents the encoding type of the content, such as DagProtobuf, DagCBOR, etc. MhType represents the hash algorithm, such as SHA2_256, SHA2_512, SHA3_256, SHA3_512, etc. MhLength represents the length of the generated hash.

A v0 CID is generated by V0Builder and always starts with the string Qm. It exists for backward compatibility: multibase is fixed to base58btc, the multicodec is fixed to protobuf (dag-pb), and cid-version is fixed to cidv0, so a v0 CID is just the bare multihash: cidv0 ::= <multihash-content-address>.
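Why every v0 CID starts with Qm can be seen by encoding a sha2-256 multihash by hand: the multihash prefixes the 32-byte digest with the code 0x12 (sha2-256) and the length 0x20, and base58btc always maps those two leading bytes to the characters "Qm". A sketch with a hand-rolled base58 encoder — note that real IPFS hashes a dag-pb-wrapped UnixFS node, not the raw bytes, so the CID below will not match `ipfs add` output:

```python
import hashlib

BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    """Encode bytes in the base58btc alphabet (leading zeros become '1')."""
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = BASE58[r] + out
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + out

def cidv0(content: bytes) -> str:
    """Toy CIDv0: base58btc of the sha2-256 multihash of the content."""
    digest = hashlib.sha256(content).digest()
    multihash = bytes([0x12, 0x20]) + digest  # 0x12 = sha2-256, 0x20 = 32 bytes
    return base58btc(multihash)

print(cidv0(b"hello ipfs"))  # a 46-character string starting with "Qm"
```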

Design Philosophy#

Thanks to CID's compact binary representation, file hashes are encoded efficiently, making it practical to include a CID directly in a URL. Multibase encodings such as base58btc keep CIDs short and easy to transmit. A CID can describe content in any format hashed with any function, which makes it very flexible, and the cid-version field allows the encoding to be upgraded over time without being constrained by historical content.

IPNS#

As mentioned earlier, changing a file's content in IPFS changes its hash. In practice — hosting a website on IPFS, for example, or any application that ships version updates — it is unreasonable to ask users to fetch a new hash after every update. A mapping layer is therefore needed so that users can always visit one fixed address.

IPNS (the InterPlanetary Name System) provides such a service. It uses a hash ID (usually the node's PeerID, controlled by its private key) as a stable name pointing at a specific IPFS file. When the file is updated, the key holder republishes the name so that it points at the new hash.

Even though the hash value can remain unchanged, it is still not convenient for memorization and input. Therefore, a further solution is needed.

IPNS is also compatible with DNS. It can use DNS TXT records to map domain names to IPNS hash IDs, allowing domain names to replace IPNS hash IDs for easier access and memorization.

Conclusion#

The above is a summary of the principles of IPFS distributed storage. There are many aspects worth exploring further, such as its components, storage process details, garbage collection mechanism, data exchange module Bitswap, network, and practical application scenarios.

Recommended reading: QTech platform's "IPFS series articles" (https://tech.hyperchain.cn/tag/ipfs/)

References#

  1. IPFS Official Website (https://ipfs.io)
  2. "How IPFS Stores Files" (https://tech.hyperchain.cn/ipfs/), QTech, Hyperchain Technology
  3. "How Does IPFS Work?" (https://cloud.tencent.com/developer/news/277198), Zhihui
  4. "Understanding IPFS from the Perspective of Web3.0" (https://learnblockchain.cn/2018/12/12/what-is-ipfs), Tiny Xiong, Denglian Community
  5. "IPFS CID Research" (https://medium.com/@kidinamoto/ipfs-cid-研究-717c4ceb14a0), Sophie Huang