Introduction#
Recently, I received a work task to change the data structure of the project's smart contract state tree from a red-black tree to a trie, and compare the performance of the two data structures. The trie is mainly based on Ethereum's official Java implementation ethereum/ethereumj, while the red-black tree is self-implemented. This article records the theoretical and practical performance comparison of the two data structures.
Data Structures#
Red-Black Tree#
A red-black tree is an approximately balanced binary search tree that contains red and black nodes, ensuring that the height difference between any two nodes is less than twice.
Properties#
The red-black tree must satisfy the following five properties:
- Nodes are either red or black.
- The root node is black.
- Leaf nodes (NIL) are black.
- Both children of every red node are black.
- Every path from a node to its descendant leaf nodes contains the same number of black nodes.
Red-black trees are not perfectly balanced, but the number of levels in the left subtree and the right subtree is equal. Therefore, they are also known as perfectly black-balanced trees. Because they are approximately balanced, the frequency of rotations is reduced, the maintenance cost is lowered, and the time complexity is maintained at LogN.
Operations#
Red-black trees are mainly balanced through three operations:
- Left rotation
- Right rotation
- Recoloring
Comparison with AVL Trees#
- AVL trees provide faster search operations (due to perfect balance).
- Red-black trees provide faster insertion and deletion operations.
- AVL trees store more node information (balance factor and height), so they occupy more storage space.
- When there are more read operations and fewer write operations, AVL trees are more suitable and are commonly used in databases. When there are more write operations, red-black trees are generally used because they are concise and easy to implement, and are commonly used in various high-level language libraries such as maps and sets.
Code Implementation#
Due to the complexity of red-black trees, the implementation code has been uploaded to GitHub for learning and viewing.
Trie#
A trie, also known as a prefix tree or a keyword tree, is commonly used for storing and sorting a large number of strings, such as text disk statistics in search engines.
It can minimize unnecessary string comparisons and has high query efficiency.
Properties#
- Nodes do not store complete words.
- The characters along the path from the root to a certain node form the string corresponding to that node.
- The paths of all child nodes of a node represent different characters.
- Nodes can store additional information, such as word frequency.
Internal Node Structure#
Tries have a low height but occupy a large amount of storage space. The core idea is to trade space for time by utilizing the common prefixes of strings to reduce query time and improve efficiency. It can naturally solve word association and other business scenarios.
Code Implementation#
class Trie {
private Trie[] children;
private boolean isEnd;
public Trie() {
children = new Trie[26];
isEnd = false;
}
public void insert(String word) {
Trie node = this;
for (int i = 0; i < word.length(); i++) {
char ch = word.charAt(i);
int index = ch - 'a';
if (node.children[index] == null) {
node.children[index] = new Trie();
}
node = node.children[index];
}
node.isEnd = true;
}
public boolean search(String word) {
Trie node = searchPrefix(word);
return node != null && node.isEnd;
}
public boolean startsWith(String prefix) {
return searchPrefix(prefix) != null;
}
private Trie searchPrefix(String prefix) {
Trie node = this;
for (int i = 0; i < prefix.length(); i++) {
char ch = prefix.charAt(i);
int index = ch - 'a';
if (node.children[index] == null) {
return null;
}
node = node.children[index];
}
return node;
}
}
Modified Merkle Patricia Tries#
Ethereum Account State Storage#
- Using a key-value hash table stored in each block, new transactions are packed into the block every time, which changes the Merkle tree. However, in fact, only a small number of accounts change, resulting in high costs.
- Storing accounts directly in the Merkle tree and modifying the Merkle tree when changing content is not feasible because the Merkle tree does not provide a high-performance lookup and update method.
- Using a sorted Merkle tree is also not feasible because the addresses generated by new accounts are random and require reordering upon insertion.
MPT Structure#
Utilizes the characteristics of the trie structure.
- The trie structure remains unchanged even when the order is disrupted, naturally sorted. Even when inserting new values, it does not affect the structure, making it suitable for Ethereum's account-based structure.
- It has good update locality, so there is no need to traverse the entire tree when updating.
However, the trie structure is relatively wasteful of storage space and has lower efficiency when the key-value pairs are sparsely distributed. In Ethereum, the account address is a 40-digit hexadecimal number, and there are approximately 2^160 possible addresses, making it extremely sparse (to prevent hash collisions).
Therefore, it is necessary to compress the trie structure, which is called the Patricia Trie. After compression, the height of the tree is significantly reduced, and both space and efficiency are improved.
Modified MPT Structure#
The structure actually used in Ethereum is the Modified MPT structure, which is as follows:
When a new block is published, the values of new nodes in the state tree change, rather than modifying the original values. Instead, new branches are created while preserving the original state (thus enabling rollbacks).
In the Ethereum system, forking is common, and the data in orphan blocks needs to be rolled back. Due to the presence of smart contracts in ETH, in order to support the rollback of smart contracts, the previous state must be maintained.
Code Implementation#
The code refers to Ethereum's Java implementation.
Summary#
The above is an analysis of the Ethereum MPT
and red-black tree data structures. When I was struggling with LeetCode, I often thought that I wouldn't use these things, but unexpectedly, there are application scenarios so soon. Still, it is important to understand and practice them!