Reducing state size on Mir

Daniel Lubarov, Brendan Farmer & William Borgeaud · 2021-01-25

TLDR: We describe a state size optimization for Mir which greatly reduces the amount of state required for validation. In this model, validators only need to store about five bits per active account.

The case for stateless blockchains

Today, Ethereum clients store around 50 gigabytes of data which they use to validate transactions.[1] While this is manageable for the time being, scaling beyond Ethereum's meager ~10 transactions per second[2] will require a fundamentally different approach.

In traditional blockchain designs like Ethereum's, validating a transaction requires accessing the state of each account[3] that the transaction interacts with. To facilitate efficient validation, nodes simply store the state of all active accounts, which gets expensive as the account set grows. Fortunately, there is a better way.

The idea of a stateless blockchain is to move account data off-chain. Accounts owned by an individual can simply be stored on that individual's device. Shared accounts, such as oracles, can be stored on IPFS or other storage layers. Mir validators are agnostic here: they are not required to store any account data, or to know where it is stored.

While validators do not need to store account state, they do need a way to verify claims about account state. For this, we turn to authenticated data structures, namely Merkle trees.

Avoiding stale witnesses

A natural approach is to have users store Merkle witnesses for the accounts they own, and include them as part of a transaction. Unfortunately, Merkle witnesses become stale whenever the Merkle tree is modified, which occurs at every new block. There have been proposals for synchronizing Merkle witnesses, such as EDRAX,[4] but requiring regular witness synchronization raises serious practical concerns.

To avoid this dilemma, we turn to asynchronous accumulators, as described by Reyzin and Yakoubov.[5] They suggest an append-only Merkle tree structure, similar to Peter Todd's earlier idea of a Merkle Mountain Range (MMR).[6] Each mountain in an MMR contains a power-of-two number of elements, corresponding to a 1 in the binary representation of the MMR size. To illustrate, if we had 14 leaves (1110 in binary), they would be grouped into mountains of size 8, 4, and 2:

   /\
  /  \
 /\  /\    /\
/\/\/\/\  /\/\  /\
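The grouping can be read straight off the bit pattern: each set bit in the leaf count contributes one mountain. A minimal Python sketch (`mountain_sizes` is an illustrative helper, not part of Mir):

```python
def mountain_sizes(n: int) -> list[int]:
    """Sizes of the mountains in an MMR with n leaves: one mountain of
    size 2^i for each set bit i in the binary representation of n,
    largest first."""
    return [1 << i for i in range(n.bit_length() - 1, -1, -1) if n >> i & 1]

print(mountain_sizes(14))  # 14 = 0b1110 -> [8, 4, 2]
```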


With MMRs, the root of each mountain is required to authenticate operations. The benefit is that since mountains are immutable, a witness for proving membership within a mountain never becomes stale. It does need to be extended when the containing mountain merges with a neighboring mountain, but this occurs only a logarithmic number of times, and there are practical ways to reduce or eliminate the need for witness synchronization.[7]

Since MMRs are append-only, we use an MMR as a log of historical account states, rather than a set of current account states. We refer to this structure as the Commitment Log.

Each block appends a subtree consisting of newly-created account states to the Commitment Log.
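As a rough sketch of this append-and-merge behavior, here is a toy MMR that tracks only its peaks. SHA-256 stands in for whatever hash Mir actually uses, and `CommitmentLog` is an illustrative name, not Mir code:

```python
import hashlib

def H(data: bytes) -> bytes:
    # Stand-in hash; Mir would likely use a SNARK-friendly hash such as Rescue.
    return hashlib.sha256(data).digest()

class CommitmentLog:
    """Toy append-only MMR: we keep only the mountain peaks, merging the
    two rightmost peaks whenever they reach the same height."""
    def __init__(self):
        self.peaks = []  # list of (height, digest), tallest first

    def append(self, leaf: bytes):
        self.peaks.append((0, H(leaf)))
        while len(self.peaks) >= 2 and self.peaks[-1][0] == self.peaks[-2][0]:
            h, right = self.peaks.pop()
            _, left = self.peaks.pop()
            self.peaks.append((h + 1, H(left + right)))

log = CommitmentLog()
for i in range(14):
    log.append(i.to_bytes(8, "big"))
print([h for h, _ in log.peaks])  # mountain heights: [3, 2, 1]
```

Note that appending never modifies an existing mountain's interior, which is why membership witnesses within a mountain stay valid.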

State liveness

The Commitment Log enables users to prove that their account had a given state at some point in time, but not that it is the current state. We need to augment the Commitment Log with a second structure, which we will call the Liveness Mask, to track whether each state in the log is presently active.

A simple approach would be to store a separate vector of bits, where the $i$th bit indicates the liveness (or lack thereof) of the $i$th account in the Commitment Log. Such a vector would have far more zeros than ones, however, so it is better to use a sparse structure. If we were focused solely on space efficiency, we could encode our bit vector as a sequence of zero-run lengths, then use Huffman coding to encode these lengths using the minimal number of bits. This is similar to the Modified Huffman coding that fax machines use to encode black-and-white bitmaps.
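The run-length step can be sketched as follows; `zero_run_lengths` is an illustrative helper, and the Huffman stage over the resulting lengths is omitted:

```python
def zero_run_lengths(bits: list[int]) -> list[int]:
    """Encode a sparse bit vector as the lengths of the zero runs between
    consecutive one bits. The original vector is recoverable: the number
    of ones is len(runs) - 1, and each run gives the gap before the next one."""
    runs, run = [], 0
    for b in bits:
        if b:
            runs.append(run)
            run = 0
        else:
            run += 1
    runs.append(run)  # length of the trailing run of zeros
    return runs

print(zero_run_lengths([0, 0, 1, 0, 0, 0, 1, 1, 0]))  # [2, 3, 0, 1]
```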

While this structure would be highly space-efficient, it would not support authenticated operations. In particular, when an account changes state, we would like an efficient way to prove the new state of the Liveness Mask. We could use a standard Merkle tree, where each leaf is a zero or one, indicating whether the associated leaf in the Commitment Log is active. Since zeros will be much more common, though, it is better to use a sparse Merkle tree[8] (SMT), where all-zero subtrees are not explicitly stored.
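A sketch of the zero-subtree trick, with SHA-256 standing in for the real hash and `smt_root` as a hypothetical helper. The key observation is that an all-zero subtree's digest depends only on its height, so it can be looked up rather than recomputed:

```python
import hashlib

def H(b: bytes) -> bytes:
    # Stand-in hash; the real scheme would use a SNARK-friendly hash.
    return hashlib.sha256(b).digest()

def smt_root(height: int, live: set[int], zero=None) -> bytes:
    """Root of a sparse Merkle tree of the given height whose leaf i is
    one iff i is in `live`. All-zero subtrees reuse a precomputed digest,
    so the work is proportional to the number of live leaves and the
    height, not to 2^height."""
    if zero is None:  # zero[h] = digest of the all-zero subtree of height h
        zero = [H(b"\x00")]
        for _ in range(height):
            zero.append(H(zero[-1] + zero[-1]))
    if not live:
        return zero[height]
    if height == 0:
        return H(b"\x01")
    half = 1 << (height - 1)
    left = {i for i in live if i < half}
    right = {i - half for i in live if i >= half}
    return H(smt_root(height - 1, left, zero) + smt_root(height - 1, right, zero))

empty = smt_root(40, set())       # O(height) work despite 2^40 leaves
single = smt_root(40, {123_456})  # a single live leaf: still O(height)
```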

The scheme described so far is already fairly compact. Let's imagine that our Commitment Log contains $2^{40}$ (just over a trillion) commitments, and $2^{36}$ of them are currently active. This implies an SMT height of 40, but since we have only $2^{36}$ non-zero leaves, the SMT will have at most $5 \cdot 2^{36} - 1$ nodes,[9] or ~5 nodes per active account. With a space-efficient trie implementation such as PATRICIA,[10] this should take just a few bytes per active account. Still, we can do better.

A hybrid structure

To get the best of both worlds, we can use a partially Merklized structure. We divide the Liveness Mask into chunks of size $2^k$, and compress each chunk using the modified Huffman coding. We store Merkle data only for the upper $h - k$ layers of the tree, where $h$ is the tree height, while lazily computing Merkle data for the $k$ lower layers. To create or verify a Merkle proof, we must decompress the corresponding chunk and recompute its Merkle data, but for small $k$ this takes negligible time.
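A sketch of the hybrid idea, using SHA-256 as a stand-in hash and Python sets in place of the run-length/Huffman chunk encoding (`HybridMask` is illustrative, not Mir's implementation):

```python
import hashlib

def H(b: bytes) -> bytes:
    # Stand-in hash; the real scheme would use a SNARK-friendly hash.
    return hashlib.sha256(b).digest()

class HybridMask:
    """Liveness Mask sketch: each chunk of 2^k bits is kept in compressed
    form (here, a set of live offsets), and the chunk's k lower Merkle
    layers are recomputed on demand. Only the upper h - k layers of the
    tree would be stored persistently."""
    def __init__(self, k: int):
        self.k = k
        self.chunks = {}  # chunk index -> set of live offsets within the chunk

    def set_live(self, i: int):
        self.chunks.setdefault(i >> self.k, set()).add(i & ((1 << self.k) - 1))

    def chunk_root(self, c: int) -> bytes:
        # Lazily Merklize one chunk: 2^k leaf hashes folded up k layers.
        live = self.chunks.get(c, set())
        level = [H(b"\x01" if j in live else b"\x00") for j in range(1 << self.k)]
        while len(level) > 1:
            level = [H(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        return level[0]

mask = HybridMask(k=4)
mask.set_live(5)   # chunk 0, offset 5
mask.set_live(21)  # chunk 1, offset 5 -- same bit pattern, same chunk root
```

The chunk roots would then feed into the stored upper layers of the tree.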

In practice, this works nicely for values of $k$ around 10-12, which is large enough to get close to the compactness of modified Huffman coding. Lazily computing $2^{12}$ hashes may seem expensive (particularly if using a hash like Rescue), but we can mitigate the cost by precomputing or caching certain digests. For example, we could precompute the digest of every possible height-4 subtree ($2^{16}$ possible 16-leaf configurations, at 32 bytes per digest), which fits in 2 megabytes, in order to reduce the amount of lazy hashing by a factor of ~16.

Other optimizations

Suppose that I own $n$ active accounts. Under the aforementioned scheme, we would have $n$ associated one bits in the Liveness Mask. We can improve upon this by bundling my $n$ accounts into a single "meta" account. Our Commitment Log would contain a single entry for the latest state of this meta account, which could be a Merkle root of a small tree containing my $n$ accounts. The Liveness Mask, in turn, would contain a single one bit. (Note that we wouldn't want to bundle accounts owned by different users, as one user's transactions would invalidate other users' Merkle witnesses.)
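The bundling step might look like this; `meta_account_root` is a hypothetical helper with SHA-256 as a stand-in hash, and padding and domain separation are glossed over:

```python
import hashlib

def H(b: bytes) -> bytes:
    # Stand-in hash; the real scheme would use a SNARK-friendly hash.
    return hashlib.sha256(b).digest()

def meta_account_root(states: list[bytes]) -> bytes:
    """Bundle one user's account states into a single 'meta' commitment:
    the Merkle root of a small tree over the states. The Commitment Log
    then carries one entry, and the Liveness Mask one bit, for all of
    this user's accounts."""
    level = [H(s) for s in states]
    while len(level) > 1:
        if len(level) % 2:  # pad odd levels by duplicating the last node
            level.append(level[-1])
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

When any one of the bundled accounts changes, the user recomputes this small root and commits the new meta state, touching a single log entry.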

The Liveness Mask could also be sharded to reduce nodes' storage requirements even further, but this might be overkill given how space-efficient this model already is.

Proof of concept

We wrote a proof of concept to measure the space consumed by the Liveness Mask. Here is a sample run, where we have 1,000,000 commitments in the log, and 100,000 (or 10%) of them are live: