Reducing state size on Mir

Daniel Lubarov, Brendan Farmer & William Borgeaud · 2021-01-25

TLDR: We describe a state size optimization for Mir which greatly reduces the amount of state required for validation. In this model, validators only need to store about five bits per active account.

The case for stateless blockchains

Today, Ethereum clients store around 50 gigabytes of data which they use to validate transactions.1 While this is manageable for the time being, scaling beyond Ethereum's meager ~102 transactions per second will require a fundamentally different approach.

In traditional blockchain designs like Ethereum's, validating a transaction requires accessing the state of each account3 that the transaction interacts with. To facilitate efficient validation, nodes simply store the state of all active accounts, which gets expensive as the account set grows. Fortunately, there is a better way.

The idea of a stateless blockchain is to move account data off-chain. Accounts owned by an individual can simply be stored on that individual's device. Shared accounts, such as oracles, can be stored on IPFS or other storage layers. Mir validators are agnostic here: they are not required store any account data, or to know where it is stored.

While validators do not need to store account state, they do need a way to verify claims about account state. For this, we turn to authenticated data structures, namely Merkle trees.

Avoiding stale witnesses

A natural approach is to have users store Merkle witnesses for the accounts they own, and include them as part of a transaction. Unfortunately, Merkle witnesses become stale whenever the Merkle tree is modified, which occurs at every new block. There have been proposals for synchronizing Merkle witnesses, such as EDRAX,4 but requiring regular witness synchronization raises serious practical concerns.

Merkle witness becoming stale

To avoid this dilemma, we turn to asynchronous accumulators, as described by Reyzin and Yakoubov.5 They suggest an append-only Merkle tree structure, similar to Peter Todd's earlier idea of a Merkle Mountain Range (MMR).6 Each mountain in an MMR contains a power-of-two number of elements, corresponding to a 1 in the binary representation of the MMR size. To illustrate, if we had 14 leaves (1110 in binary), they would be grouped into mountains of size 8, 4, and 2:

   /\
  /  \
 /\  /\  /\
/\/\/\/\/\/\/\

With MMRs, the root of each mountain is required to authenticate operations. The benefit is that since mountains are immutable, a witness for proving membership within a mountain never becomes stale. It does need to be extended when the containing mountain merges with a neighboring mountain, but this occurs only a logarithmic number of times, and there are practical ways to reduce or eliminate the need for witness synchronization.7

Since MMRs are append-only, we use an MMR as log of historical account states, rather than a set of current account states. We refer to this structure as the Commitment Log.

Adding to the Commitment Log

Each block appends a subtree to the commitment log consisting of newly-created account states.

State liveness

The Commitment Log enables users to prove that their account had a given state at some point in time, but not that it is the current state. We need to augment the Commitment Log with a second structure, which we will call the Liveness Mask, to track whether each state in the log is presently active.

A simple approach would be to store a separate vector of bits, where the $i$th bit indicates the liveness (or lack thereof) of the $i$th account in the Commitment Log. Such a vector would have far more zeros than ones, however, so it is better to use a sparse structure. If we were focused solely on space efficiency, we could encode our bit vector as a sequence of zero-run lengths, then use Huffman coding to encode these lengths using the minimal number of bits. This is similar to the Modified Huffman coding that fax machines use to encode black-and-white bitmaps.

Commitment Set

While this structure would be highly space-efficient, it would not support authenticated operations. In particular, when an account changes state, we would like an efficient way to prove the new state of the Liveness Mask. We could use a standard Merkle tree, where each leaf is a zero or one, indicating whether the associated leaf in the Commitment Log is active. Since zeros will be much more common, though, it is better to use a sparse Merkle tree8 (SMT), where all-zero subtrees are not explicitly stored.

The scheme described so far is already fairly compact. Let's imagine that our Commitment Log contains $2^{40}$ (just over a trillion) commitments, and $2^{36}$ of them are currently active. This implies a SMT height of 40, but since we have only $2^{36}$ non-zero leaves, the SMT will have at most $5 \cdot 2^{36} - 1$ nodes,9 or ~5 nodes per active account. With a space-efficient trie implementation such as PATRICIA,10 this should take just a few bytes per active account. Still, we can do better.

A hybrid structure

To get the best of both worlds, we can use a partially Merklized structure. We divide the Liveness Mask into chunks of size $2^k$, and compress each chunk using the modified Huffman coding. We store Merkle data only for the $h - k$ upper layers of the tree, while lazily computing Merkle data for the $k$ lower layers. To create or verify a Merkle proof, we must decompress the corresponding chunk and recompute its Merkle data, but for small $k$ this takes negligible time.

In practice, this works nicely for values of $k$ around 10-12, which is large enough to get close to the compactness of modified Huffman coding. Lazily computing $2^{12}$ hashes may seem expensive (particularly if using a hash like Rescue), but we can mitigate the cost by precomputing or caching certain digests. For example, we could precompute the digest of every possible height-4 subtree, which fits in 2 megabytes, in order to reduce the amount of lazy hashing by a factor of ~16.

Other optimizations

Suppose that I own $n$ active accounts. Under the aforementioned scheme, we would have $n$ associated one bits in the Liveness Mask. We can improve upon this by bundling my $n$ accounts into a single "meta" account. Our Commitment Log would contain a single entry for the latest state of this meta account, which could be a Merkle root of a small tree containing my $n$ accounts. The Liveness Mask, in turn, would contain a single one bit. (Note that we wouldn't want to bundle accounts owned by different users, as one user's transactions would invalidate other users' Merkle witnesses.)

The Liveness Mask could also be sharded to reduce nodes' storage requirements even further, but this might be overkill given how space-efficient this model already is.

Proof of concept

We wrote of a proof of concept to measure the space consumed by the Liveness Mask. Here is a sample run, where we have 1,000,000 commitments in the log, and 100,000 (or 10%) of them are live:

$ python3 commitment_set.py 1000000 100000
Total commitments: 1000000
Live commitments: 100000
Fraction live: 10.0%
Compressed size: 473763 bits
Bits per live commitment: 4.73763
Serialization test: pass

As you can see, the Liveness Mask takes only 59 kilobytes, or ~4.7 bits per active active account state.

This proof of concept stores the Liveness Mask with a modified Huffman coding only, without storing any Merkle data. Still, as we saw earlier, we can get quite close to this space efficiency with an appropriate choice of $k$.

Applications to other blockchains

These techniques can be adapted to work with other blockchains like Ethereum. Accountholders could prove their account data as part of each transaction, along with a witness to prove that that account data is contained in the Commitment Log. However, this would have the undesirable side effect of increasing transaction sizes.

With Mir's architecture, there is no such tradeoff. Since we use zero-knowledge proofs to demonstrate the validity of each transaction, there is no need for validators to inspect any account data.

Notes

1

Actual space usage depends on the client software. For example, at the time of writing, Etherscan reports 130gb using Geth, with ~50gb in the current state.

2

Ethereum's transaction volume fluctuates quite a bit, since different types of transactions require different amounts of gas, and the gas limit itself fluctuates as miners vote to adjust it. See Blockchair's chart.

3

Note that in Mir we refer to units of state as "records" - an application's state and logic at a particular moment. Records can simulate mutable accounts as in Ethereum.

7

If we assume that the network will store the top several layers of the Commitment Log, an accountholder can stop synchronizing their witness once the associated mountain reaches that height. Practically speaking, it seems reasonable to assume that the network will store each block root in the Commitment Log, so that no synchronization is necessary once a block is finalized. Assuming a ten-second block time, the space required to store these block roots would grow by just ~101 megabytes per year.

9

The worst-case scenario is that each leaf has a distinct 36-bit prefix. In this case, the SMT will have 37 complete layers with a node count of $2 \cdot 2^{36} - 1$, followed by 3 linear layers with a node count of $3 \cdot 2^{36}$, for a total node count of $5 \cdot 2^{36} - 1$.