## Fast recursive arguments based on Plonk and Halo

### Daniel Lubarov, William Borgeaud & Brendan Farmer · 2020-06-22

**TLDR**: By combining Plonk's permutation argument, Halo's polynomial commitment scheme, and a high-arity circuit model, we are able to recursively verify an argument using around $2^{14}$ gates.

We have a prototype implementation of this scheme in Plonky. While it's not ready for real use yet, we're seeing encouraging results, with recursive proofs taking ~9 seconds on a 6-core laptop.

This post assumes some familiarity with Plonk, Halo and Rescue.

## Batch opening polynomial commitments

Verifying a Plonk argument involves opening several polynomial commitments at a challenge point $x$, and opening one of them at an additional point $y$. The Halo paper describes a protocol for reducing several openings to one, but it involves a significant amount of computation per opening on the verifier's part, which we would like to avoid.

We will instead apply a batching technique described by Izaak Meckler. Let $G$ be a sequence of random generators, and let $X = (x^0, x^1, x^2, \dots)$ and $Y = (y^0, y^1, y^2, \dots)$. The prover first sends each polynomial commitment $c(f_i) = \langle f_i, G \rangle$, along with purported evaluations $z_i = \langle f_i, X \rangle$ and $w_i = \langle f_i, Y \rangle$. The verifier samples random $\alpha, \beta \in \mathbb{F}$ and computes

$$ \begin{aligned} Z &= \sum_i \alpha^i z_i, \\ W &= \sum_i \alpha^i w_i, \\ c(F) &= \sum_i \alpha^i c(f_i), \end{aligned} $$

where $F = \sum_i \alpha^i f_i$. At this point, the prover must argue that $F(x) = Z$ and $F(y) = W$ which, with negligible loss of soundness, reduces to

$$ \left\langle F, \space X + \beta Y \right\rangle = Z + \beta W. $$

The prover sends an inner product argument for the relation above. Verifying it requires knowing $\langle s, G \rangle$ and $\langle s, X + \beta Y \rangle$, where $s$ is as defined in the Halo paper. The former is handled with the usual Halo technique, and the latter can be computed as $\langle s, X \rangle + \beta \langle s, Y \rangle = g(X, u_i) + \beta g(Y, u_i)$.
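The field arithmetic behind this reduction can be checked with a minimal sketch. It assumes a toy prime field and leaves out the commitments and the inner product argument entirely; the modulus `P` and the helper names are illustrative, not taken from Plonky.

```python
P = 2**31 - 1  # toy prime modulus (assumption; real schemes use a much larger field)

def powers(x, n):
    """Return (x^0, x^1, ..., x^(n-1)) mod P."""
    out, acc = [], 1
    for _ in range(n):
        out.append(acc)
        acc = acc * x % P
    return out

def inner(a, b):
    """Inner product <a, b> mod P."""
    return sum(u * v for u, v in zip(a, b)) % P

def batch(fs, x, y, alpha, beta):
    """Fold openings of every f_i at x and y into one inner-product claim.

    Returns (<F, X + beta*Y>, Z + beta*W); the two agree for honest evaluations.
    """
    n = len(fs[0])
    X, Y = powers(x, n), powers(y, n)
    zs = [inner(f, X) for f in fs]  # evaluations z_i = f_i(x)
    ws = [inner(f, Y) for f in fs]  # evaluations w_i = f_i(y)
    a = powers(alpha, len(fs))
    Z = sum(ai * zi for ai, zi in zip(a, zs)) % P
    W = sum(ai * wi for ai, wi in zip(a, ws)) % P
    # F = sum_i alpha^i f_i, combined coefficient by coefficient
    F = [sum(ai * f[j] for ai, f in zip(a, fs)) % P for j in range(n)]
    point = [(xi + beta * yi) % P for xi, yi in zip(X, Y)]
    return inner(F, point), (Z + beta * W) % P

lhs, rhs = batch([[1, 2], [3, 4]], x=2, y=3, alpha=5, beta=7)
assert lhs == rhs
```

In the real protocol the verifier never sees $F$; it only holds $c(F)$, and the inner product argument is what convinces it that the left-hand side matches.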

## Halo's bottleneck: curve multiplication

In Halo, verifying a polynomial commitment's opening involves computing

$$ Q = \sum_{j=1}^k \left( \left[ u_j^2 \right] L_j \right) + P' + \sum_{j=1}^k \left( \left[ u_j^{-2} \right] R_j \right), $$

where $L_j$ and $R_j$ are given by the prover, and $u_j \in \{ 0, 1 \}^\lambda$ are random challenge points.

Each curve multiplication $[r] P$ could be performed with a simple double-and-add algorithm, which would involve $\lambda$ additions and $\lambda$ doublings. Halo does something similar, but consumes two bits at a time and adds one of $\{ P, -P, \phi(P), \phi(-P) \}$, where $\phi$ is an endomorphism which is trivial to compute. This reduces the cost to $\lambda/2$ additions and $\lambda/2$ doublings.

We can do better by noticing that the equation above has the form of a multi-scalar multiplication with $2k$ terms (plus the addition of $P'$). A simple, circuit-friendly MSM optimization is to perform the doublings simultaneously for all terms (cf. simultaneous squaring). This brings the cost down to $k \lambda$ additions and $\lambda / 2$ doublings, or $(k + 1/2) \lambda$ group operations.
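The effect of sharing doublings can be illustrated with a simplified sketch. As assumptions for brevity, it uses the additive group of integers modulo `N` as a stand-in for the curve, consumes one scalar bit per iteration, and omits the endomorphism, so its addition counts differ from the figures above; only the saving in doublings carries over.

```python
N = 1_000_003  # order of a toy additive group standing in for the curve (assumption)

class Counter:
    """Tallies group operations so the two strategies can be compared."""
    def __init__(self):
        self.adds = 0
        self.doubles = 0

def msm_shared_doublings(scalars, points, bits, ops):
    """Compute sum_i [scalars[i]] points[i] with a single shared accumulator."""
    acc = 0  # group identity
    for b in reversed(range(bits)):
        acc = (2 * acc) % N          # one doubling serves every term
        ops.doubles += 1
        for r, p in zip(scalars, points):
            if (r >> b) & 1:
                acc = (acc + p) % N  # one conditional addition per term
                ops.adds += 1
    return acc

def msm_naive(scalars, points, bits, ops):
    """Run a separate double-and-add per term, then sum the results."""
    total = 0
    for r, p in zip(scalars, points):
        acc = 0
        for b in reversed(range(bits)):
            acc = (2 * acc) % N
            ops.doubles += 1
            if (r >> b) & 1:
                acc = (acc + p) % N
                ops.adds += 1
        total = (total + acc) % N
        ops.adds += 1
    return total

shared, naive = Counter(), Counter()
a = msm_shared_doublings([5, 3, 6], [10, 20, 30], 3, shared)
b = msm_naive([5, 3, 6], [10, 20, 30], 3, naive)
assert a == b and shared.doubles < naive.doubles
```

In a circuit the additions would be unconditional, with the bit selecting the operand, but the number of doublings is independent of the number of terms either way.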

## Generalizing Plonk's circuit model

In Plonk's "standard" circuit model, each gate interacts with three wires, $a$, $b$ and $c$, and enforces a single constraint on the contents of those wires,

$$ q_L a + q_R b + q_O c + q_M a b + q_C = 0, $$

where the $q$ values can be thought of as gate configuration flags. As an example, we could create a multiplicative constraint $ab = c$ by configuring $q_M = 1$ and $q_O = -1$, with the other $q$ values being zero.
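As a small illustration, the gate equation can be evaluated directly. This sketch assumes a toy prime modulus `P`; the `MUL` and `ADD` configurations are examples chosen here, with the multiplication gate setting $q_M = 1$ and $q_O = -1$.

```python
P = 101  # toy prime modulus (assumption)

def gate(q_L, q_R, q_O, q_M, q_C, a, b, c):
    """Evaluate the standard Plonk gate polynomial; 0 means the gate is satisfied."""
    return (q_L * a + q_R * b + q_O * c + q_M * a * b + q_C) % P

MUL = dict(q_L=0, q_R=0, q_O=-1, q_M=1, q_C=0)  # enforces a * b = c
ADD = dict(q_L=1, q_R=1, q_O=-1, q_M=0, q_C=0)  # enforces a + b = c

assert gate(**MUL, a=6, b=7, c=42) == 0
assert gate(**ADD, a=6, b=7, c=13) == 0
assert gate(**MUL, a=6, b=7, c=41) != 0  # a wrong product is rejected
```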

This model is nice and simple, but leads to rather high gate counts. Curve operations seem to require around 7 gates, for example, depending on the curve and completeness assumptions. Thankfully, the basic Plonk scheme is highly flexible; there are several generalizations which we can use to achieve lower gate counts.

First, we can use higher-arity gates. If we wanted a single gate to perform a curve operation, for example, we might use an arity of 6. This increases the prover's cost per gate, but it also allows us to dramatically reduce gate counts for certain operations.

Second, we are not limited to a single constraint. At first glance, it might seem as though adding several constraints would require the prover to compute many more FFTs, but this depends on our approach. Let $d$ be the maximum degree of any constraint and $n$ the number of gates. If we extend each polynomial to a domain of size $dn$ upfront, then all constraint-related arithmetic thereafter can be done in point-value form, with no additional FFTs.

Third, not all wires necessarily need to be routed. "Advice" wires are useful for things like purported inverses, or for intermediate results. Advice wires do not contribute to the degree of Plonk's permutation argument, which is often the highest-degree polynomial in Plonk-based constructions.

Finally, since Plonk opens each polynomial at a "shifted" point $g x$ in addition to the challenge point $x$, constraints can operate on the wires of the "next gate" in addition to the "local gate". This is one of the main insights behind TurboPlonk. We take this approach even further, adding additional shifted openings in order to minimize the need for "copying" wires with Plonk's permutation argument.

## Elliptic curve gates

In the MSM algorithm described previously, most steps involve conditionally negating a point, conditionally applying the endomorphism $\phi((x,y)) = (\zeta x, y)$, then adding the modified point to an accumulator. This can be expressed as

$$ \begin{aligned} x_1 &\leftarrow (1 + (\zeta - 1) r_\mathrm{high}) x, \\ y_1 &\leftarrow (2 r_\mathrm{low} - 1) y, \\ (x_3, y_3) &\leftarrow (x_1, y_1) + (x_2, y_2), \end{aligned} $$

where $(x, y)$ is the point being multiplied, $(x_1, y_1)$ is the point to be added to the accumulator, $(x_2, y_2)$ is the old state of the accumulator, $(x_3, y_3)$ is its updated state, and $r_\mathrm{high}, r_\mathrm{low}$ are two consecutive bits of the scalar.

### Explicit formulae

For short Weierstrass curves in affine form, incomplete addition can be computed as

$$ \begin{aligned} \mathrm{inv} &= 1 / (x_1 - x_2), \\ \lambda &= (y_1 - y_2) \mathrm{inv}, \\ x_3 &= \lambda^2 - x_1 - x_2, \\ y_3 &= \lambda(x_1 - x_3) - y_1. \end{aligned} $$

A simple way to arithmetize this computation is to introduce advice wires for the intermediate results like $\lambda$. In particular, our gate could be defined in the following way:

- Routed wires: $x$, $y$, $x_2$, $y_2$, $x_3$, $y_3$, $r_\mathrm{low}$, $r_\mathrm{high}$
- Advice wires: $x_1$, $y_1$, $\mathrm{inv}$, $\lambda$
- Constraints:

$$ \begin{aligned} r_\mathrm{low} (r_\mathrm{low} - 1) &= 0, \\ r_\mathrm{high} (r_\mathrm{high} - 1) &= 0, \\ x_1 &= (1 + (\zeta - 1) r_\mathrm{high}) x, \\ y_1 &= (2 r_\mathrm{low} - 1) y, \\ (x_1 - x_2) \mathrm{inv} &= 1, \\ \lambda &= (y_1 - y_2) \mathrm{inv}, \\ x_3 &= \lambda^2 - x_1 - x_2, \\ y_3 &= \lambda(x_1 - x_3) - y_1. \end{aligned} $$

This may seem like an awful lot of constraints, but low-degree constraints have little bearing on performance, since they only need to be checked at a single challenge point.
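The gate can be exercised on a toy curve to see the constraints at work. The following sketch assumes the curve $y^2 = x^3 + 2$ over $\mathbb{F}_{13}$ with $\zeta = 3$ (a primitive cube root of unity mod 13); these parameters and the function names are chosen for illustration only. It requires Python 3.8+ for modular inverses via `pow`.

```python
P = 13     # toy base field (assumption)
B = 2      # curve y^2 = x^3 + 2 over F_13 (assumption)
ZETA = 3   # primitive cube root of unity mod 13, so phi(x, y) = (ZETA * x, y)

def on_curve(x, y):
    """Check the short Weierstrass equation y^2 = x^3 + B."""
    return (y * y - x * x * x - B) % P == 0

def witness(x, y, x2, y2, r_low, r_high):
    """Honest prover: compute the advice wires and the updated accumulator."""
    x1 = (1 + (ZETA - 1) * r_high) * x % P   # apply phi if r_high = 1
    y1 = (2 * r_low - 1) * y % P             # negate if r_low = 0
    inv = pow((x1 - x2) % P, -1, P)          # raises ValueError if x1 == x2
    lam = (y1 - y2) * inv % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return dict(x1=x1, y1=y1, inv=inv, lam=lam, x3=x3, y3=y3)

def gate_ok(x, y, x2, y2, r_low, r_high, x1, y1, inv, lam, x3, y3):
    """Verifier side: every constraint must evaluate to zero mod P."""
    constraints = [
        r_low * (r_low - 1),
        r_high * (r_high - 1),
        x1 - (1 + (ZETA - 1) * r_high) * x,
        y1 - (2 * r_low - 1) * y,
        (x1 - x2) * inv - 1,
        lam - (y1 - y2) * inv,
        x3 - (lam * lam - x1 - x2),
        y3 - (lam * (x1 - x3) - y1),
    ]
    return all(c % P == 0 for c in constraints)

inputs = dict(x=1, y=4, x2=2, y2=6, r_low=0, r_high=1)
w = witness(**inputs)
assert gate_ok(**inputs, **w) and on_curve(w["x3"], w["y3"])
```

Note that `witness` fails outright when $x_1 = x_2$, mirroring the fact that no valid $\mathrm{inv}$ exists in that case.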

### Exceptional cases

The affine formulae above assume $x_1 \ne x_2$. In the context of the Halo MSMs described earlier, a malicious prover could break this assumption simply by sending identical $L_j$ values in two consecutive rounds, so as a security measure we must verify that $x_1 \ne x_2$. We enforce this with the constraint $(x_1 - x_2) \mathrm{inv} = 1$.

For an honest prover, though, each $L_j$ is independently random since it incorporates a random group element $[l_j] H$. Since $P(x_1 = x_2)$ is on the order of $1/|\mathbb{F}|$ for an honest prover, we can simply abort the protocol in this case, incurring a negligible completeness error.

### Optimizations

In practice, we wouldn't want so many advice wires, as each wire requires the prover to compute a polynomial commitment and contributes to the argument length. $x_1$, $y_1$ and $\lambda$ can simply be inlined. Further, we can combine the accumulators $(x_2, y_2)$ and $(x_3, y_3)$ by constraining the accumulator wires of the "next" gate. Finally, our actual implementation uses a single gate to perform these curve operations while simultaneously verifying the decomposition of each scalar.

## Generating challenges with Rescue

Generating Fiat-Shamir challenges tends to be another bottleneck in recursive circuits. To minimize this cost, we use a duplex (as suggested by Daira Hopwood) with Rescue as the underlying permutation.

Let $M = \bigl( \begin{smallmatrix}A & B \\ C & D\end{smallmatrix}\bigr)$ be an MDS matrix. A single round of a width-2 Rescue permutation can be defined as

$$ \begin{aligned} \operatorname{step}_{i,1}\left(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\right) &= M \begin{bmatrix}x_1^{1/\alpha} \\ x_2^{1/\alpha}\end{bmatrix} + \begin{bmatrix}r_{i,1} \\ r_{i,2}\end{bmatrix}, \\ \operatorname{step}_{i,2}\left(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\right) &= M \begin{bmatrix}x_1^{\alpha} \\ x_2^{\alpha}\end{bmatrix} + \begin{bmatrix}r_{i,3} \\ r_{i,4}\end{bmatrix}, \\ \operatorname{round}_i &= \operatorname{step}_{i,2} \circ \operatorname{step}_{i,1}, \end{aligned} $$

where $r_{i, 1} \dots r_{i, 4}$ are round constants.

Computing $\alpha$th roots deterministically would be expensive, so we ask the prover to supply them via advice wires $y_1$ and $y_2$. Let $z_1$ and $z_2$ denote the output of the round function; then our gate may be defined as

- Routed wires: $x_1$, $x_2$, $z_1$, $z_2$
- Advice wires: $y_1$, $y_2$
- Constraints:

$$ \begin{aligned} x_1 &= y_1^\alpha, \\ x_2 &= y_2^\alpha, \\ z_1 &= A(A y_1 + B y_2 + r_{i,1})^\alpha + B(C y_1 + D y_2 + r_{i,2})^\alpha + r_{i,3}, \\ z_2 &= C(A y_1 + B y_2 + r_{i, 1})^\alpha + D(C y_1 + D y_2 + r_{i,2})^\alpha + r_{i,4}. \end{aligned} $$
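To make the role of the advice wires concrete, here is a minimal sketch over a toy field. It assumes $p = 101$ and $\alpha = 3$ (so that cubing is a permutation), along with a hypothetical MDS matrix and hypothetical round constants; none of these values come from an actual Rescue instantiation. It requires Python 3.8+ for modular inverses via `pow`.

```python
P = 101                            # toy prime with gcd(3, P - 1) = 1 (assumption)
ALPHA = 3
ALPHA_INV = pow(ALPHA, -1, P - 1)  # exponent that computes alpha-th roots mod P
A, B, C, D = 2, 1, 1, 1            # hypothetical 2x2 MDS matrix
R = (5, 17, 23, 42)                # hypothetical round constants r_{i,1} .. r_{i,4}

def rescue_round(x1, x2):
    """Prover side: evaluate one round directly, returning the roots and the output."""
    y1, y2 = pow(x1, ALPHA_INV, P), pow(x2, ALPHA_INV, P)  # step 1 S-box
    s1 = (A * y1 + B * y2 + R[0]) % P
    s2 = (C * y1 + D * y2 + R[1]) % P
    z1 = (A * pow(s1, ALPHA, P) + B * pow(s2, ALPHA, P) + R[2]) % P
    z2 = (C * pow(s1, ALPHA, P) + D * pow(s2, ALPHA, P) + R[3]) % P
    return (y1, y2), (z1, z2)

def gate_ok(x1, x2, y1, y2, z1, z2):
    """Verifier side: only forward powers appear, never a root extraction."""
    s1 = A * y1 + B * y2 + R[0]
    s2 = C * y1 + D * y2 + R[1]
    constraints = [
        x1 - y1**ALPHA,
        x2 - y2**ALPHA,
        z1 - (A * s1**ALPHA + B * s2**ALPHA + R[2]),
        z2 - (C * s1**ALPHA + D * s2**ALPHA + R[3]),
    ]
    return all(c % P == 0 for c in constraints)

(y1, y2), (z1, z2) = rescue_round(8, 27)
assert gate_ok(8, 27, y1, y2, z1, z2)
```

The expensive exponentiation by `ALPHA_INV` happens only in `rescue_round`, which runs outside the circuit; the constraints themselves have degree $\alpha$.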

We could even perform several rounds of Rescue in a single gate, but in practice we are limited by the number of constants available to each gate. A single round already involves 4 constants $r_{i, 1} \dots r_{i, 4}$, and adding more would mean more preprocessed polynomials which must be opened at a challenge point.

As a future optimization, we could interleave Rescue gates with gates that do not require any constants, such as curve operation gates. Each Rescue gate could then utilize the unused constant slots belonging to its neighbor. Alternatively, we could use a different key schedule which didn't require as many constants to be configured.

## A unified constraint set

So far, we have described separate systems of equations for each gate type, but Plonk assumes a single set of constraints. To accomplish this, we combine each constraint with a "filter" expression which evaluates to 0 or 1, indicating whether the constraint should be applied to a given gate index. Our unified constraint set is then simply a sum of filtered constraints.

A simple way to implement this would involve a constant polynomial for each gate type. For example, `is_rescue_gate` could be defined as the polynomial with `is_rescue_gate(g^i) = 1` if and only if gate `i` is a Rescue gate. We use a variety of custom gates, though, and we wouldn't want to open so many constant polynomials.

Instead, we organize gates in the leaves of a binary tree.

We then introduce a constant polynomial $C_i$ for each layer of the tree, and set its value to 0 or 1 to indicate a left or a right turn in the tree. For example, our arithmetic gate has a path of 1001, so its constraints are combined with the filter $C_1(x) (1 - C_2(x)) (1 - C_3(x)) C_4(x)$.
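The filter construction can be sketched as follows. The path `1001` for the arithmetic gate is from the text; the other paths, the constant values and the function names are hypothetical, and the constants are given as plain values at a single gate index instead of as polynomials.

```python
def gate_filter(path, constants):
    """Multiply C_i for each '1' turn and (1 - C_i) for each '0' turn along `path`."""
    result = 1
    for turn, c in zip(path, constants):
        result *= c if turn == "1" else 1 - c
    return result

def unified(constants, gates):
    """Sum of filtered constraints; `gates` maps a tree path to a constraint value."""
    return sum(gate_filter(path, constants) * value for path, value in gates.items())

ARITHMETIC_PATH = "1001"

# Values of C_1 .. C_6 at an arithmetic gate; C_5 and C_6 are free for configuration.
row = [1, 0, 0, 1, 7, 3]
assert gate_filter(ARITHMETIC_PATH, row) == 1
assert gate_filter("1000", row) == 0

# Only the arithmetic gate's constraint survives at this index.
assert unified(row, {"1001": 0, "0": 5, "11": 9}) == 0
```

Because the leaves of a binary tree have prefix-free paths, exactly one filter evaluates to 1 at each gate index, so the constraints of all other gate types vanish there.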

It might seem odd that different gates have different depths in this binary tree. There are two reasons for this. First, a smaller depth results in a lower-degree filter polynomial, and in order to keep our maximum degree small, certain gates with higher-degree constraints must be given lower-degree filters. Second, any constants not used in the filter polynomial are available to be used for gate configuration. Our arithmetic gate has a filter involving only $C_1, \dots, C_4$, for example, so it uses $C_5$ and $C_6$ to configure the type of arithmetic being performed.

## Future improvements

While we have been focused on optimizing our recursive circuit size, there is also plenty of room to speed up our primitives. Plonky is currently 100% Rust, and we expect a major speedup on x86 systems from carry chain optimizations, which the compiler is not capable of.

Our proving time is dominated by multi-scalar multiplications, which we implemented using a variant of Yao's method. We may be able to do better with Pippenger's algorithm, particularly for the IPA reduction which involves variable-base MSMs.

Another potential improvement was suggested by Daira Hopwood. Instead of applying the endomorphism zero or one times based on a bit of the scalar, we could apply it zero, one or two times based on a base-3 limb of the scalar. This would reduce the number of iterations from 64 to 50 while maintaining injectivity, although it becomes more difficult to prove.
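One way to read the iteration counts, assuming 128-bit challenge scalars (which matches 64 iterations at two bits each): every iteration picks a sign and a power of the endomorphism, so it consumes $\log_2$ of the number of choices.

$$ \begin{aligned} \text{sign and } \phi^{0,1}: \quad & \log_2(2 \cdot 2) = 2 \text{ bits per iteration}, & 128 / 2 &= 64, \\ \text{sign and } \phi^{0,1,2}: \quad & \log_2(2 \cdot 3) \approx 2.585 \text{ bits per iteration}, & \lceil 128 / 2.585 \rceil &= 50. \end{aligned} $$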

## Thanks

Thanks to Daira Hopwood, Sean Bowe and Zachary Williamson for helpful pointers and discussions.