GPU Acceleration of Rapidsnark
TL;DR:
In this blog post, we present the motivation, background, and performance results of our GPU acceleration of rapidsnark, which achieves approximately 4x the performance of the CPU version.
Background and motivations
Zero-knowledge proofs have blossomed in the blockchain space.
Plonky2 is one of the most popular STARK frameworks for ZK developers. However, as is widely known, verifying Plonky2 proofs in smart contracts, with their FRI-based polynomial commitments and Fiat-Shamir transcript, is more expensive than verifying Groth16 proofs. Therefore, in Orbiter's business scenario, it is necessary to convert the Plonky2 recursive proof into a Groth16 proof.
Rapidsnark is a fast zkSNARK prover written in C++ that generates proofs for circuits created with circom and snarkjs. However, it runs on the CPU only, where approximately 98% of the proving time is spent on NTT and MSM computations.
In addition, it has been demonstrated that hardware acceleration can handle NTT and MSM very effectively (see ZPrize).
Consequently, we dedicated three weeks to integrating GPU-based acceleration into rapidsnark and have open-sourced the result; the repository can be found at https://github.com/Orbiter-Finance/rapidsnark.
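For context, the stage names that appear in the benchmark logs below map onto the standard Groth16 proving pipeline: several large multi-scalar multiplications over the witness, a batch of iFFT/shift/FFT passes used to move a, b, c onto a coset, a pointwise combination step, and one final MSM. The sketch below is a simplified illustration of that pipeline; all types and function names are placeholders of ours, not rapidsnark's actual API.

```cpp
// Simplified sketch of the Groth16 proving pipeline whose stage names show up
// in the benchmark logs below. All types and functions are illustrative
// placeholders, not rapidsnark's actual API.
#include <vector>

struct Fr { };                         // BN254 scalar field element (placeholder)
struct G1 { };                         // G1 group element (placeholder)
struct G2 { };                         // G2 group element (placeholder)
struct Proof { G1 pi_a; G2 pi_b; G1 pi_c; };

// These are the primitives that dominate the runtime and that the GPU build
// offloads: multi-scalar multiplication (MSM) and the NTT-based iFFT/FFT passes.
G1   msm_g1(const std::vector<G1>&, const std::vector<Fr>&) { return G1{}; }
G2   msm_g2(const std::vector<G2>&, const std::vector<Fr>&) { return G2{}; }
void ifft(std::vector<Fr>&)        { /* evaluations -> coefficients */ }
void coset_shift(std::vector<Fr>&) { /* multiply coefficients by shift powers */ }
void fft(std::vector<Fr>&)         { /* coefficients -> evaluations on the coset */ }

Proof prove(const std::vector<Fr>& witness,
            const std::vector<G1>& pointsA,  const std::vector<G1>& pointsB1,
            const std::vector<G2>& pointsB2, const std::vector<G1>& pointsC,
            const std::vector<G1>& pointsH,
            std::vector<Fr> a, std::vector<Fr> b, std::vector<Fr> c) {
    Proof p;
    // "Multiexp A / B1 / B2 / C": four large MSMs over the witness.
    p.pi_a   = msm_g1(pointsA,  witness);
    G1 pi_b1 = msm_g1(pointsB1, witness);
    p.pi_b   = msm_g2(pointsB2, witness);
    p.pi_c   = msm_g1(pointsC,  witness);

    // "iFFT / Shift / FFT A, B, C": move the evaluations of a, b, c onto a
    // coset so the quotient-related vector can be computed pointwise there.
    for (std::vector<Fr>* poly : { &a, &b, &c }) {
        ifft(*poly);
        coset_shift(*poly);
        fft(*poly);
    }

    // "Start ABC" / "abc": combine a, b, c pointwise on the coset into h.
    std::vector<Fr> h(a.size());
    // ... h[i] derived from a[i]*b[i] - c[i] ...

    // "Multiexp H": one more large MSM over h.
    G1 pi_h = msm_g1(pointsH, h);

    (void)pi_b1; (void)pi_h;  // folded into the final proof together with the
                              // blinding randomness r and s (omitted here).
    return p;
}
```

In the CPU log below, the Multiexp and iFFT/FFT stages account for essentially all of the total proving time, which is why offloading exactly these two primitives to the GPU yields the bulk of the speedup.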
Benchmarks
We ran our benchmarks on a setup containing:
GPU — NVIDIA GeForce RTX 4090 (24 GB)
CPU — 2× AMD EPYC 7763 64-core processors
Execution Time:
Here are the data for both the CPU and GPU versions:
- CPU version test data
init and set str for altBbn128r: 0 ms
get zkey,zkeyHeader,wtns,wtnsHeader: 0 ms
make prover: 64 ms
get wtnsData: 0 ms
Multiexp A: 1816 ms
Multiexp B1: 2020 ms
Multiexp B2: 2520 ms
Multiexp C: 2775 ms
Initializing a b c A: 59 ms
Processing coefs: 593 ms
Calculating c: 49 ms
Initializing fft: 0 ms
iFFT A: 18158 ms
a After ifft: 0 ms
Shift A: 46 ms
a After shift: 0 ms
FFT A: 3188 ms
a After fft: 0 ms
iFFT B: 1805 ms
b After ifft: 0 ms
Shift B: 45 ms
b After shift: 0 ms
FFT B: 750 ms
b After fft: 0 ms
iFFT C: 971 ms
c After ifft: 0 ms
Shift C: 49 ms
c After shift: 0 ms
FFT C: 779 ms
c After fft: 0 ms
Start ABC: 48 ms
abc: 0 ms
Multiexp H: 4720 ms
generate proof: 40708 ms
write proof to file: 0 ms
write public to file: 0 ms
prover total: 41757 ms
- GPU version test data
init and set str for altBbn128r: 0 ms
get zkey,zkeyHeader,wtns,wtnsHeader: 0 ms
make prover: 65 ms
get wtnsData: 0 ms
get MSM config: 0 ms
Multiexp A: 848 ms
Multiexp B1: 679 ms
Multiexp B2: 1080 ms
Multiexp C: 683 ms
Initializing a b c A: 75 ms
Processing coefs: 618 ms
Calculating c: 38 ms
Initializing fft: 148 ms
iFFT A: 291 ms
a After ifft: 0 ms
Shift A: 164 ms
a After shift: 0 ms
FFT A: 294 ms
a After fft: 0 ms
iFFT B: 272 ms
b After ifft: 0 ms
Shift B: 33 ms
b After shift: 0 ms
FFT B: 308 ms
b After fft: 0 ms
iFFT C: 255 ms
c After ifft: 0 ms
Shift C: 28 ms
c After shift: 0 ms
FFT C: 278 ms
c After fft: 0 ms
Start ABC: 46 ms
abc: 0 ms
Multiexp H: 929 ms
generate proof: 7362 ms
write proof to file: 0 ms
write public to file: 0 ms
prover cuda total: 8443 ms
It is worth noting that iFFT A in the CPU version shows an unusually high processing time (18158 ms) compared with the other FFT stages.
How we did it
For converting between rapidsnark's data types and the GPU backend's, we only use reinterpret_cast. Please refer to the specific procedure outlined at https://github.com/Orbiter-Finance/rapidsnark/blob/6deb16bc974e12170e9f1f11ca59985b19d027da/src/groth16_cuda.cu#L56.
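As a simplified illustration of the idea, assuming both sides use an identical in-memory limb layout for BN254 field elements (the precondition for such a cast): the prover's buffers can be handed to the GPU API with a pointer cast instead of an element-by-element conversion. The type and function names below are placeholders and do not match the actual code in groth16_cuda.cu.

```cpp
// Minimal sketch of bridging the host prover's buffers to a GPU backend with
// reinterpret_cast. HostFr and DeviceScalar are illustrative placeholders;
// the cast only makes sense because both structs share the exact same layout.
#include <cstdint>
#include <cstddef>

struct HostFr {            // prover-side field element (4 x 64-bit limbs, placeholder)
    uint64_t limbs[4];
};

struct DeviceScalar {      // GPU-library-side field element (same layout, placeholder)
    uint64_t limbs[4];
};

static_assert(sizeof(HostFr) == sizeof(DeviceScalar),
              "layouts must match exactly for the cast to be meaningful");

// Hypothetical GPU entry point that consumes `n` scalars in the shared layout.
void gpu_msm(const DeviceScalar* scalars, std::size_t n) {
    // ... launch the GPU MSM kernel here (library-specific) ...
    (void)scalars; (void)n;
}

void run_msm_on_gpu(const HostFr* witness, std::size_t n) {
    // No copy, no per-element conversion: the same buffer is reinterpreted in place.
    const auto* scalars = reinterpret_cast<const DeviceScalar*>(witness);
    gpu_msm(scalars, n);
}
```

The point of this approach is that no conversion pass is needed on the host side before calling into the GPU library.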
More Information about Orbiter Finance
- Github: https://github.com/Orbiter-Finance
- Twitter: https://twitter.com/Orbiter_Finance
- Medium: https://orbiter-finance.medium.com/
- Discord: https://discord.gg/jRJKK8avGM
References
[1] https://github.com/iden3/rapidsnark
[2] https://github.com/ingonyama-zk/icicle
[3] https://dev.ingonyama.com/
[4] https://github.com/supranational/sppark
[5] https://en.cppreference.com/w/