Merged

51 commits
All commits are by iakovenkos:

- `d163891` initial clean up (Dec 23, 2025)
- `75ef963` recursive -> iterative (Dec 23, 2025)
- `2f9c59d` add first approximation of docs + rm redundant alias (Dec 24, 2025)
- `45e1a64` tackle issue 1449 (Dec 24, 2025)
- `79e17e2` tackle issue 1449 (Dec 24, 2025)
- `6e24ce6` Merge remote-tracking branch 'origin/merge-train/barretenberg' into s… (Jan 13, 2026)
- `3bd30ca` small refactor (Jan 13, 2026)
- `7ef589a` reapply centralized montgomery conversion (Jan 14, 2026)
- `9a771a6` clean up (Jan 14, 2026)
- `6072169` get_offset_generator out of the loop (Jan 14, 2026)
- `47cae52` revert some branching (Jan 14, 2026)
- `e6f9c8f` fixing magic constants + reusing existing stuff (Jan 14, 2026)
- `794a038` more const updates (Jan 14, 2026)
- `37c3d8b` introduce point schedule entry (Jan 15, 2026)
- `a43c966` consolidated --> nonzero_scalar_indices (Jan 16, 2026)
- `cc17b30` clean up get_work_units (Jan 16, 2026)
- `22f583b` batch msm clean up (Jan 16, 2026)
- `8916b60` evaluate_pippenger_round mutates in-place instead of returning confus… (Jan 16, 2026)
- `ac4f0ab` use uint32_t where possible (Jan 16, 2026)
- `5e8cae1` unfold recursion (Jan 16, 2026)
- `c487e40` use common helper to process buckets (Jan 16, 2026)
- `c8142f0` share logic to produce single point edge case (Jan 16, 2026)
- `8f0dbfc` rm redundant args (Jan 16, 2026)
- `f3d3a28` stray comment (Jan 16, 2026)
- `724ca97` check regression (Jan 16, 2026)
- `a2c4a5a` centralize Montgomery conversion in filtering function (Jan 16, 2026)
- `4a59df3` restore iterative consume_point_schedule (cleaner than recursive) (Jan 16, 2026)
- `129eb22` iterative (Jan 17, 2026)
- `1200dab` more docs and renaming (Jan 19, 2026)
- `b074916` brush up tests (Jan 19, 2026)
- `f9e088b` another docs iteration (Jan 19, 2026)
- `7fe4f71` docs+naming (Jan 19, 2026)
- `6ac8e94` clean up processing functions (Jan 19, 2026)
- `9ba1080` better org (Jan 19, 2026)
- `50c6f88` fix docs discrepancies (Jan 19, 2026)
- `3e33312` make docs concise (Jan 19, 2026)
- `de82341` upd hpp (Jan 19, 2026)
- `8dc83f7` fix build, fix montgomery conversion regression (Jan 19, 2026)
- `806e2de` rm funny inclusion (Jan 19, 2026)
- `53b6501` Merge branch 'merge-train/barretenberg' into si/pippenger-audit-0 (Jan 20, 2026)
- `ff7f410` fix ivc integration test? (Jan 20, 2026)
- `256770d` change bench script (Jan 20, 2026)
- `108da69` fix multithreading (Jan 20, 2026)
- `0aaa930` rm benches (Jan 20, 2026)
- `40de9d5` fix perf regression (Jan 20, 2026)
- `f1eff36` md fix (Jan 20, 2026)
- `15b9521` fix build (Jan 20, 2026)
- `113a58a` Merge remote-tracking branch 'origin/merge-train/barretenberg' into s… (Jan 21, 2026)
- `e5d0055` move scalar slicing back to pippenger (Jan 23, 2026)
- `65c92dc` address more comments (Jan 23, 2026)
- `6c3dcfa` Merge remote-tracking branch 'origin/merge-train/barretenberg' into s… (Jan 23, 2026)
8 changes: 8 additions & 0 deletions barretenberg/cpp/CLAUDE.md
@@ -2,6 +2,14 @@ succint aztec-packages cheat sheet.

 THE PROJECT ROOT IS AT TWO LEVELS ABOVE THIS FOLDER. Typically, the repository is at ~/aztec-packages. all advice is from the root.

+# Git workflow for barretenberg
+
+**IMPORTANT**: When comparing branches or looking at diffs for barretenberg work, use `merge-train/barretenberg` as the base branch, NOT `master`. The master branch is often outdated for barretenberg development.
+
+Examples:
+- `git diff merge-train/barretenberg...HEAD` (not `git diff master...HEAD`)
+- `git log merge-train/barretenberg..HEAD` (not `git log master..HEAD`)
+
 Run ./bootstrap.sh at the top-level to be sure the repo fully builds.
 Bootstrap scripts can be called with relative paths e.g. ../barretenberg/bootstrap.sh
@@ -16,7 +16,7 @@ PRESET=${3:-clang20}
 BUILD_DIR=${4:-build}
 HARDWARE_CONCURRENCY=${HARDWARE_CONCURRENCY:-16}

-BASELINE_BRANCH="master"
+BASELINE_BRANCH="${BASELINE_BRANCH:-merge-train/barretenberg}"
 BENCH_TOOLS_DIR="$BUILD_DIR/_deps/benchmark-src/tools"

 if [ ! -z "$(git status --untracked-files=no --porcelain)" ]; then
183 changes: 183 additions & 0 deletions barretenberg/cpp/src/barretenberg/ecc/scalar_multiplication/README.md
@@ -0,0 +1,183 @@
# Pippenger Multi-Scalar Multiplication (MSM)

## Overview

The Pippenger algorithm computes multi-scalar multiplications:

$$\text{MSM}(\vec{s}, \vec{P}) = \sum_{i=0}^{n-1} s_i \cdot P_i$$

**Complexity**: Let $q = \lceil \log_2(\text{field modulus}) \rceil$ be the scalar bit-length, $|A|$ the cost of a group addition, and $|D|$ the cost of a doubling.

- **Pippenger**: $O\left(\frac{q}{c} \cdot \left((n + 2^c) \cdot |A| + c \cdot |D|\right)\right)$
- **Naive**: $O(n \cdot q \cdot |D| + n \cdot q \cdot |A| / 2)$

With $c \approx \frac{1}{2} \log_2 n$, Pippenger achieves roughly $O(n \cdot q / \log n)$ vs $O(n \cdot q)$ for naive scalar multiplication.
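The brute-force search mentioned below (in `get_optimal_log_num_buckets`) can be sketched with a deliberately simplified cost model. The names `pippenger_cost` and `optimal_slice_bits` are illustrative, not the library's API, and the model here weighs additions and doublings equally, which the real cost model does not:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the real cost model: per round, (n + 2^c) additions
// for bucket accumulation + reduction, plus c doublings for round combination.
inline size_t pippenger_cost(size_t n, size_t q, size_t c)
{
    const size_t rounds = (q + c - 1) / c;
    return rounds * (n + (size_t(1) << c) + c);
}

// Brute-force search over candidate slice widths, as the README describes.
inline size_t optimal_slice_bits(size_t n, size_t q)
{
    size_t best_c = 1;
    for (size_t c = 2; c <= 20; ++c) { // upper bound mirrors MAX_SLICE_BITS
        if (pippenger_cost(n, q, c) < pippenger_cost(n, q, best_c)) {
            best_c = c;
        }
    }
    return best_c;
}
```

Even this crude model reproduces the qualitative behavior: the optimal slice width grows roughly with $\log_2 n$.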

## Algorithm

### Step 1: Scalar Decomposition

**Implementation**: `get_scalar_slice(scalar, round_index, bits_per_slice)`

Each scalar $s_i$ is decomposed into $r$ slices of $c$ bits each, processed **MSB-first**:

$$s_i = \sum_{j=0}^{r-1} s_i^{(j)} \cdot 2^{c(r-1-j)}$$

- $c$ = bits per slice (from `get_optimal_log_num_buckets`, which brute-force searches for minimum cost)
- $r = \lceil $ `NUM_BITS_IN_FIELD` $/ c \rceil$ = number of rounds
- Round 0 extracts the most significant bits

### Step 2: Bucket Accumulation

For each round $j$, points are added into **buckets** based on their scalar slice. Bucket $k$ accumulates all points whose slice value equals $k$:

$$B_k^{(j)} = \sum_{\{i : s_i^{(j)} = k\}} P_i$$

**Two implementation paths:**

- **Affine**: Sorts points by bucket and uses batched affine additions
- **Jacobian**: Direct bucket accumulation in Jacobian coordinates

### Step 3: Bucket Reduction

**Implementation**: `accumulate_buckets(bucket_accumulators)`

Computes weighted sum using a suffix sum (high to low):

$$R^{(j)} = \sum_{k=1}^{2^c - 1} k \cdot B_k^{(j)} = \sum_{k=1}^{2^c - 1} \left( \sum_{m=k}^{2^c - 1} B_m^{(j)} \right)$$

An offset generator is added and subtracted to avoid rare accumulator edge cases—a probabilistic mitigation that simplifies accumulation logic.

### Step 4: Round Combination

Combines all rounds using Horner's method (MSB-first):

```cpp
msm_accumulator = point_at_infinity
for j = 0 to r-1:
    repeat c doublings (or fewer for final round)
    msm_accumulator += bucket_result[j]
```

## Algorithm Variants

### Entry Points and Safety

| Entry Point | Default | Safety |
|-------------|---------|--------|
| `msm()` | `handle_edge_cases=false` | ⚠️ **Unsafe** |
| `pippenger()` | `handle_edge_cases=true` | ✓ Safe |
| `pippenger_unsafe()` | `handle_edge_cases=false` | ⚠️ Unsafe |
| `batch_multi_scalar_mul()` | `handle_edge_cases=true` | ✓ Safe |

### Edge Cases

Affine addition fails for **P = Q** (doubling), **P = −Q** (inverse), and **P = O** (identity). Jacobian coordinates handle these correctly at higher cost (~2-3× slower).

⚠️ **Use `msm()` or `pippenger_unsafe()` only when points are guaranteed linearly independent** (e.g., SRS points). For user-controlled or potentially duplicate points, use `pippenger()`.

### Affine Pippenger (`handle_edge_cases=false`)

Uses affine coordinates with Montgomery's batch inversion trick: replaces $m$ inversions with **1 inversion + O(m) multiplications**, yielding ~2-3× speedup over Jacobian.

### Jacobian Pippenger (`handle_edge_cases=true`)

Uses Jacobian coordinates for bucket accumulators. Handles all edge cases correctly.

## Tuning Constants

| Constant | Value | Purpose |
|----------|-------|---------|
| `PIPPENGER_THRESHOLD` | 16 | Below this, use naive scalar multiplication |
| `AFFINE_TRICK_THRESHOLD` | 128 | Below this, batch inversion overhead exceeds savings |
| `MAX_SLICE_BITS` | 20 | Upper bound on bucket count exponent |
| `BATCH_SIZE` | 2048 | Points per batch inversion (fits L2 cache) |
| `RADIX_BITS` | 8 | Bits per radix sort pass |

<details>
<summary>Cost model constants and derivations</summary>

| Constant | Value | Derivation |
|----------|-------|------------|
| `BUCKET_ACCUMULATION_COST` | 5 | 2 Jacobian adds/bucket × 2.5× cost ratio |
| `AFFINE_TRICK_SAVINGS_PER_OP` | 5 | ~10 muls saved − ~3 muls for product tree |
| `JACOBIAN_Z_NOT_ONE_PENALTY` | 5 | Extra field ops when Z ≠ 1 |
| `INVERSION_TABLE_COST` | 14 | 4-bit lookup table for modular exp |

**BATCH_SIZE=2048**: Each `AffineElement` is 64 bytes. 2048 points = 128 KB, fitting in L2 cache.

**RADIX_BITS=8**: 256 radix buckets × 4 bytes = 1 KB counting array, fits in L1 cache.

</details>

## Implementation Notes

### Zero Scalar Filtering

`transform_scalar_and_get_nonzero_scalar_indices` filters out zero scalars before processing (since $0 \cdot P_i = \mathcal{O}$). Scalars are converted from Montgomery form in-place to avoid doubling memory usage.
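The filtering half of that routine amounts to the following sketch (illustrative only; the real function also performs the in-place Montgomery conversion and operates on field scalars):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Collect indices of nonzero scalars so later stages only touch points that
// actually contribute to the sum (0 * P_i is the identity).
inline std::vector<uint32_t> get_nonzero_scalar_indices(const std::vector<uint64_t>& scalars)
{
    std::vector<uint32_t> indices;
    for (uint32_t i = 0; i < static_cast<uint32_t>(scalars.size()); ++i) {
        if (scalars[i] != 0) {
            indices.push_back(i);
        }
    }
    return indices;
}
```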

### Bucket Existence Tracking

A `BitVector` bitmap tracks which buckets are populated, avoiding expensive full-array clears between rounds. Clearing the bitmap costs $O(2^c / 64)$ words vs $O(2^c)$ for the full bucket array.
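The idea can be sketched as below (a standalone illustration, not the library's `BitVector` API): occupancy bits are packed 64 per word, so resetting between rounds touches $2^c / 64$ words instead of $2^c$ bucket entries:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Word-packed occupancy bitmap for bucket accumulators.
struct BucketBitmap {
    std::vector<uint64_t> words;
    explicit BucketBitmap(size_t num_buckets) : words((num_buckets + 63) / 64, 0) {}
    void set(size_t i) { words[i / 64] |= (uint64_t(1) << (i % 64)); }
    bool get(size_t i) const { return ((words[i / 64] >> (i % 64)) & 1) != 0; }
    // Per-round reset: O(2^c / 64) word writes, vs O(2^c) to clear the buckets.
    void clear() { std::fill(words.begin(), words.end(), 0); }
};
```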

### Point Scheduling (Affine Variant Only)

Entries are packed as `(point_index << 32) | bucket_index` into 64-bit values. Since bucket indices fit in $c$ bits (typically 8-16), they occupy only the lowest bits of the packed entry. An **in-place MSD radix sort** on the low $c$ bits groups points by bucket for efficient batch processing. The sort also detects entries with `bucket_index == 0` during the final radix pass, allowing zero-bucket entries to be skipped without a separate scan.
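The packing scheme stated above can be written out directly (helper names are illustrative; `BUCKET_INDEX_MASK` mirrors the lower-32-bit mask used by the sort):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t BUCKET_INDEX_MASK = 0xffffffffULL;

// Pack a schedule entry: point index in the high 32 bits, bucket index low.
inline uint64_t make_schedule_entry(uint32_t point_index, uint32_t bucket_index)
{
    return (uint64_t(point_index) << 32) | bucket_index;
}

inline uint32_t get_point_index(uint64_t entry) { return uint32_t(entry >> 32); }
inline uint32_t get_bucket_index(uint64_t entry) { return uint32_t(entry & BUCKET_INDEX_MASK); }
```

Because the sort key (the bucket index) sits in the low bits, radix-sorting on only the low $c$ bits groups points by bucket while carrying the point index along for free.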

### Batched Affine Addition

`batch_accumulate_points_into_buckets` processes sorted points iteratively:
- Same-bucket pairs → queue for batch addition
- Different buckets → cache in bucket or queue with existing accumulator
- Uses branchless conditional moves to minimize pipeline stalls
- Prefetches future points to hide memory latency
- Recirculates results to maximize batch efficiency before writing to buckets

<details>
<summary>Batch accumulation case analysis</summary>

| Condition | Action | Iterator Update |
|-----------|--------|-----------------|
| `bucket[i] == bucket[i+1]` | Queue both points for batch add | `point_it += 2` |
| Different buckets, accumulator exists | Queue point + accumulator | `point_it += 1` |
| Different buckets, no accumulator | Cache point into bucket | `point_it += 1` |

After batch addition, results targeting the same bucket are paired again before writing to bucket accumulators, reducing random memory access by ~50%.

</details>

## Parallelization

Uses **per-thread buffers** (bucket accumulators, scratch space) to eliminate contention.

For `batch_multi_scalar_mul()`, work is distributed via `MSMWorkUnit` structures that can split a single MSM across multiple threads. Each thread computes partial results on point subsets, combined in a final reduction.

<details>
<summary>Per-call buffer sizes</summary>

| Buffer | Size | Purpose |
|--------|------|---------|
| `BucketAccumulators` (affine) | $2^c \times 64$ bytes | Affine bucket array + bitmap |
| `JacobianBucketAccumulators` | $2^c \times 96$ bytes | Jacobian bucket array + bitmap |
| `AffineAdditionData` | ~400 KB | Scratch for batch inversion |
| `point_schedule` | $n \times 8$ bytes | Per-MSM point schedule |

Buffers are allocated per-call for WASM compatibility. Memory scales with thread count during parallel execution.

</details>

## File Structure

```
scalar_multiplication/
├── scalar_multiplication.hpp # MSM class, data structures
├── scalar_multiplication.cpp # Core algorithm
├── process_buckets.hpp/cpp # Radix sort
├── bitvector.hpp # Bit vector for bucket tracking
└── README.md # This file
```

## References

1. Pippenger, N. (1976). "On the evaluation of powers and related problems"
2. Bernstein, D.J. et al. "Faster batch forgery identification" (batch inversion)
> **Review comment (iakovenkos, author):** added docs, improved the situation with magic numbers

@@ -10,89 +10,97 @@

 namespace bb::scalar_multiplication {

-// NOLINTNEXTLINE(misc-no-recursion) recursion is fine here, max recursion depth is 8 (64 bit int / 8 bits per call)
+// NOLINTNEXTLINE(misc-no-recursion) recursion is fine here, max depth is 4 (32-bit bucket index / 8 bits per call)
 void radix_sort_count_zero_entries(uint64_t* keys,
                                    const size_t num_entries,
                                    const uint32_t shift,
                                    size_t& num_zero_entries,
-                                   const uint32_t total_bits,
-                                   const uint64_t* start_pointer) noexcept
+                                   const uint32_t bucket_index_bits,
+                                   const uint64_t* top_level_keys) noexcept
 {
-    constexpr size_t num_bits = 8;
-    constexpr size_t num_buckets = 1UL << num_bits;
-    constexpr uint32_t mask = static_cast<uint32_t>(num_buckets) - 1U;
-    std::array<uint32_t, num_buckets> bucket_counts{};
+    constexpr size_t NUM_RADIX_BUCKETS = 1UL << RADIX_BITS;
+    constexpr uint32_t RADIX_MASK = static_cast<uint32_t>(NUM_RADIX_BUCKETS) - 1U;
+
+    // Step 1: Count entries in each radix bucket
+    std::array<uint32_t, NUM_RADIX_BUCKETS> bucket_counts{};
     for (size_t i = 0; i < num_entries; ++i) {
-        bucket_counts[(keys[i] >> shift) & mask]++;
+        bucket_counts[(keys[i] >> shift) & RADIX_MASK]++;
     }

-    std::array<uint32_t, num_buckets + 1> offsets;
-    std::array<uint32_t, num_buckets + 1> offsets_copy;
+    // Step 2: Convert counts to cumulative offsets (prefix sum)
+    std::array<uint32_t, NUM_RADIX_BUCKETS + 1> offsets;
+    std::array<uint32_t, NUM_RADIX_BUCKETS + 1> offsets_copy;
     offsets[0] = 0;

-    for (size_t i = 0; i < num_buckets - 1; ++i) {
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS - 1; ++i) {
         bucket_counts[i + 1] += bucket_counts[i];
     }
-    if ((shift == 0) && (keys == start_pointer)) {
+
+    // Count zero entries only at the final recursion level (shift == 0) and only for the full array
+    if ((shift == 0) && (keys == top_level_keys)) {
         num_zero_entries = bucket_counts[0];
     }
-    for (size_t i = 1; i < num_buckets + 1; ++i) {
+
+    for (size_t i = 1; i < NUM_RADIX_BUCKETS + 1; ++i) {
         offsets[i] = bucket_counts[i - 1];
     }
-    for (size_t i = 0; i < num_buckets + 1; ++i) {
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS + 1; ++i) {
         offsets_copy[i] = offsets[i];
     }
-    uint64_t* start = &keys[0];

-    for (size_t i = 0; i < num_buckets; ++i) {
+    // Step 3: In-place permutation using cycle sort
+    // For each radix bucket, repeatedly swap elements to their correct positions until all elements
+    // in that bucket's range belong there. The offsets array tracks the next write position for each bucket.
+    uint64_t* start = &keys[0];
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS; ++i) {
         uint64_t* bucket_start = &keys[offsets[i]];
         const uint64_t* bucket_end = &keys[offsets_copy[i + 1]];
         while (bucket_start != bucket_end) {
             for (uint64_t* it = bucket_start; it < bucket_end; ++it) {
-                const size_t value = (*it >> shift) & mask;
+                const size_t value = (*it >> shift) & RADIX_MASK;
                 const uint64_t offset = offsets[value]++;
                 std::iter_swap(it, start + offset);
             }
             bucket_start = &keys[offsets[i]];
         }
     }

+    // Step 4: Recursively sort each bucket by the next less-significant byte
     if (shift > 0) {
-        for (size_t i = 0; i < num_buckets; ++i) {
-            if (offsets_copy[i + 1] - offsets_copy[i] > 1) {
-                radix_sort_count_zero_entries(&keys[offsets_copy[i]],
-                                              offsets_copy[i + 1] - offsets_copy[i],
-                                              shift - 8,
-                                              num_zero_entries,
-                                              total_bits,
-                                              keys);
+        for (size_t i = 0; i < NUM_RADIX_BUCKETS; ++i) {
+            const size_t bucket_size = offsets_copy[i + 1] - offsets_copy[i];
+            if (bucket_size > 1) {
+                radix_sort_count_zero_entries(
+                    &keys[offsets_copy[i]], bucket_size, shift - RADIX_BITS, num_zero_entries, bucket_index_bits, keys);
             }
         }
     }
 }

-size_t process_buckets_count_zero_entries(uint64_t* wnaf_entries,
-                                          const size_t num_entries,
-                                          const uint32_t num_bits) noexcept
+size_t sort_point_schedule_and_count_zero_buckets(uint64_t* point_schedule,
+                                                  const size_t num_entries,
+                                                  const uint32_t bucket_index_bits) noexcept
 {
     if (num_entries == 0) {
         return 0;
     }
-    const uint32_t bits_per_round = 8;
-    const uint32_t base = num_bits & 7;
-    const uint32_t total_bits = (base == 0) ? num_bits : num_bits - base + 8;
-    const uint32_t shift = total_bits - bits_per_round;
+
+    // Round bucket_index_bits up to next multiple of RADIX_BITS for proper MSD radix sort alignment.
+    // E.g., if bucket_index_bits=10, we need to start sorting from bit 16 (2 bytes) not bit 10.
+    const uint32_t remainder = bucket_index_bits % RADIX_BITS;
+    const uint32_t padded_bits = (remainder == 0) ? bucket_index_bits : bucket_index_bits - remainder + RADIX_BITS;
+    const uint32_t initial_shift = padded_bits - RADIX_BITS;

     size_t num_zero_entries = 0;
-    radix_sort_count_zero_entries(wnaf_entries, num_entries, shift, num_zero_entries, num_bits, wnaf_entries);
-
-    // inside radix_sort_count_zero_entries, if the least significant *byte* of `wnaf_entries[0] == 0`,
-    // then num_nonzero_entries = number of entries that share the same value as wnaf_entries[0].
-    // If wnaf_entries[0] != 0, we must manually set num_zero_entries = 0
-    if (num_entries > 0) {
-        if ((wnaf_entries[0] & 0xffffffff) != 0) {
-            num_zero_entries = 0;
-        }
-    }
+    radix_sort_count_zero_entries(
+        point_schedule, num_entries, initial_shift, num_zero_entries, bucket_index_bits, point_schedule);
+
+    // The radix sort counts entries where the least significant BYTE is zero, but we need entries where
+    // the entire bucket_index (lower 32 bits) is zero. Verify the first entry after sorting.
+    if ((point_schedule[0] & BUCKET_INDEX_MASK) != 0) {
+        num_zero_entries = 0;
+    }
+
     return num_zero_entries;
 }

 } // namespace bb::scalar_multiplication