feat(SSA): always simplify AsSlice intrinsic in ACIR functions #10279

Open

asterite wants to merge 12 commits into master from ab/always-simplify-as-slice

Conversation

@asterite (Collaborator)

Description

Problem

Resolves #10278

Summary

Just trying this out, out of curiosity, because it's relatively simple to implement.

Additional Context

Documentation

Check one:

  • No documentation needed.
  • Documentation included in this PR.
  • [For Experimental Features] Documentation to be submitted in a separate PR.

PR Checklist

  • I have tested the changes locally.
  • I have formatted the changes with Prettier and/or cargo fmt on default settings.

@github-actions (Contributor)

github-actions bot commented Oct 23, 2025

Changes to circuit sizes

Generated at commit: 4c7d077f383b7cb4e9caff003f31a35ca8d7a344, compared to commit: e34f4ee59422830b75f23c1bbe8aa558f4c3fe13

🧾 Summary (10% most significant diffs)

| Program | ACIR opcodes (+/-) | % | Circuit size (+/-) | % |
|---|---|---|---|---|
| slice_dynamic_insert | -33 ✅ | -84.62% | -119 ✅ | -66.48% |

Full diff report 👇
| Program | ACIR opcodes (+/-) | % | Circuit size (+/-) | % |
|---|---|---|---|---|
| slice_dynamic_insert | 6 (-33) | -84.62% | 60 (-119) | -66.48% |

@github-actions github-actions bot (Contributor) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Test Suite Duration'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 87176a2 | Previous: 2f73a86 | Ratio |
|---|---|---|---|
| test_report_zkpassport_noir-ecdsa_ | 3 s | 2 s | 1.50 |
| test_report_zkpassport_noir_rsa_ | 2 s | 1 s | 2 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @TomAFrench

@github-actions github-actions bot (Contributor) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Execution Time'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| rollup-block-root-single-tx | 0.003 s | 0.002 s | 1.50 |
| rollup-checkpoint-merge | 0.004 s | 0.003 s | 1.33 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @TomAFrench

@asterite (Collaborator, Author)

It seems this introduces some regressions in Brillig:

[image: screenshot showing Brillig bytecode size regressions]

So for now I'll only apply this optimization for ACIR.

That said, if you think this change is generally good for both ACIR and Brillig, it would also mean we don't need to handle AsSlice in ACIR-gen or Brillig-gen anymore, since AsSlice would be guaranteed to always be simplified, making code generation slightly simpler.

@asterite asterite changed the title feat(SSA): always simplify AsSlice intrinsic feat(SSA): always simplify AsSlice intrinsic in ACIR functions Oct 23, 2025
@asterite asterite requested a review from a team October 23, 2025 14:15
@github-actions github-actions bot (Contributor) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Compilation Time'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 5872785 | Previous: ca04e9c | Ratio |
|---|---|---|---|
| private-kernel-reset | 8.442 s | 6.98 s | 1.21 |
| sha512-100-bytes | 1.907 s | 1.58 s | 1.21 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @TomAFrench

@vezenovm vezenovm (Contributor) left a comment

I am unsure whether constant instruction simplification is the right place for this code. Upon inserting an AsSlice we will always generate this alternative code, so at that point this logic may as well just live in SSA gen. If we figure out why there is a Brillig bytecode regression, maybe we could even get rid of the AsSlice intrinsic entirely (and thus the as_slice_length pass as well).

// ```
//
// We don't do this for Brillig because it sometimes leads to more opcodes.
if dfg.runtime().is_acir() {

Could we reduce nesting here with an inverted condition and early return
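The suggested refactor can be illustrated in isolation (a generic sketch, not the actual Noir code; `is_acir`, `array_len`, and the result type are stand-ins for the real DFG types):

```rust
// Sketch of the reviewer's suggestion: instead of wrapping the whole
// simplification body in `if dfg.runtime().is_acir() { ... }`, invert the
// condition and return early, keeping the main logic at one nesting level.
#[derive(Debug, PartialEq)]
enum SimplifyResult {
    Simplified(usize),
    None,
}

fn simplify_as_slice(is_acir: bool, array_len: usize) -> SimplifyResult {
    // Early return replaces one level of nesting.
    if !is_acir {
        return SimplifyResult::None;
    }
    // ... the real simplification would build the MakeArray here ...
    SimplifyResult::Simplified(array_len)
}
```

Both shapes are equivalent; the guard clause just keeps the longer branch flat.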

Instruction::MakeArray { elements, typ: Type::Slice(element_types.clone()) };
let slice_length = dfg.make_constant(length.into(), NumericType::length_type());
let new_slice =
dfg.insert_instruction_and_results(make_array, block, None, call_stack).first();

There is a make_array helper function

@asterite (Collaborator, Author)

> If we figure out why there is a Brillig bytecode regression

I think what happens is that we need code to copy each element of the array into the slice: the bigger the array, the more ops are needed for this. With the current as_slice implementation it's just copying an array into a slice, so a bit more than one instruction.

@asterite asterite added the bench-show Display benchmark results on PR label Oct 24, 2025
@github-actions github-actions bot (Contributor) left a comment

ACVM Benchmarks

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| purely_sequential_opcodes | 262581 ns/iter (± 582) | 263115 ns/iter (± 1711) | 1.00 |
| perfectly_parallel_opcodes | 231922 ns/iter (± 4599) | 232833 ns/iter (± 1560) | 1.00 |
| perfectly_parallel_batch_inversion_opcodes | 2794302 ns/iter (± 1864) | 2796347 ns/iter (± 10387) | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Test Suite Duration

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| test_report_AztecProtocol_aztec-packages_noir-projects_aztec-nr | 117 s | 128 s | 0.91 |
| test_report_AztecProtocol_aztec-packages_noir-projects_noir-contracts | 156 s | 131 s | 1.19 |
| test_report_AztecProtocol_aztec-packages_noir-projects_noir-protocol-circuits_crates_blob | 308 s | 287 s | 1.07 |
| test_report_AztecProtocol_aztec-packages_noir-projects_noir-protocol-circuits_crates_private-kernel-lib | 239 s | 226 s | 1.06 |
| test_report_AztecProtocol_aztec-packages_noir-projects_noir-protocol-circuits_crates_types | 123 s | 125 s | 0.98 |
| test_report_noir-lang_noir-bignum_ | 161 s | 164 s | 0.98 |
| test_report_noir-lang_noir_bigcurve_ | 336 s | 319 s | 1.05 |
| test_report_noir-lang_sha256_ | 14 s | 15 s | 0.93 |
| test_report_noir-lang_sha512_ | 13 s | 14 s | 0.93 |
| test_report_zkpassport_noir-ecdsa_ | 1 s | 1 s | 1 |
| test_report_zkpassport_noir_rsa_ | 1 s | 2 s | 0.50 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Execution Time

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 0.011 s | 0.011 s | 1 |
| private-kernel-reset | 0.146 s | 0.147 s | 0.99 |
| private-kernel-tail | 0.009 s | 0.009 s | 1 |
| rollup-block-root-first-empty-tx | 0.003 s | 0.003 s | 1 |
| rollup-block-root-single-tx | 0.003 s | 0.002 s | 1.50 |
| rollup-block-root | 0.004 s | 0.004 s | 1 |
| rollup-checkpoint-merge | 0.004 s | 0.003 s | 1.33 |
| rollup-checkpoint-root-single-block | 11.4 s | 11.3 s | 1.01 |
| rollup-checkpoint-root | 11.6 s | 11.2 s | 1.04 |
| rollup-root | 0.004 s | 0.004 s | 1 |
| rollup-tx-base-private | 0.299 s | 0.298 s | 1.00 |
| rollup-tx-base-public | 0.239 s | 0.238 s | 1.00 |
| rollup-tx-merge | 0.002 s | 0.002 s | 1 |
| semaphore-depth-10 | 0.009 s | 0.009 s | 1 |
| sha512-100-bytes | 0.049 s | 0.076 s | 0.64 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Compilation Time

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 1.72 s | 1.746 s | 0.99 |
| private-kernel-reset | 7 s | 7.32 s | 0.96 |
| private-kernel-tail | 1.412 s | 1.356 s | 1.04 |
| rollup-block-root-first-empty-tx | 1.41 s | 1.43 s | 0.99 |
| rollup-block-root-single-tx | 1.41 s | 1.34 s | 1.05 |
| rollup-block-root | 1.43 s | 1.43 s | 1 |
| rollup-checkpoint-merge | 1.462 s | 1.452 s | 1.01 |
| rollup-checkpoint-root-single-block | 200 s | 210 s | 0.95 |
| rollup-checkpoint-root | 207 s | 196 s | 1.06 |
| rollup-root | 1.494 s | 1.688 s | 0.89 |
| rollup-tx-base-private | 21.3 s | 18.08 s | 1.18 |
| rollup-tx-base-public | 76.56 s | 77.18 s | 0.99 |
| rollup-tx-merge | 1.382 s | 1.362 s | 1.01 |
| semaphore-depth-10 | 0.803 s | 0.784 s | 1.02 |
| sha512-100-bytes | 1.665 s | 1.907 s | 0.87 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Opcode count

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 14544 opcodes | 14544 opcodes | 1 |
| private-kernel-reset | 70415 opcodes | 70415 opcodes | 1 |
| private-kernel-tail | 11680 opcodes | 11680 opcodes | 1 |
| rollup-block-root-first-empty-tx | 1364 opcodes | 1364 opcodes | 1 |
| rollup-block-root-single-tx | 1048 opcodes | 1048 opcodes | 1 |
| rollup-block-root | 2409 opcodes | 2409 opcodes | 1 |
| rollup-checkpoint-merge | 2130 opcodes | 2130 opcodes | 1 |
| rollup-checkpoint-root-single-block | 962015 opcodes | 962015 opcodes | 1 |
| rollup-checkpoint-root | 963375 opcodes | 963375 opcodes | 1 |
| rollup-root | 2630 opcodes | 2630 opcodes | 1 |
| rollup-tx-base-private | 263908 opcodes | 263908 opcodes | 1 |
| rollup-tx-base-public | 245185 opcodes | 245185 opcodes | 1 |
| rollup-tx-merge | 1486 opcodes | 1486 opcodes | 1 |
| semaphore-depth-10 | 5699 opcodes | 5699 opcodes | 1 |
| sha512-100-bytes | 13173 opcodes | 13173 opcodes | 1 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Artifact Size

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 714.3 KB | 714.3 KB | 1 |
| private-kernel-reset | 1864 KB | 1864 KB | 1 |
| private-kernel-tail | 546.2 KB | 546.2 KB | 1 |
| rollup-block-root-first-empty-tx | 179.6 KB | 179.6 KB | 1 |
| rollup-block-root-single-tx | 177.9 KB | 177.9 KB | 1 |
| rollup-block-root | 257.9 KB | 257.9 KB | 1 |
| rollup-checkpoint-merge | 370.6 KB | 370.6 KB | 1 |
| rollup-checkpoint-root-single-block | 27640.2 KB | 27640.2 KB | 1 |
| rollup-checkpoint-root | 27685.9 KB | 27685.9 KB | 1 |
| rollup-root | 411.4 KB | 411.4 KB | 1 |
| rollup-tx-base-private | 4909.3 KB | 4909.3 KB | 1 |
| rollup-tx-base-public | 4555.4 KB | 4555.4 KB | 1 |
| rollup-tx-merge | 186.1 KB | 186.1 KB | 1 |
| semaphore-depth-10 | 570.9 KB | 570.9 KB | 1 |
| sha512-100-bytes | 506.3 KB | 506.3 KB | 1 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Execution Memory

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 253.63 MB | 253.63 MB | 1 |
| private-kernel-reset | 287.24 MB | 287.24 MB | 1 |
| private-kernel-tail | 243.5 MB | 243.5 MB | 1 |
| rollup-block-root | 337.87 MB | 337.87 MB | 1 |
| rollup-checkpoint-merge | 336.67 MB | 336.67 MB | 1 |
| rollup-checkpoint-root-single-block | 1020 MB | 1020 MB | 1 |
| rollup-checkpoint-root | 1020 MB | 1020 MB | 1 |
| rollup-root | 337.87 MB | 337.87 MB | 1 |
| rollup-tx-base-private | 451.42 MB | 451.42 MB | 1 |
| rollup-tx-base-public | 466.49 MB | 466.49 MB | 1 |
| rollup-tx-merge | 336.13 MB | 336.13 MB | 1 |
| semaphore_depth_10 | 73.7 MB | 73.7 MB | 1 |
| sha512_100_bytes | 71.96 MB | 71.96 MB | 1 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot (Contributor) left a comment

Compilation Memory

Details
| Benchmark suite | Current: cce2476 | Previous: e34f4ee | Ratio |
|---|---|---|---|
| private-kernel-inner | 254.47 MB | 254.47 MB | 1 |
| private-kernel-reset | 493.18 MB | 493.18 MB | 1 |
| private-kernel-tail | 244.63 MB | 244.63 MB | 1 |
| rollup-block-root-first-empty-tx | 339.38 MB | 339.38 MB | 1 |
| rollup-block-root-single-tx | 337.41 MB | 337.41 MB | 1 |
| rollup-block-root | 340.43 MB | 340.43 MB | 1 |
| rollup-checkpoint-merge | 339.76 MB | 339.76 MB | 1 |
| rollup-checkpoint-root-single-block | 5910 MB | 5910 MB | 1 |
| rollup-checkpoint-root | 5920 MB | 5920 MB | 1 |
| rollup-root | 341.31 MB | 341.31 MB | 1 |
| rollup-tx-base-private | 1070 MB | 1070 MB | 1 |
| rollup-tx-base-public | 2890 MB | 2890 MB | 1 |
| rollup-tx-merge | 336.59 MB | 336.59 MB | 1 |
| semaphore_depth_10 | 92.18 MB | 92.18 MB | 1 |
| sha512_100_bytes | 185.49 MB | 185.48 MB | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@vezenovm (Contributor)

> I think what happens is that we need code to copy each element of the array into the slice. The bigger the array, the more ops are needed for this. While with the current as_slice implementation it's just copying an array into a slice, so a bit more than one instruction.

Yeah I was assuming there was some additional overhead from array gets + make_array vs. as_slice in Brillig gen. The regressions in #10279 (comment) don't look horrible to be honest. Maybe we are ok with just eating that overhead in exchange for simplifying the compiler. I do think this ACIR change can be moved to SSA gen either way though.

@asterite (Collaborator, Author)

> I do think this ACIR change can be moved to SSA gen either way though.

Do you mean that, instead of simplifying AsSlice, we handle it during SSA gen and directly output this chain of instructions?

> The regressions in #10279 (comment) don't look horrible to be honest

In that program I tried changing the input array: the larger it is, the larger the Brillig code gets. For example, right now it's 238 instructions with 20 elements, but if we bump it to 200 elements it's 1138 instructions. So I'm not sure we should go forward with this; I don't know whether test_programs are representative of real programs.
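Taking the two data points in that comment at face value, the growth looks linear at roughly 5 extra Brillig instructions per array element (a back-of-the-envelope fit from just those two measurements, not a measured cost model):

```rust
// Rough linear fit through the two reported points:
// (20 elements, 238 instructions) and (200 elements, 1138 instructions).
// slope = (1138 - 238) / (200 - 20) = 5 instructions per element
// intercept = 238 - 5 * 20 = 138 baseline instructions
fn estimated_brillig_instructions(elements: i64) -> i64 {
    5 * elements + 138
}
```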

Ideally, we'd translate as_slice as a series of array_get + make_array. That way we can apply more optimizations. Then, right before Brillig gen, we could try to detect this pattern of array_get + make_array and pack it back into a single as_slice. Or maybe the SSA optimizations could be improved instead.
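As a toy model of that lowering (plain Rust over `i64` values standing in for SSA value IDs, not the actual SSA builder), `as_slice` becomes one `array_get` per element followed by a single `make_array`:

```rust
// Toy model: lowering `as_slice(array)` into per-element reads plus one
// `make_array`, producing the (length, elements) pair the intrinsic yields.
// The instruction count grows linearly: N array_gets + 1 make_array.
fn lower_as_slice(array: &[i64]) -> (usize, Vec<i64>) {
    let mut elements = Vec::with_capacity(array.len());
    for i in 0..array.len() {
        elements.push(array[i]); // models one `array_get` instruction
    }
    (array.len(), elements) // the collected Vec models the single `make_array`
}
```

This is why the Brillig cost scales with the array size, while the intrinsic itself is a constant-size copy.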

The SSA for this program:

fn main(array: [Field; 3]) -> pub Field {
    let slice = array.as_slice();
    let slice = slice.insert(0, 1);
    slice[0]
}

is:

acir(inline) predicate_pure fn main f0 {
  b0(v0: [Field; 3]):
    v2, v3 = call as_slice(v0) -> (u32, [Field])
    v8, v9 = call slice_insert(u32 3, v3, u32 0, Field 1) -> (u32, [Field])
    constrain v8 != u32 0, "Index out of bounds"
    v10 = array_get v9, index u32 0 -> Field
    return v10
}

The last array_get could see that v9 is the result of a slice_insert at position 0. Since it's getting from position 0 it could be optimized to Field 1.
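The rule behind that simplification can be sketched as a pure index-remapping function (an illustration of the idea, not code from the PR): a read at index `i` from the result of inserting `v` at position `p` yields `v` when `i == p`, the original element at `i` when `i < p`, and the original element at `i - 1` when `i > p`.

```rust
// Sketch of forwarding `array_get(slice_insert(s, p, v), i)`:
// reads before the insertion point are unchanged, the insertion point
// itself yields the inserted value, and later reads shift down by one.
#[derive(Debug, PartialEq)]
enum ForwardedRead {
    InsertedValue,        // i == p: the get returns the inserted value directly
    OriginalIndex(usize), // the get can be rewritten as a read of the source
}

fn forward_read(get_index: usize, insert_index: usize) -> ForwardedRead {
    if get_index == insert_index {
        ForwardedRead::InsertedValue
    } else if get_index < insert_index {
        ForwardedRead::OriginalIndex(get_index)
    } else {
        ForwardedRead::OriginalIndex(get_index - 1)
    }
}
```

In the program above the insert is at position 0 and the get is at position 0, so the read resolves to the inserted value, Field 1.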

In the test program slice_dynamic_insert we end up with this SSA:

acir(inline) predicate_pure fn main f0 {
  b0(v0: u32, v1: [Field; 5]):
    v3, v4 = call as_slice(v1) -> (u32, [Field])
    constrain v0 == u32 5
    v9, v10 = call slice_insert(u32 5, v4, u32 3, Field 10) -> (u32, [Field])
    constrain v9 == u32 6
    v13 = array_get v10, index u32 0 -> Field
    constrain v13 == Field 1
    v16 = array_get v10, index u32 1 -> Field
    constrain v16 == Field 2
    v19 = array_get v10, index u32 2 -> Field
    constrain v19 == Field 3
    v21 = array_get v10, index u32 3 -> Field
    constrain v21 == Field 10
    v23 = array_get v10, index u32 4 -> Field
    constrain v23 == Field 4
    v25 = array_get v10, index u32 5 -> Field
    constrain v25 == Field 5
    v27 = array_set v1, index u32 4, value Field 10
    v28 = array_get v1, index u32 0 -> Field
    constrain v28 == Field 1
    v29 = array_get v1, index u32 1 -> Field
    constrain v29 == Field 2
    v30 = array_get v1, index u32 2 -> Field
    constrain v30 == Field 3
    v31 = array_get v1, index u32 3 -> Field
    constrain v31 == Field 4
    v32 = array_get v1, index u32 4 -> Field
    constrain v32 == Field 5
    v33 = array_get v27, index u32 0 -> Field
    constrain v33 == Field 1
    v34 = array_get v27, index u32 1 -> Field
    constrain v34 == Field 2
    v35 = array_get v27, index u32 2 -> Field
    constrain v35 == Field 3
    v36 = array_get v27, index u32 3 -> Field
    constrain v36 == Field 4
    return
}

I didn't think it was the same case as before, but it seems it is! All those array_get on v10 can be optimized into array_get on v1, since we know v10 comes from a slice that had an insert at a specific position. All of them can be optimized: if the index is less than the insertion point it's the same index, if it's equal it's the inserted value, and otherwise it's shifted down by one. In fact, if we look at the SSA after we apply the optimization in this PR, we get this SSA:

acir(inline) predicate_pure fn main f0 {
  b0(v0: u32, v1: [Field; 5]):
    v3 = array_get v1, index u32 0 -> Field
    v5 = array_get v1, index u32 1 -> Field
    v7 = array_get v1, index u32 2 -> Field
    v9 = array_get v1, index u32 3 -> Field
    v11 = array_get v1, index u32 4 -> Field
    constrain v0 == u32 5
    constrain v3 == Field 1
    constrain v5 == Field 2
    constrain v7 == Field 3
    constrain v9 == Field 4
    constrain v11 == Field 5
    v19 = array_set mut v1, index u32 4, value Field 10
    v20 = array_get v19, index u32 0 -> Field
    constrain v20 == Field 1
    v21 = array_get v19, index u32 1 -> Field
    constrain v21 == Field 2
    v22 = array_get v19, index u32 2 -> Field
    constrain v22 == Field 3
    v23 = array_get v19, index u32 3 -> Field
    constrain v23 == Field 4
    return
}

We can see that the array gets that were on v10 are now on v1.

I think I'm going to try this optimization in a separate PR. It seems relatively simple and we don't end up generating a huge list of array_get for as_slice.

@asterite (Collaborator, Author)

Out of curiosity, I tried the above in this PR: #10295

Initially slice_dynamic_insert wasn't optimized as much as in this PR, so I looked at what else we could optimize and found a few more things.

So by now #10295 is a monster PR doing a bit of each of those, just to see how much things can be optimized. I'll probably close this PR, as at this point it might be risky to introduce these changes, but at least I was able to confirm that these optimizations can impact performance.

@vezenovm (Contributor)

vezenovm commented Oct 27, 2025

> Do you mean that, instead of simplifying AsSlice, we handle it during SSA gen and directly output this chain of instructions?

Yes. The reason I say this is that the instruction simplification logic you added operates over all value kinds (constant and dynamic). So when we insert an as_slice in SSA gen, it will immediately be rewritten. At that point we may as well just lower the intrinsic that way in the first place.

> Ideally, we translate as_slice as a series of array_get + make_array. That way we can apply more optimization. Then right before Brillig gen we could try to detect this pattern of array_get + make_array and pack those into a single as_slice. Or maybe SSA optimizations could be better.

I also think it would be ideal to translate as_slice into array_gets + make_array, as we could then remove the SSA intrinsic entirely. I don't like the idea of lowering as_slice and then bringing it back. I would prefer to figure out where the SSA opts could be better.

> I think I'm going to try this optimization in a separate PR. It seems relatively simple and we don't end up generating a huge list of array_get for as_slice.

Yes, this looks like it would be good as a separate optimization.


Labels

bench-show Display benchmark results on PR


Development

Successfully merging this pull request may close these issues.

as_slice could be simplified in more cases

3 participants