Fix performance regression in bind group processing #8519

Merged
cwfitzgerald merged 1 commit into gfx-rs:trunk from andyleiserson:binding-perf
Nov 14, 2025

@andyleiserson (Contributor) commented Nov 13, 2025

In the change to defer some bind group processing until draw/dispatch (#8418), I was not careful enough to avoid extra work on this hot path.

This makes two fixes:

  1. Only processes initialization actions when the bindings have actually changed.
  2. Reuses the same usage scope for the entire compute pass, because the cost of creating and destroying the scope is significant (even with the pool -- clearing a Vec<Option<Arc<T>>> is expensive even if we aren't immediately freeing the memory). This fix is largely a revert.
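The two fixes above can be illustrated with a minimal sketch. This is not the wgpu-core implementation: `Resource`, `UsageScope`, `ComputePassState`, and every field and method name below are hypothetical stand-ins chosen for illustration. The sketch assumes a dirty flag is set by set_bind_group and consumed by dispatch, and that a single scope is allocated once per pass rather than once per dispatch.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a GPU resource; wgpu-core's real types differ.
type Resource = Arc<u32>;

/// Hypothetical usage scope: one slot per tracked resource index.
struct UsageScope {
    slots: Vec<Option<Resource>>,
}

impl UsageScope {
    fn new(size: usize) -> Self {
        Self { slots: vec![None; size] }
    }

    fn track(&mut self, index: usize, resource: &Resource) {
        self.slots[index] = Some(Arc::clone(resource));
    }
}

/// Hypothetical per-pass state. The scope lives as long as the pass,
/// so it is never cleared or reallocated between dispatches.
struct ComputePassState {
    scope: UsageScope,
    bound: Vec<(usize, Resource)>,
    bindings_dirty: bool,
    init_actions_recorded: usize,
}

impl ComputePassState {
    fn new(resource_count: usize) -> Self {
        Self {
            scope: UsageScope::new(resource_count),
            bound: Vec::new(),
            bindings_dirty: false,
            init_actions_recorded: 0,
        }
    }

    /// Cheap: only remembers the bindings and marks them as changed.
    fn set_bind_group(&mut self, resources: Vec<(usize, Resource)>) {
        self.bound = resources;
        self.bindings_dirty = true;
    }

    /// Deferred work runs here, but only if the bindings changed
    /// since the previous dispatch.
    fn dispatch(&mut self) {
        if self.bindings_dirty {
            for (index, resource) in &self.bound {
                self.scope.track(*index, resource);
                self.init_actions_recorded += 1;
            }
            self.bindings_dirty = false;
        }
        // Encoding of the dispatch itself would follow here.
    }
}
```

In this sketch, two consecutive dispatches with no intervening set_bind_group record init actions only once, and no per-dispatch pass over the slot vector is needed because the scope is reused for the whole pass.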

Fixes #8499.
Fixes #8500.

Testing
Using wgpu-benchmark. This recovers most of the lost performance. There is still a significant (10-50%) drop in the submit time benchmarks, but that figure is a bit misleading because the submission can't happen separately from encoding. There's also still a drop of ~10% in the compute pass encode benchmark, which is more than I would like, but I haven't been able to identify any specific changes to mitigate it. The performance drop seems to be associated with moving the init tracking from set_bind_group to dispatch, even though the amount of work is not changing (there is one set_bind_group per dispatch in this case). Unfortunately, deferring recording of init actions until we're certain we're actually using the surfaces in the dispatch is important. (Although we now have a check at submit time that the bind groups are still valid, this doesn't handle the case where the resources are alive at submit and then destroyed while the submission is in flight -- destroy() only checks the tracker for presence of the resource, not the bind group.)

The computepass bindless benchmark isn't supported on my test system, so it is probably worth finding somewhere we can verify that one as well.

Squash or Rebase? Squash

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@cwfitzgerald cwfitzgerald self-assigned this Nov 13, 2025
@andyleiserson (Contributor, Author)

In case it is useful, here is the diff from before the original change, to the version in this PR: b3d9431...andyleiserson:wgpu:binding-perf-alt

@cwfitzgerald (Member) commented Nov 13, 2025

Ran the benchmarks with a fairly long run time to try to get better information. This is the current benchmark report from v27 to the tip of this PR, with multi-threaded tests removed. The regressions seem fairly minor, the worst being ~12% in renderpass submit, which is acceptable.

Will review this after work.

Results
Gnuplot not found, using plotters backend
Benchmarking Bind Group Creation/5 Element Bind Group: Warming up for 5.0000 s
AdapterInfo { name: "NVIDIA GeForce RTX 4070", vendor: 4318, device: 10118, device_type: DiscreteGpu, device_pci_bus_id: "0000:01:00.0", driver: "NVIDIA", driver_info: "581.80", backend: Vulkan, transient_saves_memory: false }
Bind Group Creation/5 Element Bind Group
                        time:   [1.2283 µs 1.2291 µs 1.2299 µs]
                        thrpt:  [4.0655 Melem/s 4.0681 Melem/s 4.0705 Melem/s]
                 change:
                        time:   [−1.0122% −0.7749% −0.5128%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5154% +0.7810% +1.0225%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
Bind Group Creation/50 Element Bind Group
                        time:   [4.4265 µs 4.4275 µs 4.4285 µs]
                        thrpt:  [11.291 Melem/s 11.293 Melem/s 11.296 Melem/s]
                 change:
                        time:   [−0.2225% +0.0373% +0.2809%] (p = 0.78 > 0.05)
                        thrpt:  [−0.2801% −0.0373% +0.2230%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Bind Group Creation/500 Element Bind Group
                        time:   [35.177 µs 35.203 µs 35.229 µs]
                        thrpt:  [14.193 Melem/s 14.203 Melem/s 14.214 Melem/s]
                 change:
                        time:   [−0.0605% +0.2644% +0.5828%] (p = 0.09 > 0.05)
                        thrpt:  [−0.5794% −0.2637% +0.0605%]
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
Bind Group Creation/5000 Element Bind Group
                        time:   [425.58 µs 425.67 µs 425.76 µs]
                        thrpt:  [11.744 Melem/s 11.746 Melem/s 11.749 Melem/s]
                 change:
                        time:   [−0.2531% −0.0013% +0.3050%] (p = 1.00 > 0.05)
                        thrpt:  [−0.3040% +0.0013% +0.2537%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking Bind Group Creation/50000 Element Bind Group: Warming up for 5.0000 s
Warning: Unable to complete 100 samples in 20.0s. You may wish to increase target time to 32.7s, enable flat sampling, or reduce sample count to 50.
Bind Group Creation/50000 Element Bind Group
                        time:   [5.9898 ms 5.9979 ms 6.0062 ms]
                        thrpt:  [8.3248 Melem/s 8.3362 Melem/s 8.3475 Melem/s]
                 change:
                        time:   [−4.6312% −4.1938% −3.7629%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9101% +4.3774% +4.8561%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 5.0000 s
AdapterInfo { name: "NVIDIA GeForce RTX 4070", vendor: 4318, device: 10118, device_type: DiscreteGpu, device_pci_bus_id: "0000:01:00.0", driver: "NVIDIA", driver_info: "581.80", backend: Vulkan, transient_saves_memory: false }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [10.688 ms 10.739 ms 10.792 ms]
                        thrpt:  [926.58 Kelem/s 931.18 Kelem/s 935.59 Kelem/s]
                 change:
                        time:   [+0.9303% +1.6338% +2.3507%] (p = 0.00 < 0.05)
                        thrpt:  [−2.2967% −1.6075% −0.9217%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [11.883 ms 12.062 ms 12.287 ms]
                        thrpt:  [813.86 Kelem/s 829.04 Kelem/s 841.55 Kelem/s]
                 change:
                        time:   [+7.0157% +8.6598% +10.602%] (p = 0.00 < 0.05)
                        thrpt:  [−9.5854% −7.9697% −6.5558%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [12.308 ms 12.401 ms 12.500 ms]
                        thrpt:  [799.97 Kelem/s 806.38 Kelem/s 812.50 Kelem/s]
                 change:
                        time:   [+5.2772% +6.2344% +7.2226%] (p = 0.00 < 0.05)
                        thrpt:  [−6.7361% −5.8685% −5.0127%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [13.882 ms 14.019 ms 14.165 ms]
                        thrpt:  [705.98 Kelem/s 713.31 Kelem/s 720.35 Kelem/s]
                 change:
                        time:   [+7.9814% +9.2957% +10.536%] (p = 0.00 < 0.05)
                        thrpt:  [−9.5315% −8.5051% −7.3915%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  16 (16.00%) high mild
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [837.19 µs 846.16 µs 856.09 µs]
                        thrpt:  [11.681 Melem/s 11.818 Melem/s 11.945 Melem/s]
                 change:
                        time:   [+19.326% +21.676% +23.886%] (p = 0.00 < 0.05)
                        thrpt:  [−19.281% −17.815% −16.196%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [1.7312 ms 1.7428 ms 1.7552 ms]
                        thrpt:  [5.6973 Melem/s 5.7380 Melem/s 5.7764 Melem/s]
                 change:
                        time:   [+13.768% +14.758% +15.772%] (p = 0.00 < 0.05)
                        thrpt:  [−13.624% −12.860% −12.102%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [2.0188 ms 2.0325 ms 2.0470 ms]
                        thrpt:  [4.8851 Melem/s 4.9202 Melem/s 4.9534 Melem/s]
                 change:
                        time:   [+14.020% +15.109% +16.149%] (p = 0.00 < 0.05)
                        thrpt:  [−13.904% −13.126% −12.296%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [2.4577 ms 2.4750 ms 2.4933 ms]
                        thrpt:  [4.0107 Melem/s 4.0403 Melem/s 4.0689 Melem/s]
                 change:
                        time:   [+13.193% +14.240% +15.316%] (p = 0.00 < 0.05)
                        thrpt:  [−13.282% −12.465% −11.655%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

Benchmarking Renderpass: Bindless/10000 draws: Warming up for 5.0000 s
Warning: Unable to complete 100 samples in 20.0s. You may wish to increase target time to 27.1s, enable flat sampling, or reduce sample count to 60.
Renderpass: Bindless/10000 draws
                        time:   [3.4756 ms 3.4861 ms 3.4968 ms]
                        thrpt:  [2.8598 Melem/s 2.8685 Melem/s 2.8772 Melem/s]
                 change:
                        time:   [+3.7507% +5.6400% +7.3186%] (p = 0.00 < 0.05)
                        thrpt:  [−6.8195% −5.3389% −3.6151%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Renderpass: Empty Submit with 90000 Resources
                        time:   [29.006 µs 29.126 µs 29.255 µs]
                        change: [−0.7769% +0.1332% +1.0426%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Benchmarking Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time): Warming up for 5.0000 s
AdapterInfo { name: "NVIDIA GeForce RTX 4070", vendor: 4318, device: 10118, device_type: DiscreteGpu, device_pci_bus_id: "0000:01:00.0", driver: "NVIDIA", driver_info: "581.80", backend: Vulkan, transient_saves_memory: false }
Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
                        time:   [8.6534 ms 8.7157 ms 8.7814 ms]
                        thrpt:  [1.1388 Melem/s 1.1474 Melem/s 1.1556 Melem/s]
                 change:
                        time:   [+1.7984% +2.9928% +4.1230%] (p = 0.00 < 0.05)
                        thrpt:  [−3.9598% −2.9059% −1.7666%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
                        time:   [9.0380 ms 9.1031 ms 9.1736 ms]
                        thrpt:  [1.0901 Melem/s 1.0985 Melem/s 1.1064 Melem/s]
                 change:
                        time:   [+1.8669% +3.0172% +4.1976%] (p = 0.00 < 0.05)
                        thrpt:  [−4.0285% −2.9289% −1.8327%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  11 (11.00%) high mild
  2 (2.00%) high severe
Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
                        time:   [9.6201 ms 9.6922 ms 9.7689 ms]
                        thrpt:  [1.0237 Melem/s 1.0318 Melem/s 1.0395 Melem/s]
                 change:
                        time:   [+0.7233% +2.1875% +3.5679%] (p = 0.00 < 0.05)
                        thrpt:  [−3.4450% −2.1406% −0.7181%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  13 (13.00%) high mild
  1 (1.00%) high severe
Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
                        time:   [3.0840 ms 3.1300 ms 3.1825 ms]
                        thrpt:  [3.1422 Melem/s 3.1949 Melem/s 3.2426 Melem/s]
                 change:
                        time:   [+10.980% +12.990% +15.102%] (p = 0.00 < 0.05)
                        thrpt:  [−13.120% −11.496% −9.8940%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
                        time:   [3.2716 ms 3.3072 ms 3.3453 ms]
                        thrpt:  [2.9893 Melem/s 3.0237 Melem/s 3.0566 Melem/s]
                 change:
                        time:   [+8.3174% +9.7926% +11.359%] (p = 0.00 < 0.05)
                        thrpt:  [−10.200% −8.9192% −7.6787%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
Computepass: Single Threaded/8 computepasses x 1250 dispatches (Submit Time)
                        time:   [3.6430 ms 3.6814 ms 3.7219 ms]
                        thrpt:  [2.6868 Melem/s 2.7163 Melem/s 2.7450 Melem/s]
                 change:
                        time:   [+8.8186% +10.230% +11.812%] (p = 0.00 < 0.05)
                        thrpt:  [−10.565% −9.2807% −8.1039%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high mild

Computepass: Bindless/1000 dispatch
                        time:   [176.74 ms 177.30 ms 177.89 ms]
                        thrpt:  [5.6216 Kelem/s 5.6401 Kelem/s 5.6581 Kelem/s]
                 change:
                        time:   [+5.3271% +5.8179% +6.2841%] (p = 0.00 < 0.05)
                        thrpt:  [−5.9126% −5.4980% −5.0577%]
                        Performance has regressed.

Computepass: Empty Submit with 60000 Resources
                        time:   [18.833 µs 18.909 µs 18.990 µs]
                        change: [−1.0899% −0.4765% +0.1366%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  10 (10.00%) high mild
  6 (6.00%) high severe

front/shader: naga module bincode decode
                        time:   [736.76 µs 739.69 µs 742.75 µs]
                        thrpt:  [288.02 MiB/s 289.21 MiB/s 290.36 MiB/s]
                 change:
                        time:   [−3.8057% −2.8831% −1.9193%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9568% +2.9687% +3.9563%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
front/shader: wgsl-in   time:   [9.1691 ms 9.2391 ms 9.3403 ms]
                        thrpt:  [22.904 MiB/s 23.155 MiB/s 23.331 MiB/s]
                 change:
                        time:   [+2.9524% +3.8163% +5.0991%] (p = 0.00 < 0.05)
                        thrpt:  [−4.8517% −3.6760% −2.8678%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  21 (21.00%) high severe
front/shader: spv-in    time:   [245.15 µs 246.22 µs 247.32 µs]
                        thrpt:  [30.417 MiB/s 30.553 MiB/s 30.685 MiB/s]
                 change:
                        time:   [−1.9363% −0.9960% −0.0468%] (p = 0.03 < 0.05)
                        thrpt:  [+0.0468% +1.0060% +1.9745%]
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  11 (11.00%) high mild
  7 (7.00%) high severe
Benchmarking front/shader: glsl-in: Warming up for 5.0000 s
Warning: Unable to complete 100 samples in 20.0s. You may wish to increase target time to 21.3s, enable flat sampling, or reduce sample count to 60.
front/shader: glsl-in   time:   [4.1889 ms 4.2060 ms 4.2238 ms]
                        thrpt:  [13.133 MiB/s 13.189 MiB/s 13.243 MiB/s]
                 change:
                        time:   [−2.2174% −1.2927% −0.4221%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4239% +1.3096% +2.2677%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

validate/shader: validation
                        time:   [834.09 µs 837.47 µs 841.04 µs]
                        thrpt:  [254.36 MiB/s 255.45 MiB/s 256.48 MiB/s]
                 change:
                        time:   [−1.6834% −0.9918% −0.2726%] (p = 0.01 < 0.05)
                        thrpt:  [+0.2733% +1.0018% +1.7122%]
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

compact/shader: compact time:   [345.98 µs 347.25 µs 348.58 µs]
                        thrpt:  [430.00 MiB/s 431.65 MiB/s 433.23 MiB/s]
                 change:
                        time:   [+2.8596% +3.6390% +4.5155%] (p = 0.00 < 0.05)
                        thrpt:  [−4.3204% −3.5112% −2.7801%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

back/shader: wgsl-out   time:   [507.79 µs 510.12 µs 512.55 µs]
                        thrpt:  [292.44 MiB/s 293.83 MiB/s 295.18 MiB/s]
                 change:
                        time:   [+0.4815% +1.5469% +2.6249%] (p = 0.00 < 0.05)
                        thrpt:  [−2.5577% −1.5233% −0.4792%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
back/shader: spv-out    time:   [961.00 µs 964.38 µs 967.87 µs]
                        thrpt:  [154.86 MiB/s 155.43 MiB/s 155.97 MiB/s]
                 change:
                        time:   [+2.5181% +3.3043% +4.0728%] (p = 0.00 < 0.05)
                        thrpt:  [−3.9134% −3.1986% −2.4562%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
back/shader: spv-out multiple entrypoints
                        time:   [1.2330 ms 1.2370 ms 1.2410 ms]
                        thrpt:  [120.78 MiB/s 121.17 MiB/s 121.56 MiB/s]
                 change:
                        time:   [−1.0289% +0.3374% +1.4823%] (p = 0.62 > 0.05)
                        thrpt:  [−1.4606% −0.3363% +1.0396%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
back/shader: msl-out    time:   [1.0643 ms 1.0693 ms 1.0744 ms]
                        thrpt:  [139.52 MiB/s 140.18 MiB/s 140.83 MiB/s]
                 change:
                        time:   [−3.3911% −2.7333% −1.9837%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0238% +2.8101% +3.5102%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
back/shader: hlsl-out   time:   [719.54 µs 722.93 µs 726.47 µs]
                        thrpt:  [206.32 MiB/s 207.34 MiB/s 208.31 MiB/s]
                 change:
                        time:   [−1.7680% −1.0135% −0.2399%] (p = 0.01 < 0.05)
                        thrpt:  [+0.2405% +1.0239% +1.7999%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
back/shader: glsl-out multiple entrypoints
                        time:   [723.62 µs 726.92 µs 730.41 µs]
                        thrpt:  [205.21 MiB/s 206.20 MiB/s 207.14 MiB/s]
                 change:
                        time:   [−5.4999% −4.7723% −3.9147%] (p = 0.00 < 0.05)
                        thrpt:  [+4.0742% +5.0115% +5.8200%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
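
A note for reading the report above: each "thrpt" change is the reciprocal of the corresponding "time" change, because throughput is elements divided by time. That is why a +21.676% time change appears as a −17.815% throughput change rather than −21.676%. A small sketch of the conversion, where the function name `throughput_change` is a hypothetical helper and not part of criterion:

```rust
/// Convert a relative time change (e.g. 0.21676 for +21.676%) into the
/// matching relative throughput change, assuming the element count is fixed.
/// Hypothetical helper for illustration; not a criterion API.
fn throughput_change(time_change: f64) -> f64 {
    // new_thrpt / old_thrpt = old_time / new_time = 1 / (1 + time_change)
    1.0 / (1.0 + time_change) - 1.0
}
```

Applied to the first renderpass submit entry, `throughput_change(0.21676)` gives roughly −0.178, in line with the reported −17.815%.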

@cwfitzgerald (Member)

From the numbers, I would call both fixed (adjusted OP).

@cwfitzgerald (Member) left a comment

Thanks!

@cwfitzgerald cwfitzgerald merged commit 71820ee into gfx-rs:trunk Nov 14, 2025
41 checks passed
@andyleiserson andyleiserson deleted the binding-perf branch November 14, 2025 17:48
andyleiserson added a commit to andyleiserson/wgpu that referenced this pull request Nov 24, 2025
Development

Successfully merging this pull request may close these issues.

Large performance regression in many tests between v27
Massive performance regresssion when binding large bind groups

2 participants