Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

1ebe58d
Open

UPSTREAM PR #18785: CUDA: Factor out and re-use two_stage_warp_reduce function #897

LOCI Review / Performance Review #897 succeeded Jan 12, 2026

Performance unchanged

0 binaries improved · 0 binaries unchanged · 0 binaries stable ~ within threshold · 0 binaries degraded ~ beyond threshold

Binary | Δ % Response | Δ % Throughput | Performance (based on response time)

Performance threshold: 30%
Default configuration used.
Note: Performance status is evaluated only from Δ% Response. Throughput is displayed for reference.
