perf(helios): replace strided RoPE with stack+flatten for contiguous memory #2474
Conversation
|
can you help us check whether other models can benefit from this change? |
|
@willamhou thanks for doing this! are you able to run performance benchmarks of some kind on this as well? I had looked into it a little yesterday, but didn't see as dramatic a change in helios (using this example at least). It would also be nice to get a feel for how this affects different model architectures, and Helios is a bit of a unique one with some of the pyramid / stage stuff |
|
This PR's content was already merged to main via commit dbc0700. Closing as duplicate. |
|
Reopening — the commit exists only on this branch, not on main. Previous close was a mistake. |
|
@alex-jw-brooks @hsliuustc0106 Thanks for looking into this! Reference data from PR #2393 (same optimization, applied to Wan2.2): 25% speedup on NPU. Helios uses the exact same interleaved RoPE pattern.

Why the improvement may be less visible in some setups: the RoPE cost can be masked by other ops in end-to-end runs. To better isolate the RoPE difference, we could profile just the RoPE function:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
    # run a few forward passes
    ...

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

@alex-jw-brooks Could you share your benchmark config (model variant, resolution, steps, GPU type)? That would help narrow down whether the difference is masked by other ops or is genuinely minimal on your hardware. |
|
Gentle ping @hsliuustc0106 @alex-jw-brooks — any further thoughts on the benchmark discussion above? Happy to provide more details if needed. |
|
@hsliuustc0106 @alex-jw-brooks Friendly follow-up — this has been open for ~12 days. If the benchmark concern is a blocker, I'm happy to add a micro-benchmark script that profiles just the RoPE function in isolation. Otherwise, if the scope is acceptable as-is (code + equivalence test), a review would be appreciated. Thanks! |
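Such an isolated micro-benchmark could be sketched as follows; the RoPE body here is a generic interleaved rotation with illustrative names and shapes, not the actual Helios implementation:

```python
import time
import torch

def rope_stack_flatten(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Generic interleaved RoPE using the stack+flatten pattern under discussion.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out_even = x_even * cos - x_odd * sin
    out_odd = x_even * sin + x_odd * cos
    return torch.stack((out_even, out_odd), dim=-1).flatten(-2, -1)

def bench(fn, *args, iters: int = 50) -> float:
    # Average seconds per call; synchronize around the loop if CUDA is in play.
    fn(*args)  # warm-up
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 1024, 8, 128)    # (batch, tokens, heads, head_dim) -- illustrative
cos = torch.randn(1, 1024, 1, 64)   # broadcasts over the heads axis
sin = torch.randn(1, 1024, 1, 64)
print(f"stack+flatten RoPE: {bench(rope_stack_flatten, x, cos, sin) * 1e3:.3f} ms/iter")
```

Swapping the body for the strided-write variant and comparing the two numbers would quantify the RoPE-only difference independent of the rest of the model.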
lishunyang12
left a comment
Review: perf(helios) — stack+flatten RoPE
Verdict: Approve
Correctness
The mathematical equivalence is sound. The original code writes interleaved results via strided slice assignment:
```python
out[..., 0::2] = even_vals
out[..., 1::2] = odd_vals
```

The replacement `torch.stack((even_vals, odd_vals), dim=-1).flatten(-2, -1)` produces the identical interleaving: for each position i in the head_dim/2 axis, the stack creates `[even_i, odd_i]`, and flatten merges them into `[even_0, odd_0, even_1, odd_1, ...]` — which matches the `0::2` / `1::2` layout exactly.
The cos/sin sub-indexing (cos[..., 0::2], sin[..., 1::2]) is preserved unchanged from the original, so no risk of swapped frequencies.
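The equivalence claim is easy to check mechanically. A minimal sketch (function names are mine, not the production helpers):

```python
import torch

def interleave_strided(even_vals: torch.Tensor, odd_vals: torch.Tensor) -> torch.Tensor:
    # Original pattern: strided slice assignment into an empty buffer.
    out = torch.empty(*even_vals.shape[:-1], even_vals.shape[-1] * 2,
                      dtype=even_vals.dtype, device=even_vals.device)
    out[..., 0::2] = even_vals
    out[..., 1::2] = odd_vals
    return out

def interleave_stacked(even_vals: torch.Tensor, odd_vals: torch.Tensor) -> torch.Tensor:
    # Replacement pattern: stack on a new trailing axis, then flatten it away.
    return torch.stack((even_vals, odd_vals), dim=-1).flatten(-2, -1)

even = torch.randn(2, 8, 64)   # (batch, heads, head_dim // 2) -- illustrative shapes
odd = torch.randn(2, 8, 64)
assert torch.equal(interleave_strided(even, odd), interleave_stacked(even, odd))
```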
Performance
Good change. The original pattern (torch.empty_like + two strided writes) produces non-contiguous write patterns that are unfriendly to GPU/NPU memory subsystems. The stack+flatten approach builds the result in a single contiguous allocation. The PR description cites a 25% speedup on NPU from the same pattern in PR #2393 for Wan2.2, which is plausible.
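The non-sequential write pattern is visible directly in the strides of the views the original code assigns into (a toy illustration, not the Helios code itself):

```python
import torch

out = torch.empty(2, 4, 128)         # stand-in for the RoPE output buffer
even_view = out[..., 0::2]           # the view the original code writes into

print(even_view.is_contiguous())     # False
print(even_view.stride())            # last-axis stride is 2: writes land 2 elements apart
```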
Tests
The 11 unit tests are well-structured:
- Bit-exact (`atol=0, rtol=0`) across float32/float16/bfloat16 — good; this is a purely algebraic rearrangement, so zero tolerance is correct.
- Shape coverage includes a video-scale case (8192 tokens), which exercises realistic workloads.
- Contiguity assertion directly validates the optimization goal.
- Odd `head_dim` guard test is a nice edge-case check.
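For illustration, the bit-exact / contiguity / odd-`head_dim` checks described above could look like the following sketch against a simplified stand-in (a degenerate zero-angle rotation, not the real helper in `helios_transformer.py`):

```python
import torch

def interleaved_identity(x: torch.Tensor) -> torch.Tensor:
    # Degenerate rotation (angle 0): de-interleave, then re-interleave via stack+flatten.
    if x.shape[-1] % 2 != 0:
        raise ValueError("head_dim must be even for interleaved RoPE")
    return torch.stack((x[..., 0::2], x[..., 1::2]), dim=-1).flatten(-2, -1)

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    x = torch.randn(2, 4, 128).to(dtype)
    out = interleaved_identity(x)
    assert torch.equal(out, x)      # bit-exact: atol=0, rtol=0
    assert out.is_contiguous()      # the contiguity goal of the optimization

try:
    interleaved_identity(torch.randn(2, 4, 7))
except ValueError:
    pass
else:
    raise AssertionError("odd head_dim should be rejected")
```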
Minor observations (non-blocking)
- Test helper duplicates production code: `_apply_rotary_emb_helios_optimized` in the test file is a copy of the production function. If the production function later diverges (e.g., someone adds in-place ops), the test would still pass against the stale copy. Consider importing the production function directly, or add a comment noting that the duplication is intentional for isolation.
- `type_as` placement: `.type_as(hidden_states)` is now called on the `flatten` result, which is fine. If `hidden_states` is already the same dtype (the common case), this is a no-op — no concern here.
LGTM. Clean, well-tested optimization.
…memory (vllm-project#2436) Replace strided slice assignment (`out[..., 0::2] = ...`) with `torch.stack(..., dim=-1).flatten(-2, -1)` in Helios RoPE, matching the optimization applied to Wan2.2 in PR vllm-project#2393 (25% speedup on NPU). The stack+flatten pattern produces contiguous memory layout, avoiding non-sequential write patterns that hurt GPU/NPU cache performance. Math is identical — verified bit-exact across float32/float16/bfloat16. This is Phase 1 of vllm-project#2436 (Helios only). Phase 2 (unified utility) to follow. Signed-off-by: willamhou <willamhou@ceresman.com>
force-pushed from dbc0700 to ccd0c5d
|
Rebased onto latest main. @hsliuustc0106 @alex-jw-brooks ready for review. |
|
@willamhou Please check the L3 CI failure https://buildkite.com/vllm/vllm-omni/builds/7241/steps/canvas and resolve it. Thanks |
|
@Gaohan123 Thanks for flagging! The failing test appears unrelated to this change — it looks like a pre-existing flaky test or an issue introduced by another commit merged around the same time. The Helios-specific tests all pass. |
|
Correction — there are actually two failing tests, not one:
I've checked all five failing test files: none of them import helios code, share conftest fixtures with helios, or have any indirect dependency path to the Helios code. This is a merge CI run on main (full test matrix); the failures are unrelated flaky tests or resource contention. |
…memory (vllm-project#2474) Signed-off-by: willamhou <willamhou@ceresman.com> Co-authored-by: willamhou <willamhou@ceresman.com>
Summary
Phase 1 of #2436 — applies the stack+flatten RoPE optimization to `helios_transformer`, matching what PR #2393 did for Wan2.2 (25% speedup on NPU).

Before (strided slice assignment — non-sequential writes):

After (stack+flatten — contiguous memory):
Math is identical. Output is bit-exact across float32/float16/bfloat16.
Change: 10 insertions, 4 deletions in `helios_transformer.py`.

Test plan