[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill #38361
arpera wants to merge 12 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces a register method to the FLA operations cache utility, enabling manual insertion of entries. This functionality is used in the GDN attention backend to pre-compute chunk indices on the CPU and cache them against GPU keys, preventing performance-degrading GPU-to-CPU synchronization. Feedback includes a recommendation to refactor duplicated cache management logic into a helper function and to replace a hardcoded chunk size with a named constant.
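As a rough illustration of the mechanism the review describes, a minimal sketch of an identity-keyed cache with a manual `register()` entry point might look like the following. This is a hypothetical stand-in, not vLLM's actual `tensor_cache` implementation; names and structure are assumptions.

```python
class TensorCache:
    """Sketch of an identity-keyed memo cache with manual registration.

    Hypothetical stand-in for vLLM's tensor_cache utility; the real
    decorator is an identity-based LRU cache over tensor arguments.
    """

    def __init__(self):
        self._entries = {}
        self.misses = 0

    def register(self, key, value):
        # Manually insert a pre-computed value under this key's identity.
        self._entries[id(key)] = value

    def lookup(self, key, compute):
        k = id(key)
        if k not in self._entries:
            self.misses += 1  # in the real code, this path triggers the sync
            self._entries[k] = compute(key)
        return self._entries[k]


cache = TensorCache()
gpu_key = object()  # stands in for a GPU cu_seqlens tensor
cache.register(gpu_key, [(0, 0), (0, 1)])  # CPU-computed chunk indices
result = cache.lookup(gpu_key, lambda k: None)  # hits cache, never computes
print(result, cache.misses)
```

Because the GPU result is registered up front against the GPU key, the first lookup is already a cache hit and the expensive compute path (where the sync would occur) is never taken.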
Hi @arpera, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from a6fe4f3 to dbe3abd
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ad69400 to 3e8de13
cc: @vadiklyutiy

/gemini review
Code Review
This pull request centralizes the chunk size configuration by introducing a global FLA_CHUNK_SIZE constant across various FLA Triton kernels. It also enhances the tensor_cache utility with a new register method, which allows for manual insertion of entries into the cache. This functionality is utilized in the GDN attention backend to pre-compute chunk indices on the CPU and register them using GPU tensor keys, thereby avoiding performance-degrading GPU-to-CPU synchronizations. I have no feedback to provide as there were no review comments.
@claude review
I agree that storing

What exactly did not look so good?
Initially, the patch itself was huge for such a small issue.
@claude review
@arpera Thanks for your contribution! Do we have some perf and eval tests?

Yes, I uploaded perf and eval test results for Qwen3.5 in the PR description; have a look.
```python
    and current_platform.is_cuda()
    and current_platform.is_device_capability(90)
)
if not use_flashinfer:
```
Do we really need this check? I don't think there is anything wrong with the unused compute of `chunk_indices` and `chunk_offsets`.
Hm, it was recommended by Claude...
So, I don't agree with Claude. Looks like overoptimization that made the code more complicated.
I am not against getting rid of this check, but Claude complains about it when we use a different backend. Should I remove this check then?
> Looks like overoptimization that made code more complicated.
Ok, then I'll remove this check.
Except for the above, LGTM.
prepare_chunk_indices calls .tolist() which triggers a GPU->CPU sync when cu_seqlens is on GPU. This is called from ~7 FLA ops during chunk_kda_fwd, and while tensor_cache prevents repeated syncs within a step, the first call still blocks. Fix: add a register() method to tensor_cache that allows pre-populating the cache for a given key. In GDNAttentionMetadataBuilder.build(), we compute chunk_indices on CPU (where cu_seqlens_cpu is already available) and register the GPU result under the GPU cu_seqlens key. All downstream FLA ops then hit the cache without any sync. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ODO for chunk_size constant Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Addresses reviewer feedback to avoid magic numbers across FLA triton kernels. The constant lives in fla/ops/utils.py and is imported by kda.py, chunk.py, chunk_o.py, and gdn_attn.py. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Replace `if non_spec_query_start_loc_cpu is not None:` guard with `if num_prefills > 0:` so that the CPU computation, async HtoD copy, and tensor_cache slot are not wasted on decode-only steps where FLA chunk ops are never called. Mixed prefill+decode batches are unaffected: num_prefills > 0 whenever any non-spec prefill sequence is present, which is exactly when chunk_gated_delta_rule / chunk_kda run and need the cached indices. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Eliminate GPU->CPU sync and tensor_cache overhead on the GDN prefill path by pre-computing chunk_indices and chunk_offsets on CPU in GDNAttentionMetadataBuilder.build() and threading them as explicit parameters through the FLA ops chain. Previously, prepare_chunk_indices called .tolist() on a GPU tensor, triggering a blocking GPU->CPU sync on the first call per step, and ~7 tensor_cache lookups consumed ~4% of GDN CPU time. Now the builder calls prepare_chunk_indices/prepare_chunk_offsets with the already- available CPU cu_seqlens tensor and async-copies the results to GPU. All downstream FLA ops accept optional chunk_indices (and chunk_offsets for chunk_delta_h) parameters; when provided they skip both the computation and the cache lookup. The @tensor_cache fallback remains for callers that do not pass pre-computed values (KDA, OLMo Hybrid). Also extracts hardcoded chunk_size=64 into FLA_CHUNK_SIZE constant. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
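A minimal sketch of the optional-parameter pattern this commit describes. Names here (`chunk_fwd`, `cached_prepare_chunk_indices`) are hypothetical stand-ins, and `functools.lru_cache` stands in for vLLM's `@tensor_cache` decorator; the point is only the control flow: callers that pass pre-computed indices skip both the computation and the cache lookup.

```python
from functools import lru_cache

computed = {"count": 0}


@lru_cache(maxsize=None)  # stand-in for the @tensor_cache fallback
def cached_prepare_chunk_indices(cu_seqlens):
    computed["count"] += 1  # in the real code, this path triggers the sync
    return tuple(range(len(cu_seqlens) - 1))


def chunk_fwd(cu_seqlens, chunk_indices=None):
    # When pre-computed indices are passed in, skip both the computation
    # and the cache lookup; otherwise fall back to the cached path.
    if chunk_indices is None:
        chunk_indices = cached_prepare_chunk_indices(cu_seqlens)
    return chunk_indices


pre = (0, 1)  # computed up front on CPU by the metadata builder
print(chunk_fwd((0, 100, 164), chunk_indices=pre), computed["count"])
print(chunk_fwd((0, 100, 164)), computed["count"])
```

The first call never touches the cache; only a caller that omits `chunk_indices` (as KDA and OLMo Hybrid do, per the commit) pays for the computation once.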
Remove the tensor_cache register() method and _insert helper that are no longer needed since chunk_indices is now passed directly through the FLA call chain. Remove the duplicate 'if num_prefills > 0' block that redundantly recomputed chunk_indices and called register(). Restore tensor_cache to match upstream/main. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ication

- Guard chunk_indices/chunk_offsets pre-computation with backend check so FlashInfer path skips unnecessary CPU work and HtoD copies
- Add warning_once in forward_cuda for unexpected non-None chunk params
- Fix chunk_kda_scaled_dot_kkt_fwd default to FLA_CHUNK_SIZE and pass chunk_size explicitly from chunk_kda_fwd
- Simplify BT computation in chunk_o.py (sync with PR vllm-project#38343)
- Remove unused FLA_GDN_FIX_BT env var
- Fix mypy union-attr error for additional_config.get()

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Always pre-compute chunk_indices/chunk_offsets on CPU when num_prefills > 0, regardless of which backend (FlashInfer or Triton) will be used. The guard was overoptimization that added complexity without meaningful benefit since there is nothing wrong with unused compute of these tensors. Also removes the related warning_once in forward_cuda that would fire when FlashInfer received pre-computed chunk params. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
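The resulting control flow in the builder could be sketched as follows. This is a hypothetical helper mirroring the commit's logic, not vLLM's actual code; the `(seq_idx, local_chunk_idx)` pair layout is assumed from the FLA chunked varlen scheme.

```python
import math


def maybe_precompute_chunk_indices(num_prefills, cu_seqlens_cpu, chunk_size=64):
    """Pre-compute on CPU whenever any prefill is present, for any backend.

    Hypothetical sketch: decode-only steps skip the work entirely, since
    the FLA chunk ops that consume these indices never run there.
    """
    if num_prefills == 0:
        return None  # decode-only step
    return [
        (n, c)
        for n in range(len(cu_seqlens_cpu) - 1)
        for c in range(
            math.ceil((cu_seqlens_cpu[n + 1] - cu_seqlens_cpu[n]) / chunk_size)
        )
    ]


print(maybe_precompute_chunk_indices(0, [0, 128]))       # decode-only step
print(maybe_precompute_chunk_indices(1, [0, 100, 164]))  # one prefill present
```

With the backend guard removed, the only remaining condition is `num_prefills > 0`, which is exactly when the chunk ops run.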
Force-pushed from 7ecc2c1 to 2760814
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
I have two CI test failures, but I'm not sure whether there are any known issues related to these tests:

Purpose
Eliminate a GPU→CPU sync caused by `.tolist()` inside `prepare_chunk_indices` during GDN prefill. `prepare_chunk_indices` (in `vllm/model_executor/layers/fla/ops/index.py`) calls `.tolist()` on a GPU tensor, which triggers a blocking GPU→CPU synchronization. The function is decorated with `tensor_cache` (an identity-based LRU cache), so only the first call per step actually syncs, but that one sync is enough to stall the pipeline.

Fix: Pre-compute `chunk_indices` and `chunk_offsets` on CPU in `GDNAttentionMetadataBuilder.build()` (where `cu_seqlens_cpu` is already available, so no sync is needed), async-copy them to GPU, and thread them as explicit optional parameters through the entire FLA ops chain. When provided, downstream ops skip both the computation and the `tensor_cache` lookup. The `@tensor_cache` fallback remains unchanged for callers that do not pass pre-computed values (KDA, OLMo Hybrid).

Also extracts the hardcoded `chunk_size=64` into a `FLA_CHUNK_SIZE` constant.

Changes:
- `vllm/model_executor/layers/fla/ops/utils.py`: add `FLA_CHUNK_SIZE` constant
- `vllm/v1/attention/backends/gdn_attn.py`: pre-compute `chunk_indices`/`chunk_offsets` on CPU, store in `GDNAttentionMetadata`
- `vllm/model_executor/layers/mamba/gdn_linear_attn.py`: pass `chunk_indices`/`chunk_offsets` from metadata to FLA ops
- `vllm/model_executor/layers/fla/ops/chunk.py`, `cumsum.py`, `chunk_o.py`, `chunk_delta_h.py`, `chunk_scaled_dot_kkt.py`, `solve_tril.py`, `wy_fast.py`: accept optional `chunk_indices` (and `chunk_offsets` for `chunk_delta_h`), skip `tensor_cache` when provided
- `vllm/model_executor/layers/fla/ops/kda.py`: use `FLA_CHUNK_SIZE` constant

Test Plan
Ran `vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4` with GDN attention and verified no GPU→CPU sync from `prepare_chunk_indices` in an Nsight Systems profile.

Test Result
Before: Nsight Systems shows a GPU→CPU sync (`.tolist()`) on the first `prepare_chunk_indices` call per step during GDN prefill.

After: No GPU→CPU sync, no `tensor_cache` overhead on the GDN prefill path.

Perf & Eval Testing
Server (same for all tests):

```shell
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel --language-model-only \
    --reasoning-parser qwen3 --stream-interval 100
```

Prefill Benchmark

```shell
vllm bench serve --backend vllm --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 --endpoint /v1/completions --dataset-name random \
    --random-input 32768 --random-output 1 --max-concurrency 128 \
    --num-prompt 128 --ignore-eos --temperature 0.0
```

GSM8K Eval
No perf or accuracy regression — results are within run-to-run noise.
Nsight Systems Profile (prefill, input 8192 tokens)
After this change, the first GDN layer in a step no longer stands out from subsequent GDN layers in CPU time — all GDN blocks are now uniform. Additionally, there are zero DtoH (GPU→CPU) memory copies during execution.
Before: DtoH memcpy is present (<0.1%), the first GDN block stalls on the GPU→CPU sync from `.tolist()`, forward ≈ 410 ms.

After: No DtoH memcpy at all, all GDN blocks have uniform CPU time, forward ≈ 385 ms.
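For intuition on what the builder pre-computes without touching the GPU, here is a minimal pure-Python sketch of deriving per-sequence chunk offsets from a CPU `cu_seqlens` list, assuming the 64-token `FLA_CHUNK_SIZE` layout described above. The helper name and exact output convention are illustrative, not the actual vLLM code.

```python
import math

FLA_CHUNK_SIZE = 64  # the constant this PR extracts from hardcoded 64s


def prepare_chunk_offsets_cpu(cu_seqlens, chunk_size=FLA_CHUNK_SIZE):
    """Cumulative chunk counts: offsets[n] is the first global chunk of
    sequence n. Pure CPU arithmetic, so no GPU->CPU sync is required."""
    offsets = [0]
    for n in range(len(cu_seqlens) - 1):
        seqlen = cu_seqlens[n + 1] - cu_seqlens[n]
        offsets.append(offsets[-1] + math.ceil(seqlen / chunk_size))
    return offsets


# Sequences of 100 and 64 tokens -> 2 chunks + 1 chunk
print(prepare_chunk_offsets_cpu([0, 100, 164]))
```

Since `cu_seqlens_cpu` is already host-resident in `GDNAttentionMetadataBuilder.build()`, this arithmetic runs without blocking, and only the small result tensors need an async HtoD copy.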