[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill #38361
arpera wants to merge 12 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces a register method to the FLA operations cache utility, enabling manual insertion of entries. This functionality is used in the GDN attention backend to pre-compute chunk indices on the CPU and cache them against GPU keys, preventing performance-degrading GPU-to-CPU synchronization. Feedback includes a recommendation to refactor duplicated cache management logic into a helper function and to replace a hardcoded chunk size with a named constant.
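As a rough illustration of the mechanism the review describes, a minimal sketch of an identity-keyed cache with a manual `register()` entry point might look like the following. This is a hypothetical stand-in, not vLLM's actual `tensor_cache` implementation; names and structure are assumptions.

```python
class TensorCache:
    """Sketch of an identity-keyed memo cache with manual registration.

    Hypothetical stand-in for vLLM's tensor_cache utility; the real
    decorator is an identity-based LRU cache over tensor arguments.
    """

    def __init__(self):
        self._entries = {}
        self.misses = 0

    def register(self, key, value):
        # Manually insert a pre-computed value under this key's identity.
        self._entries[id(key)] = value

    def lookup(self, key, compute):
        k = id(key)
        if k not in self._entries:
            self.misses += 1  # in the real code, this path triggers the sync
            self._entries[k] = compute(key)
        return self._entries[k]


cache = TensorCache()
gpu_key = object()  # stands in for a GPU cu_seqlens tensor
cache.register(gpu_key, [(0, 0), (0, 1)])  # CPU-computed chunk indices
result = cache.lookup(gpu_key, lambda k: None)  # hits cache, never computes
print(result, cache.misses)
```

Because the GPU result is registered up front against the GPU key, the first lookup is already a cache hit and the expensive compute path (where the sync would occur) is never taken.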
Hi @arpera, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from a6fe4f3 to dbe3abd
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ad69400 to 3e8de13
cc: @vadiklyutiy

/gemini review
Code Review
This pull request centralizes the chunk size configuration by introducing a global FLA_CHUNK_SIZE constant across various FLA Triton kernels. It also enhances the tensor_cache utility with a new register method, which allows for manual insertion of entries into the cache. This functionality is utilized in the GDN attention backend to pre-compute chunk indices on the CPU and register them using GPU tensor keys, thereby avoiding performance-degrading GPU-to-CPU synchronizations. I have no feedback to provide as there were no review comments.
@claude review
I agree that storing

What exactly did not look so good?
Initially, the patch itself was huge for such a small issue.
@claude review
@arpera Thanks for your contribution! Do we have some perf and eval tests?

Yes, I uploaded perf and eval test results for Qwen3.5 in the PR description; have a look.
```python
    and current_platform.is_cuda()
    and current_platform.is_device_capability(90)
)
if not use_flashinfer:
```
Do we really need this check? I don't think there is anything wrong with the unused compute of `chunk_indices` and `chunk_offsets`.
Hm, it was recommended by Claude...
So, I don't agree with Claude. Looks like overoptimization that made the code more complicated.
I am not against getting rid of this check, but Claude complains about it when we use a different backend. Should I remove this check then?
> Looks like overoptimization that made code more complicated.
Ok, then I'll remove this check.
Except for the above, LGTM.
prepare_chunk_indices calls .tolist() which triggers a GPU->CPU sync when cu_seqlens is on GPU. This is called from ~7 FLA ops during chunk_kda_fwd, and while tensor_cache prevents repeated syncs within a step, the first call still blocks. Fix: add a register() method to tensor_cache that allows pre-populating the cache for a given key. In GDNAttentionMetadataBuilder.build(), we compute chunk_indices on CPU (where cu_seqlens_cpu is already available) and register the GPU result under the GPU cu_seqlens key. All downstream FLA ops then hit the cache without any sync. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ODO for chunk_size constant Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Addresses reviewer feedback to avoid magic numbers across FLA triton kernels. The constant lives in fla/ops/utils.py and is imported by kda.py, chunk.py, chunk_o.py, and gdn_attn.py. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Replace `if non_spec_query_start_loc_cpu is not None:` guard with `if num_prefills > 0:` so that the CPU computation, async HtoD copy, and tensor_cache slot are not wasted on decode-only steps where FLA chunk ops are never called. Mixed prefill+decode batches are unaffected: num_prefills > 0 whenever any non-spec prefill sequence is present, which is exactly when chunk_gated_delta_rule / chunk_kda run and need the cached indices. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Eliminate GPU->CPU sync and tensor_cache overhead on the GDN prefill path by pre-computing chunk_indices and chunk_offsets on CPU in GDNAttentionMetadataBuilder.build() and threading them as explicit parameters through the FLA ops chain. Previously, prepare_chunk_indices called .tolist() on a GPU tensor, triggering a blocking GPU->CPU sync on the first call per step, and ~7 tensor_cache lookups consumed ~4% of GDN CPU time. Now the builder calls prepare_chunk_indices/prepare_chunk_offsets with the already- available CPU cu_seqlens tensor and async-copies the results to GPU. All downstream FLA ops accept optional chunk_indices (and chunk_offsets for chunk_delta_h) parameters; when provided they skip both the computation and the cache lookup. The @tensor_cache fallback remains for callers that do not pass pre-computed values (KDA, OLMo Hybrid). Also extracts hardcoded chunk_size=64 into FLA_CHUNK_SIZE constant. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
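A minimal sketch of the optional-parameter pattern this commit describes. Names here (`chunk_fwd`, `cached_prepare_chunk_indices`) are hypothetical stand-ins, and `functools.lru_cache` stands in for vLLM's `@tensor_cache` decorator; the point is only the control flow: callers that pass pre-computed indices skip both the computation and the cache lookup.

```python
from functools import lru_cache

computed = {"count": 0}


@lru_cache(maxsize=None)  # stand-in for the @tensor_cache fallback
def cached_prepare_chunk_indices(cu_seqlens):
    computed["count"] += 1  # in the real code, this path triggers the sync
    return tuple(range(len(cu_seqlens) - 1))


def chunk_fwd(cu_seqlens, chunk_indices=None):
    # When pre-computed indices are passed in, skip both the computation
    # and the cache lookup; otherwise fall back to the cached path.
    if chunk_indices is None:
        chunk_indices = cached_prepare_chunk_indices(cu_seqlens)
    return chunk_indices


pre = (0, 1)  # computed up front on CPU by the metadata builder
print(chunk_fwd((0, 100, 164), chunk_indices=pre), computed["count"])
print(chunk_fwd((0, 100, 164)), computed["count"])
```

The first call never touches the cache; only a caller that omits `chunk_indices` (as KDA and OLMo Hybrid do, per the commit) pays for the computation once.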
Remove the tensor_cache register() method and _insert helper that are no longer needed since chunk_indices is now passed directly through the FLA call chain. Remove the duplicate 'if num_prefills > 0' block that redundantly recomputed chunk_indices and called register(). Restore tensor_cache to match upstream/main. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ication

- Guard chunk_indices/chunk_offsets pre-computation with backend check so FlashInfer path skips unnecessary CPU work and HtoD copies
- Add warning_once in forward_cuda for unexpected non-None chunk params
- Fix chunk_kda_scaled_dot_kkt_fwd default to FLA_CHUNK_SIZE and pass chunk_size explicitly from chunk_kda_fwd
- Simplify BT computation in chunk_o.py (sync with PR vllm-project#38343)
- Remove unused FLA_GDN_FIX_BT env var
- Fix mypy union-attr error for additional_config.get()

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Always pre-compute chunk_indices/chunk_offsets on CPU when num_prefills > 0, regardless of which backend (FlashInfer or Triton) will be used. The guard was overoptimization that added complexity without meaningful benefit since there is nothing wrong with unused compute of these tensors. Also removes the related warning_once in forward_cuda that would fire when FlashInfer received pre-computed chunk params. Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
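The resulting control flow in the builder could be sketched as follows. This is a hypothetical helper mirroring the commit's logic, not vLLM's actual code; the `(seq_idx, local_chunk_idx)` pair layout is assumed from the FLA chunked varlen scheme.

```python
import math


def maybe_precompute_chunk_indices(num_prefills, cu_seqlens_cpu, chunk_size=64):
    """Pre-compute on CPU whenever any prefill is present, for any backend.

    Hypothetical sketch: decode-only steps skip the work entirely, since
    the FLA chunk ops that consume these indices never run there.
    """
    if num_prefills == 0:
        return None  # decode-only step
    return [
        (n, c)
        for n in range(len(cu_seqlens_cpu) - 1)
        for c in range(
            math.ceil((cu_seqlens_cpu[n + 1] - cu_seqlens_cpu[n]) / chunk_size)
        )
    ]


print(maybe_precompute_chunk_indices(0, [0, 128]))       # decode-only step
print(maybe_precompute_chunk_indices(1, [0, 100, 164]))  # one prefill present
```

With the backend guard removed, the only remaining condition is `num_prefills > 0`, which is exactly when the chunk ops run.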
Force-pushed from 7ecc2c1 to 2760814
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
I have two CI test failures, but I'm not sure whether there are any known issues related to these tests:

Purpose
Eliminate a GPU→CPU sync caused by `.tolist()` inside `prepare_chunk_indices` during GDN prefill. `prepare_chunk_indices` (in `vllm/model_executor/layers/fla/ops/index.py`) calls `.tolist()` on a GPU tensor, which triggers a blocking GPU→CPU synchronization. The function is decorated with `tensor_cache` (an identity-based LRU cache), so only the first call per step actually syncs, but that one sync is enough to stall the pipeline.

Fix: Pre-compute `chunk_indices` and `chunk_offsets` on CPU in `GDNAttentionMetadataBuilder.build()` (where `cu_seqlens_cpu` is already available, so no sync is needed), async-copy them to GPU, and thread them as explicit optional parameters through the entire FLA ops chain. When provided, downstream ops skip both the computation and the `tensor_cache` lookup. The `@tensor_cache` fallback remains unchanged for callers that do not pass pre-computed values (KDA, OLMo Hybrid).

Also extracts the hardcoded `chunk_size=64` into a `FLA_CHUNK_SIZE` constant.

Changes:
- `vllm/model_executor/layers/fla/ops/utils.py`: add `FLA_CHUNK_SIZE` constant
- `vllm/v1/attention/backends/gdn_attn.py`: pre-compute `chunk_indices`/`chunk_offsets` on CPU, store in `GDNAttentionMetadata`
- `vllm/model_executor/layers/mamba/gdn_linear_attn.py`: pass `chunk_indices`/`chunk_offsets` from metadata to FLA ops
- `vllm/model_executor/layers/fla/ops/chunk.py`, `cumsum.py`, `chunk_o.py`, `chunk_delta_h.py`, `chunk_scaled_dot_kkt.py`, `solve_tril.py`, `wy_fast.py`: accept optional `chunk_indices` (and `chunk_offsets` for `chunk_delta_h`), skip `tensor_cache` when provided
- `vllm/model_executor/layers/fla/ops/kda.py`: use `FLA_CHUNK_SIZE` constant

Test Plan
Ran `vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4` with GDN attention and verified no GPU→CPU sync from `prepare_chunk_indices` in an Nsight Systems profile.

Test Result
Before: Nsight Systems shows a GPU→CPU sync (`.tolist()`) on the first `prepare_chunk_indices` call per step during GDN prefill.

After: No GPU→CPU sync, no `tensor_cache` overhead on the GDN prefill path.

Perf & Eval Testing
Server (same for all tests):

```shell
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel --language-model-only \
    --reasoning-parser qwen3 --stream-interval 100
```

Prefill Benchmark

```shell
vllm bench serve --backend vllm --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 --endpoint /v1/completions --dataset-name random \
    --random-input 32768 --random-output 1 --max-concurrency 128 \
    --num-prompt 128 --ignore-eos --temperature 0.0
```

GSM8K Eval
No perf or accuracy regression — results are within run-to-run noise.
Nsight Systems Profile (prefill, input 8192 tokens)
After this change, the first GDN layer in a step no longer stands out from subsequent GDN layers in CPU time — all GDN blocks are now uniform. Additionally, there are zero DtoH (GPU→CPU) memory copies during execution.
Before: DtoH memcpy is present (<0.1%), the first GDN block stalls on the GPU→CPU sync from `.tolist()`, forward ≈ 410 ms.

After: No DtoH memcpy at all, all GDN blocks have uniform CPU time, forward ≈ 385 ms.
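For intuition on what the builder pre-computes without touching the GPU, here is a minimal pure-Python sketch of deriving per-sequence chunk offsets from a CPU `cu_seqlens` list, assuming the 64-token `FLA_CHUNK_SIZE` layout described above. The helper name and exact output convention are illustrative, not the actual vLLM code.

```python
import math

FLA_CHUNK_SIZE = 64  # the constant this PR extracts from hardcoded 64s


def prepare_chunk_offsets_cpu(cu_seqlens, chunk_size=FLA_CHUNK_SIZE):
    """Cumulative chunk counts: offsets[n] is the first global chunk of
    sequence n. Pure CPU arithmetic, so no GPU->CPU sync is required."""
    offsets = [0]
    for n in range(len(cu_seqlens) - 1):
        seqlen = cu_seqlens[n + 1] - cu_seqlens[n]
        offsets.append(offsets[-1] + math.ceil(seqlen / chunk_size))
    return offsets


# Sequences of 100 and 64 tokens -> 2 chunks + 1 chunk
print(prepare_chunk_offsets_cpu([0, 100, 164]))
```

Since `cu_seqlens_cpu` is already host-resident in `GDNAttentionMetadataBuilder.build()`, this arithmetic runs without blocking, and only the small result tensors need an async HtoD copy.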