[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill#38361

Open
arpera wants to merge 12 commits into vllm-project:main from arpera:artem/remove-extra-d2h-copy

Conversation

arpera (Contributor) commented Mar 27, 2026

Purpose

Eliminate a GPU→CPU sync caused by .tolist() inside prepare_chunk_indices during GDN prefill.

prepare_chunk_indices (in vllm/model_executor/layers/fla/ops/index.py) calls .tolist() on a GPU tensor, which triggers a blocking GPU→CPU synchronization. The function is decorated with tensor_cache (identity-based LRU cache), so only the first call per step actually syncs — but that one sync is enough to stall the pipeline.
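
To illustrate the caching behavior, here is a minimal identity-keyed cache sketch (hypothetical names, loosely modeled on @tensor_cache; not the actual vLLM implementation):

```python
import torch

call_count = 0

def tensor_cache_sketch(fn):
    # Hypothetical single-entry cache keyed on tensor *identity*,
    # loosely modeled on @tensor_cache: it only hits when the same
    # tensor object is passed again, which holds for cu_seqlens
    # within a single scheduler step.
    last_key, last_val = None, None

    def wrapper(cu_seqlens, chunk_size):
        nonlocal last_key, last_val
        key = (id(cu_seqlens), chunk_size)
        if key != last_key:
            last_val = fn(cu_seqlens, chunk_size)
            last_key = key
        return last_val

    return wrapper

@tensor_cache_sketch
def prepare_chunk_indices_stub(cu_seqlens, chunk_size):
    global call_count
    call_count += 1
    # On a CUDA tensor, this .tolist() is the blocking GPU->CPU sync.
    return cu_seqlens.tolist()

cu = torch.tensor([0, 100, 160])  # CPU here; GPU-resident in the real path
prepare_chunk_indices_stub(cu, 64)
prepare_chunk_indices_stub(cu, 64)          # cache hit: same tensor object
prepare_chunk_indices_stub(cu.clone(), 64)  # miss: equal values, new object
assert call_count == 2  # only the first call and the new object ran fn
```

This is why only the first call per step pays the sync: later calls in the same step pass the same tensor objects and hit the cache.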

Fix: Pre-compute chunk_indices and chunk_offsets on CPU in GDNAttentionMetadataBuilder.build() (where cu_seqlens_cpu is already available, so no sync needed), async-copy them to GPU, and thread them as explicit optional parameters through the entire FLA ops chain. When provided, downstream ops skip both the computation and the tensor_cache lookup. The @tensor_cache fallback remains unchanged for callers that do not pass pre-computed values (KDA, OLMo Hybrid).

Also extracts hardcoded chunk_size=64 into FLA_CHUNK_SIZE constant.

Changes:

  • vllm/model_executor/layers/fla/ops/utils.py: add FLA_CHUNK_SIZE constant
  • vllm/v1/attention/backends/gdn_attn.py: pre-compute chunk_indices/chunk_offsets on CPU, store in GDNAttentionMetadata
  • vllm/model_executor/layers/mamba/gdn_linear_attn.py: pass chunk_indices/chunk_offsets from metadata to FLA ops
  • vllm/model_executor/layers/fla/ops/chunk.py, cumsum.py, chunk_o.py, chunk_delta_h.py, chunk_scaled_dot_kkt.py, solve_tril.py, wy_fast.py: accept optional chunk_indices (and chunk_offsets for chunk_delta_h), skip tensor_cache when provided
  • vllm/model_executor/layers/fla/ops/kda.py: use FLA_CHUNK_SIZE constant

Test Plan

  • Ran vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 with GDN attention and verified no GPU→CPU sync from prepare_chunk_indices in Nsight Systems profile.

Test Result

Before: Nsight Systems shows a GPU→CPU sync (.tolist()) on the first prepare_chunk_indices call per step during GDN prefill.
After: No GPU→CPU sync, no tensor_cache overhead on the GDN prefill path.

Perf & Eval Testing

Server (same for all tests):

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 -tp 1 -pp 1 -dp 8 \
    --enable-expert-parallel --language-model-only \
    --reasoning-parser qwen3 --stream-interval 100

Prefill Benchmark

vllm bench serve --backend vllm --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --port 8000 --endpoint /v1/completions --dataset-name random \
    --random-input 32768 --random-output 1 --max-concurrency 128 \
    --num-prompt 128 --ignore-eos --temperature 0.0
| Config | Total Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|---|
| Before (main) | 154,344 | 14,020 | 13,999 | 26,151 |
| After (this PR) | 153,575 | 14,274 | 14,103 | 26,204 |

GSM8K Eval

python3 tests/evals/gsm8k/gsm8k_eval.py
| Config | Accuracy | Invalid Rate | Tokens/sec |
|---|---|---|---|
| Before (main) | 0.879 | 0.024 | 1,474 |
| After (this PR) | 0.886 | 0.017 | 1,547 |

No perf or accuracy regression — results are within run-to-run noise.

Nsight Systems Profile (prefill, input 8192 tokens)

After this change, the first GDN layer in a step no longer stands out from subsequent GDN layers in CPU time — all GDN blocks are now uniform. Additionally, there are zero DtoH (GPU→CPU) memory copies during execution.

Before: DtoH memcpy is present (<0.1%), the first GDN block stalls on the GPU→CPU sync from .tolist(), forward ≈ 410 ms.


After: No DtoH memcpy at all, all GDN blocks have uniform CPU time, forward ≈ 385 ms.


@claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added the v1 label Mar 27, 2026
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a register method to the FLA operations cache utility, enabling manual insertion of entries. This functionality is used in the GDN attention backend to pre-compute chunk indices on the CPU and cache them against GPU keys, preventing performance-degrading GPU-to-CPU synchronization. Feedback includes a recommendation to refactor duplicated cache management logic into a helper function and to replace a hardcoded chunk size with a named constant.


mergify bot commented Mar 27, 2026

Hi @arpera, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

1 similar comment

@arpera arpera force-pushed the artem/remove-extra-d2h-copy branch from a6fe4f3 to dbe3abd Compare March 27, 2026 15:05

mergify bot commented Mar 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @arpera.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2026
@arpera arpera force-pushed the artem/remove-extra-d2h-copy branch from ad69400 to 3e8de13 Compare March 28, 2026 08:08
@mergify mergify bot removed the needs-rebase label Mar 28, 2026

arpera commented Mar 28, 2026

cc: @vadiklyutiy

@vadiklyutiy

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request centralizes the chunk size configuration by introducing a global FLA_CHUNK_SIZE constant across various FLA Triton kernels. It also enhances the tensor_cache utility with a new register method, which allows for manual insertion of entries into the cache. This functionality is utilized in the GDN attention backend to pre-compute chunk indices on the CPU and register them using GPU tensor keys, thereby avoiding performance-degrading GPU-to-CPU synchronizations. I have no feedback to provide as there were no review comments.

@vadiklyutiy

@claude review

@vadiklyutiy

I investigated a little bit.

chunk_indices = prepare_chunk_indices(cu_seqlens, chunk_size) is always the same within a step: cu_seqlens and chunk_size do not change during the step.

So we actually don't even need @tensor_cache: we could compute prepare_chunk_indices once in the MetadataBuilder and reuse it.
Con: this approach adds a new chunk_size arg to around 10 functions.

But for prefill_size=1000 the GDN CPU part takes 2x+ longer than the GPU part, and the default cudagraph_max_capture_size is 512, so we don't hide the CPU work in this case.

So for this case it makes sense to speed up the CPU part, because it is the bottleneck. Searching in tensor_cache takes around 4% of GDN CPU time (purple in the screenshot above).

What do you think?


arpera commented Mar 30, 2026

I agree that storing chunk_indices in metadata would be a cleaner solution here. I initially tried that approach, but the patch wasn't neat enough. If you think storing chunk_indices in metadata is the better approach, I'll implement it.

@arpera arpera requested a review from tdoublep as a code owner March 30, 2026 08:25
@vadiklyutiy

wasn't neat enough

Which parts exactly didn't look so good?

mergify bot commented Mar 30, 2026

Hi @arpera, the pre-commit checks have failed (same instructions as in the earlier mergify comment).


arpera commented Mar 30, 2026

Initially the patch itself was huge for such a small issue.

@vadiklyutiy

@claude review


arpera commented Mar 30, 2026

@claude review

1 similar comment

ZJY0516 commented Mar 30, 2026

@claude review


ZJY0516 commented Mar 30, 2026

@arpera Thanks for your contribution, do we have some perf and eval tests?


arpera commented Mar 31, 2026

@arpera Thanks for your contribution, do we have some perf and eval tests?

Yes, I uploaded perf and eval test results for Qwen3.5 to the PR description; have a look.

Review comment on this diff hunk:

    and current_platform.is_cuda()
    and current_platform.is_device_capability(90)
)
if not use_flashinfer:

Do we really need this check? I don't think there is anything wrong with computing chunk_indices and chunk_offsets even when they go unused.

Hm, it was recommended by Claude...

So, I don't agree with Claude. Looks like overoptimization that made code more complicated.


I'm not against getting rid of this check, but Claude complains about it when a different backend is used. Should I remove the check then?


Looks like overoptimization that made code more complicated.

Ok, then I'll remove this check.

@vadiklyutiy

Except for the above, LGTM.

@vadiklyutiy vadiklyutiy added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 31, 2026
@vadiklyutiy left a comment

LGTM

arpera added 9 commits March 31, 2026 17:56
prepare_chunk_indices calls .tolist() which triggers a GPU->CPU sync
when cu_seqlens is on GPU. This is called from ~7 FLA ops during
chunk_kda_fwd, and while tensor_cache prevents repeated syncs within
a step, the first call still blocks.

Fix: add a register() method to tensor_cache that allows pre-populating
the cache for a given key. In GDNAttentionMetadataBuilder.build(), we
compute chunk_indices on CPU (where cu_seqlens_cpu is already available)
and register the GPU result under the GPU cu_seqlens key. All downstream
FLA ops then hit the cache without any sync.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ODO for chunk_size constant

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Addresses reviewer feedback to avoid magic numbers across FLA triton
kernels. The constant lives in fla/ops/utils.py and is imported by
kda.py, chunk.py, chunk_o.py, and gdn_attn.py.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Replace `if non_spec_query_start_loc_cpu is not None:` guard with
`if num_prefills > 0:` so that the CPU computation, async HtoD copy,
and tensor_cache slot are not wasted on decode-only steps where FLA
chunk ops are never called.

Mixed prefill+decode batches are unaffected: num_prefills > 0 whenever
any non-spec prefill sequence is present, which is exactly when
chunk_gated_delta_rule / chunk_kda run and need the cached indices.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Eliminate GPU->CPU sync and tensor_cache overhead on the GDN prefill
path by pre-computing chunk_indices and chunk_offsets on CPU in
GDNAttentionMetadataBuilder.build() and threading them as explicit
parameters through the FLA ops chain.

Previously, prepare_chunk_indices called .tolist() on a GPU tensor,
triggering a blocking GPU->CPU sync on the first call per step, and
~7 tensor_cache lookups consumed ~4% of GDN CPU time.  Now the builder
calls prepare_chunk_indices/prepare_chunk_offsets with the already-
available CPU cu_seqlens tensor and async-copies the results to GPU.
All downstream FLA ops accept optional chunk_indices (and chunk_offsets
for chunk_delta_h) parameters; when provided they skip both the
computation and the cache lookup.  The @tensor_cache fallback remains
for callers that do not pass pre-computed values (KDA, OLMo Hybrid).

Also extracts hardcoded chunk_size=64 into FLA_CHUNK_SIZE constant.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Remove the tensor_cache register() method and _insert helper that are
no longer needed since chunk_indices is now passed directly through
the FLA call chain.  Remove the duplicate 'if num_prefills > 0' block
that redundantly recomputed chunk_indices and called register().
Restore tensor_cache to match upstream/main.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
…ication

- Guard chunk_indices/chunk_offsets pre-computation with backend check
  so FlashInfer path skips unnecessary CPU work and HtoD copies
- Add warning_once in forward_cuda for unexpected non-None chunk params
- Fix chunk_kda_scaled_dot_kkt_fwd default to FLA_CHUNK_SIZE and pass
  chunk_size explicitly from chunk_kda_fwd
- Simplify BT computation in chunk_o.py (sync with PR vllm-project#38343)
- Remove unused FLA_GDN_FIX_BT env var
- Fix mypy union-attr error for additional_config.get()

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Always pre-compute chunk_indices/chunk_offsets on CPU when num_prefills > 0,
regardless of which backend (FlashInfer or Triton) will be used.  The guard
was overoptimization that added complexity without meaningful benefit since
there is nothing wrong with unused compute of these tensors.

Also removes the related warning_once in forward_cuda that would fire
when FlashInfer received pre-computed chunk params.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
@arpera arpera force-pushed the artem/remove-extra-d2h-copy branch from 7ecc2c1 to 2760814 Compare March 31, 2026 17:57
@mgoin mgoin added the nvidia label Mar 31, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 31, 2026

mergify bot commented Mar 31, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @arpera.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 31, 2026
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Mar 31, 2026


arpera commented Apr 1, 2026

I have two CI test failures, but I'm not sure whether there are any known issues related to these tests:


Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Ready


4 participants