
[Bugfix][V1][TurboQuant] Warm up decode kernels#42215

Open
lesj0610 wants to merge 1 commit intovllm-project:mainfrom
lesj0610:lesj/tq-decode-jit-warmup-20260510

Conversation

@lesj0610
Contributor

Problem

TurboQuant decode kernels (_tq_decode_stage1, _tq_decode_stage2) are not compiled during V1 startup warmup. The dummy/profile run does not always go through the TQ decode path, so these kernels compile on the first real decode request, after the JIT monitor has already started.

There is also a workspace problem. Decode scratch buffers from WorkspaceManager must be allocated before CUDA graph capture calls lock_workspace(). If warmup skips this allocation, the first decode request tries to grow the locked workspace and crashes.
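To make the failure mode concrete, here is a minimal sketch assuming a simplified locking contract. This is not the real vLLM WorkspaceManager, just a toy model of the behavior described above:

```python
# Toy model of the workspace locking contract described in this PR.
# Names and message format mirror the report; the real class differs.


class ToyWorkspace:
    def __init__(self) -> None:
        self._buf = bytearray()
        self._locked = False

    def lock_workspace(self) -> None:
        # In the scenario above, this is called at CUDA graph capture time.
        self._locked = True

    def get(self, nbytes: int, tag: str) -> memoryview:
        if nbytes > len(self._buf):
            # Growing the buffer is only legal before lock_workspace().
            assert not self._locked, (
                f"Workspace is locked but allocation from {tag!r} requires "
                f"{nbytes / 2**20:.2f} MB, current size is "
                f"{len(self._buf) / 2**20:.2f} MB"
            )
            self._buf = bytearray(nbytes)
        return memoryview(self._buf)[:nbytes]


ws = ToyWorkspace()
# Without warmup: capture locks first, so the first real decode request is
# the first caller to size the scratch buffer, and the assertion fires.
ws.lock_workspace()
# ws.get(2 * 2**20, "turboquant_attn.py:_decode")  # AssertionError
# With warmup: call ws.get(...) at the decode path's max size before locking.
```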

Approach

Added a TurboQuant decode warmup step inside kernel_warmup(). It scans the model's attention layers, finds the TurboQuant ones, and for each unique compile-key config runs _decode_attention() with synthetic tensors. This covers both kernel compilation and workspace pre-allocation in one path.

Layers sharing the same Triton compile constants are deduplicated. There is no model forward pass, only a backend-level decode call.
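A hedged sketch of what this warmup step looks like; the attribute names (attn_backend, num_kv_heads, head_size, block_size) and the _decode_attention signature are illustrative stand-ins, not the exact vLLM internals:

```python
import torch


def turboquant_decode_warmup(model: torch.nn.Module, runner) -> None:
    """Compile TQ decode kernels and size workspace before serving."""
    seen: set[tuple[int, int, int]] = set()
    for layer in model.modules():
        backend = getattr(layer, "attn_backend", None)
        if backend is None or "turboquant" not in type(backend).__name__.lower():
            continue
        # Dedup: layers that lower to identical Triton compile constants
        # share a single warmup call.
        key = (backend.num_kv_heads, backend.head_size, backend.block_size)
        if key in seen:
            continue
        seen.add(key)
        # Synthetic one-token decode batch: triggers the JIT compile of
        # _tq_decode_stage1/_tq_decode_stage2 and forces the workspace
        # scratch buffers to be allocated at their steady-state size.
        # No model forward pass, only the backend-level decode call.
        query = torch.zeros(
            1, backend.num_kv_heads, backend.head_size,
            device="cuda", dtype=torch.bfloat16,
        )
        backend._decode_attention(query, runner.kv_cache, runner.block_table)
```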

I searched open PRs for TurboQuant JIT / decode warmup and found nothing.

Test Plan

```bash
.venv/bin/python -m pytest tests/model_executor/test_turboquant_warmup.py -q
pre-commit run ruff-format --files <changed files>
pre-commit run ruff-check --files <changed files>
pre-commit run mypy-3.10 --files <changed files> --hook-stage manual
git diff --check
```

Test Result

pytest: 5 passed.

Linters and type check all passed.

Runtime: Qwen3-8B with --kv-cache-dtype turboquant_4bit_nc; the first request returns HTTP 200. The _tq_decode_stage1 / _tq_decode_stage2 JIT warnings are gone, and there is no workspace lock error.

Checklist
  • Purpose
  • Test plan and results
  • AI assistance disclosed

AI assistance: Codex, Claude, Gemini.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@lesj0610
Contributor Author

@ZJY0516 @qiching @tdoublep @vadiklyutiy Sorry to bother again; this is my third PR in the #40137 area.

This time it is TurboQuant decode. _tq_decode_stage1 and _tq_decode_stage2 compile during the first real decode because startup warmup does not go through the TQ decode path. There is also a workspace buffer issue: if the buffers are not pre-allocated before CUDA graph capture locks the workspace, the first request crashes.

During warmup I call _decode_attention() directly with synthetic inputs, which handles both kernel compilation and workspace allocation. Verified on Qwen3-8B with turboquant_4bit_nc.

If you have concerns about the approach or scope, please let me know.

@mergify mergify Bot added the `bug` (Something isn't working) label on May 10, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a warmup mechanism for TurboQuant decode kernels to ensure they are compiled before serving requests, reducing latency on the first inference. It adds the turboquant_warmup.py module, integrates it into the kernel_warmup flow, and provides comprehensive unit tests. Feedback was provided on the calculation of block_table_stride: the current approach might default to an incorrect value during the initial warmup phase, and a more direct way to access the required constant from the model runner was suggested to avoid unnecessary re-compilation.
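For readers following that thread, a minimal illustration of the suggestion, assuming a hypothetical runner attribute path: reading block_table_stride from the runner's real block table keeps the warmup-time Triton compile key identical to the serving-time one, so the kernel is not recompiled on the first request.

```python
import torch


def warmup_block_table_stride(runner) -> int:
    # The attribute path here is an assumption for illustration. Falling
    # back to a synthetic default changes the Triton compile key, which
    # forces a second JIT compile on the first real decode request.
    block_table = getattr(runner, "block_table", None)
    if isinstance(block_table, torch.Tensor) and block_table.dim() == 2:
        return block_table.stride(0)  # stride the decode kernel will see
    return 0  # warmup-time default: the case the review flags
```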

Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb6fd58b07


Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated
Warm TurboQuant decode through the runtime decode helper so the decode Triton kernels and workspace buffers are initialized before serving requests.

Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Gemini <noreply@google.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 lesj0610 force-pushed the lesj/tq-decode-jit-warmup-20260510 branch from fb6fd58 to 42aafd5 on May 10, 2026 at 07:49
@MidasMining

This PR's problem statement matches Issue #41565 exactly:

"Decode scratch buffers from WorkspaceManager must be allocated before CUDA graph capture calls lock_workspace(). If warmup skips this allocation, first decode request tries to grow locked workspace and crashes."

That's the workspace lock-violation we filed in #41565 (AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB, current size is 0.26 MB). We've been pinned to the pre-#40941 fork waiting for either this PR or PR #40798 to land.

The two PRs attack the same root cause from different angles; they are complementary, not competing. Both landing would be belt-and-suspenders against the same regression, and either one resolves #41565.

For the maintainer queue: I can validate this on 8× RTX A4000 (SM86) / Nemotron-3-Super-120B-AWQ-4bit / TurboQuant — different model class than your Qwen3-8B test (hybrid Mamba+MoE+Attention vs dense attention) and different arch generation. Worth a cross-platform second data point. Happy to run a sweep at 4K / 16K / 64K / 131K cached tokens once review opens.

