
[Bugfix][V1][TurboQuant] Warm up decode kernels#42215

Open
lesj0610 wants to merge 1 commit intovllm-project:mainfrom
lesj0610:lesj/tq-decode-jit-warmup-20260510

Conversation

@lesj0610
Contributor

Problem

TurboQuant decode kernels (_tq_decode_stage1, _tq_decode_stage2) are not compiled during V1 startup warmup. The dummy/profile run does not always go through the TQ decode path, so these kernels compile on the first real decode request, after the JIT monitor has already started.

There is also a workspace problem. Decode scratch buffers from WorkspaceManager must be allocated before CUDA graph capture calls lock_workspace(). If warmup skips this allocation, the first decode request tries to grow the locked workspace and crashes.
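To make the failure mode concrete, here is a minimal sketch assuming a simplified locking contract. This is not the real vLLM WorkspaceManager, just a toy model of the behavior described above:

```python
# Toy model of the workspace locking contract described in this PR.
# Names and message format mirror the report; the real class differs.


class ToyWorkspace:
    def __init__(self) -> None:
        self._buf = bytearray()
        self._locked = False

    def lock_workspace(self) -> None:
        # In the scenario above, this is called at CUDA graph capture time.
        self._locked = True

    def get(self, nbytes: int, tag: str) -> memoryview:
        if nbytes > len(self._buf):
            # Growing the buffer is only legal before lock_workspace().
            assert not self._locked, (
                f"Workspace is locked but allocation from {tag!r} requires "
                f"{nbytes / 2**20:.2f} MB, current size is "
                f"{len(self._buf) / 2**20:.2f} MB"
            )
            self._buf = bytearray(nbytes)
        return memoryview(self._buf)[:nbytes]


ws = ToyWorkspace()
# Without warmup: capture locks first, so the first real decode request is
# the first caller to size the scratch buffer, and the assertion fires.
ws.lock_workspace()
# ws.get(2 * 2**20, "turboquant_attn.py:_decode")  # AssertionError
# With warmup: call ws.get(...) at the decode path's max size before locking.
```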

Approach

Added a TurboQuant decode warmup step inside kernel_warmup(). It scans the model's attention layers, finds the TurboQuant ones, and for each unique compile-key config runs _decode_attention() with synthetic tensors. This covers both kernel compilation and workspace pre-allocation in one path.

Layers sharing the same Triton compile constants are deduplicated. There is no model forward pass, only a backend-level decode call.
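A hedged sketch of what this warmup step looks like; the attribute names (attn_backend, num_kv_heads, head_size, block_size) and the _decode_attention signature are illustrative stand-ins, not the exact vLLM internals:

```python
import torch


def turboquant_decode_warmup(model: torch.nn.Module, runner) -> None:
    """Compile TQ decode kernels and size workspace before serving."""
    seen: set[tuple[int, int, int]] = set()
    for layer in model.modules():
        backend = getattr(layer, "attn_backend", None)
        if backend is None or "turboquant" not in type(backend).__name__.lower():
            continue
        # Dedup: layers that lower to identical Triton compile constants
        # share a single warmup call.
        key = (backend.num_kv_heads, backend.head_size, backend.block_size)
        if key in seen:
            continue
        seen.add(key)
        # Synthetic one-token decode batch: triggers the JIT compile of
        # _tq_decode_stage1/_tq_decode_stage2 and forces the workspace
        # scratch buffers to be allocated at their steady-state size.
        # No model forward pass, only the backend-level decode call.
        query = torch.zeros(
            1, backend.num_kv_heads, backend.head_size,
            device="cuda", dtype=torch.bfloat16,
        )
        backend._decode_attention(query, runner.kv_cache, runner.block_table)
```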

I searched open PRs for TurboQuant JIT / decode warmup and found nothing.

Test Plan

```bash
.venv/bin/python -m pytest tests/model_executor/test_turboquant_warmup.py -q
pre-commit run ruff-format --files <changed files>
pre-commit run ruff-check --files <changed files>
pre-commit run mypy-3.10 --files <changed files> --hook-stage manual
git diff --check
```

Test Result

pytest: 5 passed.

Linters and type check all passed.

Runtime: Qwen3-8B with --kv-cache-dtype turboquant_4bit_nc; the first request returns HTTP 200. The _tq_decode_stage1 / _tq_decode_stage2 JIT warnings are gone, and there is no workspace lock error.

Checklist
  • Purpose
  • Test plan and results
  • AI assistance disclosed

AI assistance: Codex, Claude, Gemini.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@lesj0610
Contributor Author

@ZJY0516 @qiching @tdoublep @vadiklyutiy Sorry to bother again; this is my third PR in the #40137 area.

This time it is TurboQuant decode. _tq_decode_stage1 and _tq_decode_stage2 compile during the first real decode because startup warmup does not go through the TQ decode path. There is also a workspace buffer issue: if the buffers are not pre-allocated before CUDA graph capture locks the workspace, the first request crashes.

During warmup I call _decode_attention() directly with synthetic inputs, which handles both kernel compilation and workspace allocation. Verified on Qwen3-8B with turboquant_4bit_nc.

If you have concerns about the approach or scope, please let me know.

@mergify mergify Bot added the `bug` (Something isn't working) label on May 10, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a warmup mechanism for TurboQuant decode kernels to ensure they are compiled before serving requests, reducing latency on the first inference. It adds the turboquant_warmup.py module, integrates it into the kernel_warmup flow, and provides comprehensive unit tests. Feedback was provided on the calculation of block_table_stride: the current approach might default to an incorrect value during the initial warmup phase, and a more direct way to access the required constant from the model runner was suggested to avoid unnecessary re-compilation.
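For readers following that thread, a minimal illustration of the suggestion, assuming a hypothetical runner attribute path: reading block_table_stride from the runner's real block table keeps the warmup-time Triton compile key identical to the serving-time one, so the kernel is not recompiled on the first request.

```python
import torch


def warmup_block_table_stride(runner) -> int:
    # The attribute path here is an assumption for illustration. Falling
    # back to a synthetic default changes the Triton compile key, which
    # forces a second JIT compile on the first real decode request.
    block_table = getattr(runner, "block_table", None)
    if isinstance(block_table, torch.Tensor) and block_table.dim() == 2:
        return block_table.stride(0)  # stride the decode kernel will see
    return 0  # warmup-time default: the case the review flags
```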

Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb6fd58b07


Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated
Warm TurboQuant decode through the runtime decode helper so the decode Triton kernels and workspace buffers are initialized before serving requests.

Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Gemini <noreply@google.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 lesj0610 force-pushed the lesj/tq-decode-jit-warmup-20260510 branch from fb6fd58 to 42aafd5 on May 10, 2026 at 07:49
@MidasMining

This PR's problem statement matches Issue #41565 exactly:

"Decode scratch buffers from WorkspaceManager must be allocated before CUDA graph capture calls lock_workspace(). If warmup skips this allocation, first decode request tries to grow locked workspace and crashes."

That's the workspace lock-violation we filed in #41565 (AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB, current size is 0.26 MB). We've been pinned to the pre-#40941 fork waiting for either this PR or PR #40798 to land.

The two PRs attack the same root cause from different angles; they are complementary, not competing. Both landing would be belt-and-suspenders against the same regression, and either one resolves #41565.

For the maintainer queue: I can validate this on 8× RTX A4000 (SM86) / Nemotron-3-Super-120B-AWQ-4bit / TurboQuant — different model class than your Qwen3-8B test (hybrid Mamba+MoE+Attention vs dense attention) and different arch generation. Worth a cross-platform second data point. Happy to run a sweep at 4K / 16K / 64K / 131K cached tokens once review opens.

