[Bugfix][V1][TurboQuant] Warm up decode kernels #42215
lesj0610 wants to merge 1 commit into vllm-project:main
Conversation
@ZJY0516 @qiching @tdoublep @vadiklyutiy Sorry to bother again; this is the third one from me for the #40137 area. This time it is TurboQuant decode: I call the runtime decode helper directly during warmup. If you have concerns about the approach or scope, please let me know.
Code Review
This pull request introduces a warmup mechanism for TurboQuant decode kernels to ensure they are compiled before serving requests, thereby reducing latency on the first inference. It adds the turboquant_warmup.py module, integrates it into the kernel_warmup flow, and provides comprehensive unit tests. Feedback was provided regarding the calculation of block_table_stride: the current approach might default to an incorrect value during the initial warmup phase, and a more direct way to access the required constant from the model runner was suggested to avoid unnecessary re-compilation (sketched below).
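To make the reviewed concern concrete, here is a minimal sketch of the two ways to obtain the stride. The names are illustrative, not the actual vLLM code paths, and it assumes the runner exposes a constant like max_num_blocks_per_req:

```python
# Hypothetical sketch of the block_table_stride concern; names are
# illustrative, not the actual vLLM implementation.
import torch


def stride_from_block_table(block_table: torch.Tensor) -> int:
    # Fragile: during the initial warmup the block table can be empty,
    # so this falls back to a default that differs from the stride seen
    # on real decode requests, and Triton specializes the kernel twice.
    return block_table.stride(0) if block_table.numel() > 0 else 1


def stride_from_model_runner(model_runner) -> int:
    # Suggested alternative: read the constant the model runner already
    # knows, so warmup compiles the exact kernel variant serving will use.
    return model_runner.max_num_blocks_per_req
```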
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb6fd58b07
Warm TurboQuant decode through the runtime decode helper so the decode Triton kernels and workspace buffers are initialized before serving requests.

Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Gemini <noreply@google.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Force-pushed fb6fd58 to 42aafd5
This PR's problem statement matches Issue #41565 exactly: that's the workspace lock-violation we filed there. The two PRs attack the same root cause from different angles; they are complementary, not competing. Landing both would be belt-and-suspenders against the same regression, and either one resolves #41565.

For the maintainer queue: I can validate this on 8× RTX A4000 (SM86) with Nemotron-3-Super-120B-AWQ-4bit / TurboQuant, a different model class than your Qwen3-8B test (hybrid Mamba+MoE+Attention vs dense attention) and a different arch generation, so it would be a worthwhile cross-platform second data point. Happy to run a sweep at 4K / 16K / 64K / 131K cached tokens once review opens.
Problem
TurboQuant decode kernels (`_tq_decode_stage1`, `_tq_decode_stage2`) are not compiled during V1 startup warmup. The dummy/profile run does not always go through the TQ decode path, so these kernels compile on the first real decode request, after the JIT monitor has already started.

There is also a workspace problem. Decode scratch buffers from WorkspaceManager must be allocated before CUDA graph capture calls `lock_workspace()`. If warmup skips this allocation, the first decode request tries to grow a locked workspace and crashes.
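A stripped-down sketch of the failure mode (a toy class; the real WorkspaceManager API may differ):

```python
# Toy model of the lock violation; not the real WorkspaceManager.
class WorkspaceManager:
    def __init__(self) -> None:
        self._capacity = 0
        self._locked = False

    def request(self, nbytes: int) -> None:
        # Growing is only legal before CUDA graph capture locks the pool.
        if nbytes > self._capacity:
            if self._locked:
                raise RuntimeError(
                    f"workspace locked: cannot grow {self._capacity} "
                    f"-> {nbytes} bytes")
            self._capacity = nbytes

    def lock_workspace(self) -> None:
        self._locked = True


# Without warmup: capture locks an empty pool, first decode then crashes.
ws = WorkspaceManager()
ws.lock_workspace()
try:
    ws.request(1 << 20)
except RuntimeError as e:
    print(e)  # workspace locked: cannot grow 0 -> 1048576 bytes

# With warmup: decode scratch is allocated first, the same request fits.
ws2 = WorkspaceManager()
ws2.request(1 << 20)   # warmup pre-allocates
ws2.lock_workspace()
ws2.request(1 << 20)   # fits in the existing buffer, no growth needed
```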
Approach
Added a TurboQuant decode warmup step inside `kernel_warmup()`. It scans the model's attention layers, finds the TurboQuant ones, and for each unique compile-key config runs `_decode_attention()` with synthetic tensors, as sketched below. This covers both kernel compilation and workspace pre-allocation in one path. Layers sharing the same Triton compile constants are deduplicated. There is no model forward pass, only a backend-level decode call.
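A condensed sketch of that loop; helper names such as iter_attention_layers, is_turboquant, decode_compile_key, and make_synthetic_decode_inputs are assumptions for illustration, not the module's real API:

```python
# Sketch of the dedup-by-compile-key warmup loop; helper names are assumed.
def turboquant_decode_warmup(model, device) -> None:
    seen: set = set()
    for layer in iter_attention_layers(model):              # assumed helper
        backend = getattr(layer, "attn_backend", None)
        if backend is None or not is_turboquant(backend):   # assumed check
            continue
        key = backend.decode_compile_key()  # assumed: Triton compile constants
        if key in seen:
            continue  # a layer with identical constants was already warmed
        seen.add(key)
        # Synthetic one-token decode batch: triggers JIT compilation of
        # _tq_decode_stage1/_tq_decode_stage2 and makes WorkspaceManager
        # allocate the decode scratch buffers before lock_workspace().
        q, kv_cache, block_table = make_synthetic_decode_inputs(  # assumed
            backend, batch_size=1, device=device)
        backend._decode_attention(q, kv_cache, block_table)
```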
I searched open PRs for TurboQuant JIT / decode warmup and found none.
Test Plan
Test Result
pytest: 5 passed.
Linters and type check all passed.
Runtime: Qwen3-8B with `--kv-cache-dtype turboquant_4bit_nc`; the first request returned HTTP 200, the `_tq_decode_stage1`/`_tq_decode_stage2` JIT warnings are gone, and there is no workspace lock error. A minimal offline reproduction is sketched after the checklist.

Checklist
AI assistance: Codex, Claude, Gemini.
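For reference, a minimal offline reproduction sketch, assuming `turboquant_4bit_nc` is accepted as a `kv_cache_dtype` engine argument (mirroring the `--kv-cache-dtype` CLI flag used above):

```python
# Offline repro sketch; assumes kv_cache_dtype accepts turboquant_4bit_nc.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", kv_cache_dtype="turboquant_4bit_nc")
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
# With this PR, the first decode emits no _tq_decode_stage1/_tq_decode_stage2
# JIT warnings and no workspace lock error.
print(out[0].outputs[0].text)
```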