TRTLLM gen-full attn spec decode & FP8 KV dequant tests #35222
ojhaanshika wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request adds two new integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization. The tests are comprehensive and correctly use a reference implementation for verification. My main feedback is on improving the maintainability of the new test file by refactoring duplicated code.
This pull request has merge conflicts that must be resolved before it can be merged.
/gemini review
Code Review
This pull request adds two valuable integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization, which previously had no test coverage. The tests are well-structured and increase coverage as intended. My review identifies a significant opportunity for improvement by refactoring duplicated code between the two new test functions to enhance maintainability.
Signed-off-by: Anshika Ojha <anshikao@nvidia.com>
/gemini review
Code Review
This pull request introduces two new integration tests for the TRTLLM attention backend in FlashInferImpl. The first test covers speculative decoding with multi-token decode requests, exercising a previously untested code path. The second test covers the scenario of an FP8 KV cache with bfloat16 queries, which triggers the FP8 dequantization path for prefill operations. The changes also include refactoring existing test code to improve modularity and reuse by extracting common setup logic into helper functions. The new tests are well-structured and correctly set up the specific conditions to target the new code paths, comparing the results against a reference SDPA implementation to ensure correctness. My review did not find any critical or high-severity issues.
Purpose
Add two new integration tests for TRTLLM gen-full attention covering previously untested code paths in FlashInferImpl.forward():
- Speculative decode (multi-token decode): exercises the q_len_per_req > 1 path (line 1583 of flashinfer.py) by injecting a SpeculativeConfig that sets reorder_batch_threshold=4, causing split_decodes_and_prefills to classify multi-token requests as decode. This path had zero test coverage at any level. A sketch of the classification idea is shown after this list.
- FP8 KV cache with bf16 queries (prefill dequant): exercises the trtllm_prefill_attn_kvfp8_dequant fallback path (lines 1464-1478 of flashinfer.py), where FP8 KV pages are dequantized to bf16 before the TRTLLM prefill kernel. The kernel-level tests explicitly skip mixed Q/KV dtypes for prefill, so this path also had zero coverage. A sketch of the dequant step is shown further below.
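For the speculative-decode test, the behavior under test is how requests are split into decodes and prefills once reorder_batch_threshold is raised. The following is a minimal sketch of that classification idea only, assuming decode requests are ordered first in the batch; classify_requests is a hypothetical stand-in for vLLM's split_decodes_and_prefills, not its actual implementation.

# Illustrative only: with reorder_batch_threshold=4, requests with up to 4
# query tokens are still treated as decodes, so speculative multi-token
# decode requests reach the q_len_per_req > 1 path instead of the prefill path.
from typing import Sequence

def classify_requests(query_lens: Sequence[int],
                      reorder_batch_threshold: int = 1) -> tuple[int, int]:
    # Assumes the batch has been reordered so decode requests come first.
    num_decodes = 0
    for q_len in query_lens:
        if q_len <= reorder_batch_threshold:
            num_decodes += 1
        else:
            break  # everything after the first prefill is treated as prefill
    return num_decodes, len(query_lens) - num_decodes

# e.g. 3 draft tokens + 1 verified token per request -> all classified as decode
assert classify_requests([4, 4, 4], reorder_batch_threshold=4) == (3, 0)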
Combined TRTLLM-relevant coverage increased for flashinfer.py and utils/flashinfer.py.
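For the FP8 test, the fallback being exercised widens the FP8 KV pages back to bf16 so the bf16 prefill kernel can consume them alongside bf16 queries. Below is a rough sketch of that dequant step, assuming a paged KV cache laid out as [num_pages, 2, page_size, num_kv_heads, head_dim] with per-tensor k/v scales; dequant_kv_pages is illustrative only, not the actual trtllm_prefill_attn_kvfp8_dequant code.

import torch

def dequant_kv_pages(kv_fp8: torch.Tensor, k_scale: float, v_scale: float) -> torch.Tensor:
    # kv_fp8: float8_e4m3fn pages where dim 1 indexes [key, value].
    kv = kv_fp8.to(torch.bfloat16)  # widen FP8 -> bf16
    kv[:, 0] *= k_scale             # undo the key quantization scale
    kv[:, 1] *= v_scale             # undo the value quantization scale
    return kv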
Test Plan
pytest tests/v1/attention/test_trtllm_attention_integration.py -v
Requires Blackwell (SM100) GPU.
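For reference, the Blackwell requirement can be enforced with a standard compute-capability check; this guard is only a sketch (the tests may rely on vLLM's own platform utilities instead), and requires_sm100 / _has_blackwell are hypothetical names.

import pytest
import torch

def _has_blackwell() -> bool:
    # Blackwell (SM100) reports CUDA compute capability major version 10.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major == 10

requires_sm100 = pytest.mark.skipif(
    not _has_blackwell(),
    reason="TRTLLM gen-full attention requires a Blackwell (SM100) GPU",
)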
Test Result
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[decode_only] PASSED [ 20%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[prefill_only] PASSED [ 40%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[mixed] PASSED [ 60%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_spec_decode_integration PASSED [ 80%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_fp8_kv_dequant_integration PASSED [100%]
5 passed, 20 warnings in 23.32s