TRTLLM gen-full attn spec decode & FP8 KV dequant tests #35222
ojhaanshika wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request adds two new integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization. The tests are comprehensive and correctly use a reference implementation for verification. My main feedback is on improving the maintainability of the new test file by refactoring duplicated code.
This pull request has merge conflicts that must be resolved before it can be merged.
/gemini review
Code Review
This pull request adds two valuable integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization, which previously had no test coverage. The tests are well-structured and increase coverage as intended. My review identifies a significant opportunity for improvement by refactoring duplicated code between the two new test functions to enhance maintainability.
Signed-off-by: Anshika Ojha <anshikao@nvidia.com>
/gemini review
Code Review
This pull request introduces two new integration tests for the TRTLLM attention backend in FlashInferImpl. The first test covers speculative decoding with multi-token decode requests, exercising a previously untested code path. The second test covers the scenario of an FP8 KV cache with bfloat16 queries, which triggers the FP8 dequantization path for prefill operations. The changes also include refactoring existing test code to improve modularity and reuse by extracting common setup logic into helper functions. The new tests are well-structured and correctly set up the specific conditions to target the new code paths, comparing the results against a reference SDPA implementation to ensure correctness. My review did not find any critical or high-severity issues.
Purpose
Add two new integration tests for TRTLLM gen-full attention covering previously untested code paths in FlashInferImpl.forward():
- Speculative decode (multi-token decode): exercises the q_len_per_req > 1 path (line 1583 of flashinfer.py) by injecting a SpeculativeConfig that sets reorder_batch_threshold=4, causing split_decodes_and_prefills to classify multi-token requests as decode. This path had zero test coverage at any level. A sketch of the classification idea is shown after this list.
- FP8 KV cache with bf16 queries (prefill dequant): exercises the trtllm_prefill_attn_kvfp8_dequant fallback path (lines 1464-1478 of flashinfer.py), where FP8 KV pages are dequantized to bf16 before the TRTLLM prefill kernel. The kernel-level tests explicitly skip mixed Q/KV dtypes for prefill, so this path also had zero coverage. A sketch of the dequant step is shown further below.
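For the speculative-decode test, the behavior under test is how requests are split into decodes and prefills once reorder_batch_threshold is raised. The following is a minimal sketch of that classification idea only, assuming decode requests are ordered first in the batch; classify_requests is a hypothetical stand-in for vLLM's split_decodes_and_prefills, not its actual implementation.

# Illustrative only: with reorder_batch_threshold=4, requests with up to 4
# query tokens are still treated as decodes, so speculative multi-token
# decode requests reach the q_len_per_req > 1 path instead of the prefill path.
from typing import Sequence

def classify_requests(query_lens: Sequence[int],
                      reorder_batch_threshold: int = 1) -> tuple[int, int]:
    # Assumes the batch has been reordered so decode requests come first.
    num_decodes = 0
    for q_len in query_lens:
        if q_len <= reorder_batch_threshold:
            num_decodes += 1
        else:
            break  # everything after the first prefill is treated as prefill
    return num_decodes, len(query_lens) - num_decodes

# e.g. 3 draft tokens + 1 verified token per request -> all classified as decode
assert classify_requests([4, 4, 4], reorder_batch_threshold=4) == (3, 0)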
Combined TRTLLM-relevant coverage increased for flashinfer.py and utils/flashinfer.py.
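For the FP8 test, the fallback being exercised widens the FP8 KV pages back to bf16 so the bf16 prefill kernel can consume them alongside bf16 queries. Below is a rough sketch of that dequant step, assuming a paged KV cache laid out as [num_pages, 2, page_size, num_kv_heads, head_dim] with per-tensor k/v scales; dequant_kv_pages is illustrative only, not the actual trtllm_prefill_attn_kvfp8_dequant code.

import torch

def dequant_kv_pages(kv_fp8: torch.Tensor, k_scale: float, v_scale: float) -> torch.Tensor:
    # kv_fp8: float8_e4m3fn pages where dim 1 indexes [key, value].
    kv = kv_fp8.to(torch.bfloat16)  # widen FP8 -> bf16
    kv[:, 0] *= k_scale             # undo the key quantization scale
    kv[:, 1] *= v_scale             # undo the value quantization scale
    return kv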
Test Plan
pytest tests/v1/attention/test_trtllm_attention_integration.py -v
Requires Blackwell (SM100) GPU.
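For reference, the Blackwell requirement can be enforced with a standard compute-capability check; this guard is only a sketch (the tests may rely on vLLM's own platform utilities instead), and requires_sm100 / _has_blackwell are hypothetical names.

import pytest
import torch

def _has_blackwell() -> bool:
    # Blackwell (SM100) reports CUDA compute capability major version 10.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major == 10

requires_sm100 = pytest.mark.skipif(
    not _has_blackwell(),
    reason="TRTLLM gen-full attention requires a Blackwell (SM100) GPU",
)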
Test Result
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[decode_only] PASSED [ 20%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[prefill_only] PASSED [ 40%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[mixed] PASSED [ 60%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_spec_decode_integration PASSED [ 80%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_fp8_kv_dequant_integration PASSED [100%]
5 passed, 20 warnings in 23.32s