
TRTLLM gen-full attn spec decode & FP8 KV dequant tests#35222

Open
ojhaanshika wants to merge 1 commit into vllm-project:main from ojhaanshika:trtllm-integration-tests-extended

Conversation

Contributor

@ojhaanshika ojhaanshika commented Feb 24, 2026

Purpose

Add two new integration tests for TRTLLM gen-full attention covering previously untested code paths in FlashInferImpl.forward():

  1. Speculative decode (multi-token decode): Exercises the q_len_per_req > 1 path (line 1583 of flashinfer.py) by injecting a SpeculativeConfig that sets reorder_batch_threshold=4, causing split_decodes_and_prefills to classify multi-token requests as decode. This path had zero test coverage at any level.

  2. FP8 KV cache with bf16 queries (prefill dequant): Exercises the trtllm_prefill_attn_kvfp8_dequant fallback path (lines 1464-1478 of flashinfer.py) where FP8 KV pages are dequantized to bf16 before the TRTLLM prefill kernel. The kernel-level tests explicitly skip mixed Q/KV dtypes for prefill, so this path also had zero coverage.
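The two paths above can be sketched in simplified form. This is a hypothetical, pure-Python illustration of the classification and dequantization logic being exercised, not vLLM's actual `split_decodes_and_prefills` or `trtllm_prefill_attn_kvfp8_dequant` implementation:

```python
# Simplified sketches of the two code paths under test. Function names mirror
# the vLLM helpers for readability, but the bodies here are illustrative only.

def split_decodes_and_prefills(query_lens, reorder_batch_threshold=1):
    """Requests with query length <= threshold go down the decode path.
    Raising the threshold to 4 (as the injected SpeculativeConfig does)
    lets multi-token spec-decode requests (q_len_per_req > 1) still be
    classified as decode."""
    decodes = [i for i, q in enumerate(query_lens) if q <= reorder_batch_threshold]
    prefills = [i for i, q in enumerate(query_lens) if q > reorder_batch_threshold]
    return decodes, prefills

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3 value

def dequantize_kv_page(q_page, scale):
    """FP8 KV pages are stored alongside a scale; the prefill fallback
    recovers bf16-range values as v = q * scale before calling the kernel."""
    return [q * scale for q in q_page]

# Default threshold: only single-token requests count as decode.
print(split_decodes_and_prefills([1, 4, 128]))       # ([0], [1, 2])
# Spec-decode threshold of 4: 4-token requests are decode too.
print(split_decodes_and_prefills([1, 4, 128], 4))    # ([0, 1], [2])
# Dequantizing one FP8-range page with a per-page scale of 0.5.
print(dequantize_kv_page([0.5, -1.0, 448.0], 0.5))   # [0.25, -0.5, 224.0]
```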

Combined TRTLLM-relevant coverage increased:

| File | Before | After | Change |
| --- | --- | --- | --- |
| flashinfer.py | 356/641 (56%) | 375/641 (59%) | +19 lines |
| utils/flashinfer.py | 110/252 (44%) | 137/252 (54%) | +27 lines |

Note: This PR depends on #34986, which adds the base integration test file (test_trtllm_attention_integration.py). Both PRs modify the same file; this one appends two new test functions.

Test Plan

pytest tests/v1/attention/test_trtllm_attention_integration.py -v

Requires Blackwell (SM100) GPU.

Test Result

tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[decode_only] PASSED [ 20%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[prefill_only] PASSED [ 40%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_gen_full_attention_integration[mixed] PASSED [ 60%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_spec_decode_integration PASSED [ 80%]
tests/v1/attention/test_trtllm_attention_integration.py::test_trtllm_fp8_kv_dequant_integration PASSED [100%]
5 passed, 20 warnings in 23.32s

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds two new integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization. The tests are comprehensive and correctly use a reference implementation for verification. My main feedback is on improving the maintainability of the new test file by refactoring duplicated code.

Comment thread tests/v1/attention/test_trtllm_attention_integration.py
@ojhaanshika ojhaanshika force-pushed the trtllm-integration-tests-extended branch from 6b98b9e to ad9f44d Compare February 24, 2026 21:20
@mergify
Contributor

mergify Bot commented Mar 3, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ojhaanshika.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 3, 2026
@ojhaanshika ojhaanshika force-pushed the trtllm-integration-tests-extended branch from ad9f44d to f6f2176 Compare March 3, 2026 19:14
@mergify mergify Bot removed the needs-rebase label Mar 3, 2026
@ojhaanshika ojhaanshika force-pushed the trtllm-integration-tests-extended branch from f6f2176 to 1651168 Compare March 3, 2026 19:22
@ojhaanshika ojhaanshika marked this pull request as ready for review March 3, 2026 19:29
@ojhaanshika ojhaanshika force-pushed the trtllm-integration-tests-extended branch from 1651168 to 0cef6ea Compare March 9, 2026 19:16
@mergify mergify Bot added the ci/build label Mar 9, 2026
@jasonlizhengjian
Contributor

/gemini review

Comment thread .github/workflows/macos-smoke-test.yml
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds two valuable integration tests for TRTLLM gen-full attention, covering speculative decoding and FP8 KV cache dequantization, which previously had no test coverage. The tests are well-structured and increase coverage as intended. My review identifies a significant opportunity for improvement by refactoring duplicated code between the two new test functions to enhance maintainability.

Comment thread tests/v1/attention/test_trtllm_attention_integration.py
Signed-off-by: Anshika Ojha <anshikao@nvidia.com>
@ojhaanshika ojhaanshika force-pushed the trtllm-integration-tests-extended branch from 41b23f9 to 6714622 Compare March 10, 2026 21:38
@ojhaanshika
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new integration tests for the TRTLLM attention backend in FlashInferImpl. The first test covers speculative decoding with multi-token decode requests, exercising a previously untested code path. The second test covers the scenario of an FP8 KV cache with bfloat16 queries, which triggers the FP8 dequantization path for prefill operations. The changes also include refactoring existing test code to improve modularity and reuse by extracting common setup logic into helper functions. The new tests are well-structured and correctly set up the specific conditions to target the new code paths, comparing the results against a reference SDPA implementation to ensure correctness. My review did not find any critical or high-severity issues.

