
[Paged KV] Add paged attention deterministic smoke test#138

Merged
LxYuan0420 merged 7 commits into vllm-project:main from WindChimeRan:test_deterministic
Mar 6, 2026

Conversation

WindChimeRan (Collaborator) commented Mar 5, 2026

Summary

  • Add tests/test_paged_deterministic.py: a 5-prompt smoke test using vLLM offline inference (temperature=0, greedy) against hardcoded golden token IDs from Qwen3-0.6B
  • Golden values generated on main from both MLX inline cache and HF paged KV cache paths
  • Add tools/gen_golden.py helper to regenerate golden values
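The core check can be sketched as follows (a minimal illustration only; the prompt, token IDs, and helper name are placeholders, not the actual golden values from the repo):

```python
# Hypothetical sketch of the assertion at the heart of
# tests/test_paged_deterministic.py; values below are illustrative.
GOLDEN_IDS = {
    "One plus one equals": [1234, 567, 89],  # placeholder token IDs
}

def assert_deterministic(prompt: str, generated_ids: list[int]) -> None:
    """With temperature=0 (greedy), output token IDs must match exactly."""
    expected = GOLDEN_IDS[prompt]
    assert generated_ids == expected, (
        f"divergence for {prompt!r}: got {generated_ids}, want {expected}"
    )

# A run passes only on exact, token-for-token agreement:
assert_deterministic("One plus one equals", [1234, 567, 89])
```

An exact-match assertion is deliberately strict: any kernel change that perturbs greedy decoding, even by one token, fails the test.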

Motivation

Prerequisite for the native Metal kernel PR (#136). After inlining the vendored Metal shaders, paged attention output must remain identical to the current HF kernel baseline. This test anchors that.

Test

  • python -m pytest tests/test_paged_deterministic.py -v -s (paged path by default)
  • Passes on main with HF kernel: 5/5

Relevant Issue & PR

upstream batch invariant feature

Batch invariance is hardware- and kernel-dependent. Supporting this feature is non-trivial on Metal.

Output example:
(screenshot attached)

Signed-off-by: ran <hzz5361@psu.edu>
WindChimeRan (Collaborator, Author) commented:

@LxYuan0420 request for review

LxYuan0420 (Collaborator) left a comment


Nice direction and the tools/ helper is a good addition.

To recap the main issue I ran into locally: on macOS 15 (HF compat kernel 8968951), this test can flake under the current batched llm.generate(PROMPTS, ...) setup. Specifically, greedy output for One plus one equals sometimes diverges from both golden sets.

For CI stability, I think we should make the execution mode explicit. Would you be open to forcing single-seq execution (LLM(..., max_num_seqs=1)) for this test?

If your intent is to also cover batched behavior, maybe we can add a separate batched smoke test with a weaker assertion (since exact token IDs aren't batch-invariant on Metal today).
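For reference, the proposed single-seq setup would look roughly like this (a sketch against the standard vLLM offline API; the model name is taken from the PR description, and the function name is made up for illustration):

```python
def build_single_seq_llm():
    """Sketch: construct the engine so only one sequence is scheduled at a time."""
    from vllm import LLM  # assumes vLLM is installed

    # max_num_seqs=1 caps the scheduler at one sequence per step, so
    # greedy output cannot be perturbed by batch-dependent kernel paths.
    return LLM(model="Qwen/Qwen3-0.6B", max_num_seqs=1)
```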

Review threads: tests/test_paged_deterministic.py (three threads, two outdated), tools/gen_golden.py (outdated)
LxYuan0420 (Collaborator) commented Mar 6, 2026

On my machine, I need to set max_seq_nums=1 to pass the test

(screenshot attached)

EDIT Typo: it is max_num_seqs=1 (vLLM LLM(..., max_num_seqs=1)), not max_seq_nums

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan requested a review from LxYuan0420 March 6, 2026 04:26
WindChimeRan (Collaborator, Author) commented:

On my machine, I need to set max_seq_nums=1 to pass the test

EDIT Typo: it is max_num_seqs=1 (vLLM `LLM(..., max_num_seqs=1)`), not max_seq_nums

This is really intriguing... Maybe it's due to Metal 3 vs. Metal 4, or the 0.2 × RAM auto-calculated max sequence length, or differences in our chips.

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420 (Collaborator) left a comment


Thanks for iterating on this. Forcing single-seq execution (LLM(..., max_num_seqs=1)) resolves the macOS 15 flake I hit with batched scheduling.

I also pushed a small maintainer tweak on top to (1) fix the generator script name in comments/usage and (2) make the autouse env fixture respect any user-provided env overrides so the MLX run instructions match actual behavior.

LGTM.

Re batch-invariant determinism: I’m not expecting that in this PR (I only wanted a CI-stable single-seq golden smoke test). Let’s defer any batch-invariant feature/test until we support real continuous batching on Metal.

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
@LxYuan0420 LxYuan0420 merged commit 6ecf38f into vllm-project:main Mar 6, 2026
5 checks passed

2 participants