
Update deterministic test golden tokens for MLX 0.31 #202

Merged
LxYuan0420 merged 1 commit into vllm-project:main from WindChimeRan:test_update_deterministic_paged on Mar 23, 2026

Conversation

@WindChimeRan (Collaborator) commented Mar 23, 2026

Summary

  • Regenerate golden token IDs for test_paged_deterministic.py using tools/gen_golden_token_ids_for_deterministics.py on mlx 0.31.1 / mlx-lm 0.31.1
  • Paged attention path now diverges on 3/6 prompts (was 0/6 on mlx 0.22)
  • Both golden sets updated; existing either/or assertion logic unchanged (sketched below)
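
For readers unfamiliar with the test, here is a minimal sketch of the either/or pattern referenced above, assuming two golden dicts keyed by prompt. The real `test_paged_deterministic.py` is not reproduced in this PR, so the names and structure below are illustrative only:

```python
# Hypothetical sketch of the "either/or" assertion described above; function
# and variable names are illustrative, not the actual test code.
def check_prompt(prompt: str, generated_ids: list[int],
                 golden_mlx: dict[str, list[int]],
                 golden_paged: dict[str, list[int]]) -> None:
    # Accept a match against either golden set: the paged kernel and MLX's
    # inline-cache path may resolve near-tied logits differently, and both
    # outputs are considered correct.
    assert generated_ids in (golden_mlx[prompt], golden_paged[prompt]), (
        f"{prompt!r}: generated ids match neither golden set"
    )
```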

Context

After upgrading from mlx-lm 0.29.1 to mlx 0.31.1 / mlx-lm 0.31.1, the paged attention path produces different tokens than the MLX inline cache path on 3 out of 6 golden prompts (Qwen3-0.6B, greedy decoding, max_num_seqs=1). Divergence starts at token positions 2-6, depending on the prompt, where the top-2 logits are close.

Both outputs are valid English, not gibberish. This is floating-point non-determinism between the Metal paged attention kernel and MLX's built-in SDPA, not a correctness bug.
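
As a toy illustration (not project code) of why this can happen: in float32, the same reduction accumulated in two different orders can differ by a tiny amount, and whenever the top-2 logits are closer than that amount, greedy decoding can legitimately pick different tokens on the two paths. All numbers below are invented.

```python
# Toy illustration: order-dependent float32 accumulation error vs. a near-tie.
import numpy as np

rng = np.random.default_rng(0)
terms = (rng.standard_normal(4096) * 1e-2).astype(np.float32)

forward = np.float32(0.0)
for t in terms:            # one accumulation order
    forward += t
reverse = np.float32(0.0)
for t in terms[::-1]:      # the other order
    reverse += t

print("order-dependent difference:", float(forward) - float(reverse))
# If |logit[top1] - logit[top2]| is smaller than a difference of this size,
# two kernels that reduce in different orders can disagree on the argmax token.
```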

Prompts that diverge (paged vs mlx):

  • "The capital of France is" — token 5
  • "One plus one equals" — token 2
  • "Water boils at a temperature of" — token 6

Test plan

  • python -m pytest tests/test_paged_deterministic.py -v -s — 6/6 pass

Help

I need to confirm that this is expected behavior and not specific to my machine.

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan marked this pull request as draft March 23, 2026 00:26
@WindChimeRan WindChimeRan marked this pull request as ready for review March 23, 2026 00:33
@LxYuan0420 (Collaborator) commented:

Reproduced on my end on mlx 0.31.1 / mlx-lm 0.31.1.

  • MLX path: 6/6 pass, all match GOLDEN_MLX exactly.
  • Paged path: my output matches your proposed GOLDEN_PAGED token-for-token. The two prompts failing against the old golden (One plus one equals, Water boils at a temperature of) produce exactly the tokens you recorded.

Also, it seems like the divergence comes from the two rival tokens being extremely close in probability, with the Metal kernel and MLX SDPA resolving the near-tie differently; more investigation is needed to confirm this.
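
A rough, generic way to check that hypothesis from a logits vector captured at the divergence step (capturing the logits is left as an assumption; this PR adds no such tooling):

```python
# Report the top-2 token ids and their probability gap from a 1-D logits array.
# A gap on the order of float32 rounding error supports the near-tie explanation.
import numpy as np

def top2_gap(logits: np.ndarray) -> tuple[int, int, float]:
    z = logits - logits.max()            # stable softmax
    p = np.exp(z) / np.exp(z).sum()
    order = np.argsort(p)
    i1, i2 = int(order[-1]), int(order[-2])
    return i1, i2, float(p[i1] - p[i2])
```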

One minor non-blocking comment:

  1. The golden-gen comment uses VLLM_METAL_MEMORY_FRACTION=0.3, but the test default paged fraction is 0.2 (no behavioral impact, just inconsistent docs).


@LxYuan0420 (Collaborator) left a comment


LGTM

@LxYuan0420 LxYuan0420 merged commit b7b415f into vllm-project:main Mar 23, 2026
5 checks passed
LxYuan0420 pushed a commit that referenced this pull request Mar 24, 2026
…#201)

## Summary

- Extract attention-type-specific logic from the monolithic wrapper into
per-type modules (`attention_sdpa.py`, `attention_linear.py`)
- The wrapper now dispatches based on module attributes: the paged varlen
SDPA path for standard dot-product attention (MHA/GQA/MQA) and a stub for
linear attention (GatedDeltaNet); this will also unblock MLA for
glm-4.7-flash and DeepSeek.
- Change layer patching from single-attribute (`self_attn` on all
layers) to per-layer lookup — required for hybrid models like Qwen3.5
where some layers use `self_attn` and others use `linear_attn`
- Add xfail integration test for Qwen/Qwen3.5-0.8B with paged attention

**No new features. This is a refactor to unblock collaboration on
Qwen3.5 linear attention support.**

## Why attribute-based detection?

We wrap mlx_lm/mlx_vlm attention modules at runtime without modifying
their source. Since we don't own the model code, we detect attention
type by probing module attributes (e.g. `q_proj` + `o_proj` → SDPA,
`conv1d` + no `q_proj` → linear). This should works across all known
GatedDeltaNet variants (qwen3.5)

## How to add a new attention type

1. Create `attention_<type>.py` with `is_<type>()` detector +
`<type>_forward()` implementation
2. Add one `elif` branch in the wrapper dispatch (`paged_attention.py`), as in the sketch below
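
A minimal sketch of that dispatch, reusing the detector names from the previous sketch; the actual wrapper in `paged_attention.py` is not reproduced here, so the function names and signatures below are assumptions:

```python
# Illustrative only: the real per-type forwards live in attention_sdpa.py /
# attention_linear.py; these stubs just mark where they would plug in.
def sdpa_forward(module, *args, **kwargs):
    raise NotImplementedError("paged varlen SDPA path (see attention_sdpa.py)")

def linear_forward(module, *args, **kwargs):
    raise NotImplementedError("linear attention stub (see attention_linear.py)")

def wrapped_forward(module, *args, **kwargs):
    # Dispatch on the detected attention type (detectors sketched earlier).
    if is_sdpa_attention(module):
        return sdpa_forward(module, *args, **kwargs)
    elif is_linear_attention(module):
        return linear_forward(module, *args, **kwargs)
    # New attention types: add is_<type>() / <type>_forward() in
    # attention_<type>.py and one more elif branch here.
    raise NotImplementedError(f"unsupported attention module: {type(module).__name__}")
```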

## Test plan

- [x] Qwen3.5-0.8B paged attention xfail hits `NotImplementedError` on
linear attention layers
- [x] Passed the new deterministic test from #202. The refactoring
doesn't break the existing paged varlen kernel on Qwen3-0.6B

---------

Signed-off-by: ran <hzz5361@psu.edu>