
Update deterministic test golden tokens for MLX 0.31 #202

Merged
LxYuan0420 merged 1 commit into vllm-project:main from WindChimeRan:test_update_deterministic_paged on Mar 23, 2026

Conversation

@WindChimeRan (Collaborator) commented Mar 23, 2026

Summary

  • Regenerate golden token IDs for test_paged_deterministic.py using tools/gen_golden_token_ids_for_deterministics.py on mlx 0.31.1 / mlx-lm 0.31.1
  • Paged attention path now diverges on 3/6 prompts (was 0/6 on mlx 0.22)
  • Both golden sets updated; existing either/or assertion logic unchanged (sketched below)
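
For readers unfamiliar with the test, here is a minimal sketch of the either/or pattern referenced above, assuming two golden dicts keyed by prompt. The real `test_paged_deterministic.py` is not reproduced in this PR, so the names and structure below are illustrative only:

```python
# Hypothetical sketch of the "either/or" assertion described above; function
# and variable names are illustrative, not the actual test code.
def check_prompt(prompt: str, generated_ids: list[int],
                 golden_mlx: dict[str, list[int]],
                 golden_paged: dict[str, list[int]]) -> None:
    # Accept a match against either golden set: the paged kernel and MLX's
    # inline-cache path may resolve near-tied logits differently, and both
    # outputs are considered correct.
    assert generated_ids in (golden_mlx[prompt], golden_paged[prompt]), (
        f"{prompt!r}: generated ids match neither golden set"
    )
```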

Context

After upgrading from mlx-lm 0.29.1 to mlx 0.31.1 / mlx-lm 0.31.1, the paged attention path produces different tokens than the MLX inline cache path on 3 out of 6 golden prompts (Qwen3-0.6B, greedy decoding, max_num_seqs=1). Divergence starts at token positions 2-6, depending on the prompt, where the top-2 logits are close.

Both outputs are valid English, not gibberish. This is floating-point non-determinism between the Metal paged attention kernel and MLX's built-in SDPA, not a correctness bug.
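
As a toy illustration (not project code) of why this can happen: in float32, the same reduction accumulated in two different orders can differ by a tiny amount, and whenever the top-2 logits are closer than that amount, greedy decoding can legitimately pick different tokens on the two paths. All numbers below are invented.

```python
# Toy illustration: order-dependent float32 accumulation error vs. a near-tie.
import numpy as np

rng = np.random.default_rng(0)
terms = (rng.standard_normal(4096) * 1e-2).astype(np.float32)

forward = np.float32(0.0)
for t in terms:            # one accumulation order
    forward += t
reverse = np.float32(0.0)
for t in terms[::-1]:      # the other order
    reverse += t

print("order-dependent difference:", float(forward) - float(reverse))
# If |logit[top1] - logit[top2]| is smaller than a difference of this size,
# two kernels that reduce in different orders can disagree on the argmax token.
```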

Prompts that diverge (paged vs mlx):

  • "The capital of France is" — token 5
  • "One plus one equals" — token 2
  • "Water boils at a temperature of" — token 6

Test plan

  • python -m pytest tests/test_paged_deterministic.py -v -s — 6/6 pass

Help

I need to confirm that this is expected behavior and not specific to my machine.

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan marked this pull request as draft March 23, 2026 00:26
@WindChimeRan WindChimeRan marked this pull request as ready for review March 23, 2026 00:33
@LxYuan0420 (Collaborator) commented:

Reproduced on my end on mlx 0.31.1 / mlx-lm 0.31.1.

  • MLX path: 6/6 pass, all match GOLDEN_MLX exactly.
  • Paged path: my output matches your proposed GOLDEN_PAGED token-for-token. The two prompts failing against the old golden (One plus one equals, Water boils at a temperature of) produce exactly the tokens you recorded.

Also, it seems like the divergence comes from the two rival tokens being extremely close in probability, with the Metal kernel and MLX SDPA resolving the near-tie differently; more investigation is needed to confirm this.
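
A rough, generic way to check that hypothesis from a logits vector captured at the divergence step (capturing the logits is left as an assumption; this PR adds no such tooling):

```python
# Report the top-2 token ids and their probability gap from a 1-D logits array.
# A gap on the order of float32 rounding error supports the near-tie explanation.
import numpy as np

def top2_gap(logits: np.ndarray) -> tuple[int, int, float]:
    z = logits - logits.max()            # stable softmax
    p = np.exp(z) / np.exp(z).sum()
    order = np.argsort(p)
    i1, i2 = int(order[-1]), int(order[-2])
    return i1, i2, float(p[i1] - p[i2])
```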

One minor non-blocking comment:

  1. The golden-gen comment uses VLLM_METAL_MEMORY_FRACTION=0.3, but the test default paged fraction is 0.2 (no behavioral impact, just inconsistent docs).


@LxYuan0420 (Collaborator) left a comment


LGTM

@LxYuan0420 LxYuan0420 merged commit b7b415f into vllm-project:main Mar 23, 2026
5 checks passed
LxYuan0420 pushed a commit that referenced this pull request Mar 24, 2026
…#201)

## Summary

- Extract attention-type-specific logic from the monolithic wrapper into
per-type modules (`attention_sdpa.py`, `attention_linear.py`)
- The wrapper now dispatches based on module attributes: the paged varlen
SDPA path for standard dot-product attention (MHA/GQA/MQA) and a stub for
linear attention (GatedDeltaNet); this will also unblock MLA for
glm-4.7-flash and DeepSeek.
- Change layer patching from single-attribute (`self_attn` on all
layers) to per-layer lookup — required for hybrid models like Qwen3.5
where some layers use `self_attn` and others use `linear_attn`
- Add xfail integration test for Qwen/Qwen3.5-0.8B with paged attention

**No new features. This is a refactor to unblock collaboration on
Qwen3.5 linear attention support.**

## Why attribute-based detection?

We wrap mlx_lm/mlx_vlm attention modules at runtime without modifying
their source. Since we don't own the model code, we detect attention
type by probing module attributes (e.g. `q_proj` + `o_proj` → SDPA,
`conv1d` + no `q_proj` → linear). This should works across all known
GatedDeltaNet variants (qwen3.5)

## How to add a new attention type

1. Create `attention_<type>.py` with `is_<type>()` detector +
`<type>_forward()` implementation
2. Add one `elif` branch in the wrapper dispatch (`paged_attention.py`), as in the sketch below
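
A minimal sketch of that dispatch, reusing the detector names from the previous sketch; the actual wrapper in `paged_attention.py` is not reproduced here, so the function names and signatures below are assumptions:

```python
# Illustrative only: the real per-type forwards live in attention_sdpa.py /
# attention_linear.py; these stubs just mark where they would plug in.
def sdpa_forward(module, *args, **kwargs):
    raise NotImplementedError("paged varlen SDPA path (see attention_sdpa.py)")

def linear_forward(module, *args, **kwargs):
    raise NotImplementedError("linear attention stub (see attention_linear.py)")

def wrapped_forward(module, *args, **kwargs):
    # Dispatch on the detected attention type (detectors sketched earlier).
    if is_sdpa_attention(module):
        return sdpa_forward(module, *args, **kwargs)
    elif is_linear_attention(module):
        return linear_forward(module, *args, **kwargs)
    # New attention types: add is_<type>() / <type>_forward() in
    # attention_<type>.py and one more elif branch here.
    raise NotImplementedError(f"unsupported attention module: {type(module).__name__}")
```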

## Test plan

- [x] Qwen3.5-0.8B paged attention xfail hits `NotImplementedError` on
linear attention layers
- [x] Passed the new deterministic test from #202. The refactoring
doesn't break the existing paged varlen kernel on Qwen3-0.6B

---------

Signed-off-by: ran <hzz5361@psu.edu>