Update deterministic test golden tokens for MLX 0.31 #202
Merged
LxYuan0420 merged 1 commit on Mar 23, 2026
Conversation
Signed-off-by: ran <hzz5361@psu.edu>
Collaborator
Reproduced on my end on mlx 0.31.1 / mlx-lm 0.31.1.
Also, the divergence seems to come from the two rival tokens being extremely close in probability, with the Metal kernel and MLX SDPA resolving the tie differently, though more investigation is needed to confirm this. One minor non-blocking comment:
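For what it's worth, here is a minimal sketch (plain NumPy; the logits files are hypothetical captures from the two paths at the first divergent position) of how one could check whether the step is really a near-tie between the top-2 candidates:

```python
import numpy as np

def top2_gap(logits: np.ndarray) -> tuple[int, int, float]:
    """Return the top-2 token ids and their probability gap for one decode step."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocab
    top2 = np.argsort(probs)[-2:][::-1]       # the two most likely token ids
    return int(top2[0]), int(top2[1]), float(probs[top2[0]] - probs[top2[1]])

# Hypothetical: per-step logits saved from the paged kernel and from MLX SDPA
# at the first divergent position of a golden prompt.
paged_logits = np.load("paged_step_logits.npy")
sdpa_logits = np.load("sdpa_step_logits.npy")

for name, logits in [("paged", paged_logits), ("sdpa", sdpa_logits)]:
    best, runner_up, gap = top2_gap(logits)
    print(f"{name}: argmax={best} runner-up={runner_up} prob gap={gap:.2e}")
```

If the probability gap is on the order of the kernels' accumulated rounding error, a flipped argmax between the two paths is expected rather than a correctness bug.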
LxYuan0420 pushed a commit that referenced this pull request on Mar 24, 2026
…#201)

## Summary
- Extract attention-type-specific logic from the monolithic wrapper into per-type modules (`attention_sdpa.py`, `attention_linear.py`)
- The wrapper now dispatches based on module attributes: the paged varlen SDPA path for standard dot-product attention (MHA/GQA/MQA), and a stub for linear attention (GatedDeltaNet). This will also unblock MLA for glm-4.7-flash and deepseek.
- Change layer patching from single-attribute (`self_attn` on all layers) to a per-layer lookup, which is required for hybrid models like Qwen3.5 where some layers use `self_attn` and others use `linear_attn`
- Add an xfail integration test for Qwen/Qwen3.5-0.8B with paged attention

**No new features. This is a refactor to unblock collaboration on Qwen3.5 linear attention support.**

## Why attribute-based detection?
We wrap mlx_lm/mlx_vlm attention modules at runtime without modifying their source. Since we don't own the model code, we detect the attention type by probing module attributes (e.g. `q_proj` + `o_proj` → SDPA, `conv1d` + no `q_proj` → linear). This should work across all known GatedDeltaNet variants (Qwen3.5).

## How to add a new attention type
1. Create `attention_<type>.py` with an `is_<type>()` detector + a `<type>_forward()` implementation
2. Add one `elif` branch in the wrapper dispatch (`paged_attention.py`)

## Test plan
- [x] Qwen3.5-0.8B paged attention xfail hits `NotImplementedError` on linear attention layers
- [x] Passed the new deterministic test from #202. The refactoring doesn't break the existing paged varlen kernel on Qwen3-0.6B

---------

Signed-off-by: ran <hzz5361@psu.edu>
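For readers unfamiliar with the refactor, here is a minimal sketch of the attribute-probing dispatch described in the commit message above. The detector and forward-function names are illustrative, not the actual contents of `attention_sdpa.py` / `paged_attention.py`:

```python
def is_sdpa_attention(module) -> bool:
    # Standard dot-product attention (MHA/GQA/MQA) exposes q/k/v/o projections.
    return hasattr(module, "q_proj") and hasattr(module, "o_proj")

def is_linear_attention(module) -> bool:
    # GatedDeltaNet-style linear attention has a conv1d but no q_proj.
    return hasattr(module, "conv1d") and not hasattr(module, "q_proj")

def sdpa_forward(module, *args, **kwargs):
    # Placeholder: in the real module this runs the paged varlen SDPA kernel.
    raise NotImplementedError

def dispatch_forward(module, *args, **kwargs):
    # Wrapper dispatch: probe the wrapped module's attributes and pick a path.
    if is_sdpa_attention(module):
        return sdpa_forward(module, *args, **kwargs)
    elif is_linear_attention(module):
        raise NotImplementedError("linear attention is not paged yet")
    raise ValueError(f"unknown attention module: {type(module).__name__}")
```

Adding a new attention type then amounts to a new `is_<type>()` / `<type>_forward()` pair and one more `elif` branch.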
Summary
Update the golden token IDs in `test_paged_deterministic.py`, regenerated using `tools/gen_golden_token_ids_for_deterministics.py` on mlx 0.31.1 / mlx-lm 0.31.1.

Context
After upgrading from mlx-lm==0.29.1 to mlx 0.31.1 / mlx-lm 0.31.1, the paged attention path produces different tokens than the MLX inline cache path on 3 out of 6 golden prompts (Qwen3-0.6B, greedy decoding, max_num_seqs=1). Divergence starts at tokens 2-6, where the top-2 logits are close.
Both outputs are valid English, not gibberish. This is floating-point non-determinism between the Metal paged attention kernel and MLX's built-in SDPA, not a correctness bug.
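For context, a minimal sketch of the shape such a golden-token check takes; `generate_token_ids`, the prompt, and the token ids below are hypothetical stand-ins, not what `test_paged_deterministic.py` actually contains:

```python
import pytest

# Hypothetical golden data: prompt -> token ids from greedy decoding through
# the paged attention path on mlx 0.31.1 / mlx-lm 0.31.1 (placeholder values).
GOLDEN = {
    "Explain KV caching in one sentence.": [0, 1, 2, 3],
}

def generate_token_ids(prompt: str, max_tokens: int) -> list[int]:
    # Stand-in for the real helper: run greedy decoding with paged attention
    # and return the sampled token ids.
    raise NotImplementedError

@pytest.mark.parametrize("prompt,golden", list(GOLDEN.items()))
def test_paged_matches_golden(prompt, golden):
    assert generate_token_ids(prompt, max_tokens=len(golden)) == golden
```

Regenerating the golden lists on the new MLX versions keeps this exact-match assertion meaningful despite the tie-breaking differences described above.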
Prompts that diverge (paged vs mlx):
Test plan
`python -m pytest tests/test_paged_deterministic.py -v -s` — 6/6 pass

Help
Need to confirm this is expected and not specific to my own machine.