
Fix Qwen3.5 hybrid paged cache reconstruction #6

Merged
krystophny merged 1 commit into main from fix/qwen35-hybrid-paged-cache on Mar 24, 2026

Fix Qwen3.5 hybrid paged cache reconstruction#6
krystophny merged 1 commit intomainfrom
fix/qwen35-hybrid-paged-cache

Conversation

@krystophny
Collaborator

Summary

  • fix load_model_with_fallback() so batched startup returns the successful mlx_lm.load() result instead of None
  • store hybrid recurrent cache layers as sequence-boundary snapshots while keeping block-wise concatenation for KV layers
  • add regression tests for the tokenizer return path and the Qwen3.5 hybrid paged-cache boundary handling

This includes the one-line tokenizer fix from #2 because batched Qwen3.5 startup still fails on main without it. If #2 merges first, this branch can be rebased and the duplicate hunk dropped.
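For context, a minimal sketch of the missing-return bug: the function name matches the PR, but the body and signature here are simplified assumptions, not the actual vllm-mlx code.

```python
import types

# Stand-in for the real mlx_lm module so the sketch is self-contained.
mlx_lm = types.SimpleNamespace(load=None)

def load_model_with_fallback(model_path):
    """Simplified loader; the real function also handles fallback paths."""
    result = mlx_lm.load(model_path)
    # The bug on main: the batched-startup path fell through here without
    # returning, so `model, tokenizer = load_model_with_fallback(...)`
    # raised "cannot unpack non-iterable NoneType object".
    return result  # the fix: propagate the (model, tokenizer) tuple

fake_model, fake_tokenizer = object(), object()
mlx_lm.load = lambda path: (fake_model, fake_tokenizer)
model, tokenizer = load_model_with_fallback("mlx-community/Qwen3.5-4B")
assert model is fake_model and tokenizer is fake_tokenizer
```

This mirrors the regression test below, which patches `mlx_lm.load` and asserts the tuple comes back through.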

Verification

Test fails on main

$ PYTHONPATH=/tmp/vllm-mlx-fork-main "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-regression-XXXX.py -q
FF                                                                       [100%]
=================================== FAILURES ===================================
___________ test_load_model_with_fallback_returns_successful_result ____________

    def test_load_model_with_fallback_returns_successful_result():
        fake_model = object()
        fake_tokenizer = object()
        with patch("mlx_lm.load", return_value=(fake_model, fake_tokenizer)):
>           model, tokenizer = load_model_with_fallback("mlx-community/Qwen3.5-4B")
            ^^^^^^^^^^^^^^^^
E           TypeError: cannot unpack non-iterable NoneType object

/tmp/vllm-mlx-regression-XXXX.py:14: TypeError
___________________ test_hybrid_boundary_snapshot_round_trip ___________________

    def test_hybrid_boundary_snapshot_round_trip():
        paged_manager = PagedCacheManager(block_size=4, max_blocks=10)
        cache = BlockAwarePrefixCache(model=None, paged_cache_manager=paged_manager)
        extracted = [
            {
                "state": (
                    mx.arange(1 * 2 * 8 * 3).reshape(1, 2, 8, 3),
                    mx.arange(1000, 1000 + (1 * 2 * 8 * 3)).reshape(1, 2, 8, 3),
                ),
                "meta_state": "",
                "class_ref": KVCache,
                "class_name": "KVCache",
            },
            {
                "state": [
                    mx.arange(1 * 3 * 8).reshape(1, 3, 8),
                    mx.arange(2000, 2000 + (1 * 2 * 4 * 4)).reshape(1, 2, 4, 4),
                ],
                "meta_state": "",
                "class_ref": ArraysCache,
                "class_name": "ArraysCache",
            },
        ]
        block_table = cache.store_cache("req-1", list(range(8)), extracted)
        first_block = paged_manager.allocated_blocks[block_table.block_ids[0]]
>       assert first_block.cache_data[1] is None
               ^^^^^^^^^^^^^^^^^^^^^^^^^
E       TypeError: 'NoneType' object is not subscriptable

/tmp/vllm-mlx-regression-XXXX.py:44: TypeError
------------------------------ Captured log call -------------------------------
WARNING  vllm_mlx.prefix_cache:prefix_cache.py:679 Failed to extract block tensor slice: Too many indices for array with 3 dimensions.
WARNING  vllm_mlx.prefix_cache:prefix_cache.py:679 Failed to extract block tensor slice: Too many indices for array with 3 dimensions.
=========================== short test summary info ============================
FAILED ../../../tmp/vllm-mlx-regression-XXXX.py::test_load_model_with_fallback_returns_successful_result
FAILED ../../../tmp/vllm-mlx-regression-XXXX.py::test_hybrid_boundary_snapshot_round_trip
2 failed, 2 warnings in 2.30s
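The second failure above asserts that per-block cache data holds None for the recurrent (ArraysCache) layer. A toy sketch of that split follows; the structure and names are illustrative assumptions, not the actual vllm-mlx block layout.

```python
BLOCK_SIZE = 4

def split_into_blocks(extracted_layers, num_tokens):
    """Per block: slice KV layers along the token axis; store None for
    recurrent layers, whose fixed-size state has no token axis to slice."""
    blocks = []
    for start in range(0, num_tokens, BLOCK_SIZE):
        per_layer = []
        for layer in extracted_layers:
            if layer["class_name"] == "KVCache":
                keys, values = layer["state"]
                per_layer.append((keys[start:start + BLOCK_SIZE],
                                  values[start:start + BLOCK_SIZE]))
            else:
                # e.g. ArraysCache: snapshotted at the sequence boundary
                # instead of being concatenated block-wise.
                per_layer.append(None)
        blocks.append(per_layer)
    return blocks

# KV state: one entry per token; recurrent state: fixed-size, token-free.
kv_keys = [f"k{i}" for i in range(8)]
kv_vals = [f"v{i}" for i in range(8)]
extracted = [
    {"class_name": "KVCache", "state": (kv_keys, kv_vals)},
    {"class_name": "ArraysCache", "state": ["conv-state", "ssm-state"]},
]
blocks = split_into_blocks(extracted, num_tokens=8)
assert blocks[0][0] == (["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"])
assert blocks[0][1] is None  # matches the regression's cache_data[1] check
# The full recurrent state is kept once, as a sequence-boundary snapshot:
boundary_snapshot = extracted[1]["state"]
```

Slicing the recurrent state per block is what produced the "Too many indices for array with 3 dimensions" warnings in the log above; the snapshot approach sidesteps that entirely.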

Test passes after fix

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-regression-XXXX.py -q
..                                                                       [100%]
2 passed, 2 warnings in 2.28s

Additional coverage

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-fork/tests/test_paged_cache.py -q
39 passed in 2.13s

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-fork/tests/test_tokenizer_utils.py -q
2 passed in 2.04s

