
Fix Qwen3.5 hybrid paged cache reconstruction #6

Merged
krystophny merged 1 commit into main from fix/qwen35-hybrid-paged-cache on Mar 24, 2026

Fix Qwen3.5 hybrid paged cache reconstruction#6
krystophny merged 1 commit intomainfrom
fix/qwen35-hybrid-paged-cache

Conversation

@krystophny
Collaborator

Summary

  • fix load_model_with_fallback() so batched startup returns the successful mlx_lm.load() result instead of None
  • store hybrid recurrent cache layers as sequence-boundary snapshots while keeping block-wise concatenation for KV layers
  • add regression tests for the tokenizer return path and the Qwen3.5 hybrid paged-cache boundary handling

This includes the one-line tokenizer fix from #2 because batched Qwen3.5 startup still fails on main without it. If #2 merges first, this branch can be rebased and the duplicate hunk dropped.
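For context, a minimal sketch of the missing-return bug: the function name matches the PR, but the body and signature here are simplified assumptions, not the actual vllm-mlx code.

```python
import types

# Stand-in for the real mlx_lm module so the sketch is self-contained.
mlx_lm = types.SimpleNamespace(load=None)

def load_model_with_fallback(model_path):
    """Simplified loader; the real function also handles fallback paths."""
    result = mlx_lm.load(model_path)
    # The bug on main: the batched-startup path fell through here without
    # returning, so `model, tokenizer = load_model_with_fallback(...)`
    # raised "cannot unpack non-iterable NoneType object".
    return result  # the fix: propagate the (model, tokenizer) tuple

fake_model, fake_tokenizer = object(), object()
mlx_lm.load = lambda path: (fake_model, fake_tokenizer)
model, tokenizer = load_model_with_fallback("mlx-community/Qwen3.5-4B")
assert model is fake_model and tokenizer is fake_tokenizer
```

This mirrors the regression test below, which patches `mlx_lm.load` and asserts the tuple comes back through.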

Verification

Test fails on main

$ PYTHONPATH=/tmp/vllm-mlx-fork-main "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-regression-XXXX.py -q
FF                                                                       [100%]
=================================== FAILURES ===================================
___________ test_load_model_with_fallback_returns_successful_result ____________

    def test_load_model_with_fallback_returns_successful_result():
        fake_model = object()
        fake_tokenizer = object()
        with patch("mlx_lm.load", return_value=(fake_model, fake_tokenizer)):
>           model, tokenizer = load_model_with_fallback("mlx-community/Qwen3.5-4B")
            ^^^^^^^^^^^^^^^^
E           TypeError: cannot unpack non-iterable NoneType object

/tmp/vllm-mlx-regression-XXXX.py:14: TypeError
___________________ test_hybrid_boundary_snapshot_round_trip ___________________

    def test_hybrid_boundary_snapshot_round_trip():
        paged_manager = PagedCacheManager(block_size=4, max_blocks=10)
        cache = BlockAwarePrefixCache(model=None, paged_cache_manager=paged_manager)
        extracted = [
            {
                "state": (
                    mx.arange(1 * 2 * 8 * 3).reshape(1, 2, 8, 3),
                    mx.arange(1000, 1000 + (1 * 2 * 8 * 3)).reshape(1, 2, 8, 3),
                ),
                "meta_state": "",
                "class_ref": KVCache,
                "class_name": "KVCache",
            },
            {
                "state": [
                    mx.arange(1 * 3 * 8).reshape(1, 3, 8),
                    mx.arange(2000, 2000 + (1 * 2 * 4 * 4)).reshape(1, 2, 4, 4),
                ],
                "meta_state": "",
                "class_ref": ArraysCache,
                "class_name": "ArraysCache",
            },
        ]
        block_table = cache.store_cache("req-1", list(range(8)), extracted)
        first_block = paged_manager.allocated_blocks[block_table.block_ids[0]]
>       assert first_block.cache_data[1] is None
               ^^^^^^^^^^^^^^^^^^^^^^^^^
E       TypeError: 'NoneType' object is not subscriptable

/tmp/vllm-mlx-regression-XXXX.py:44: TypeError
------------------------------ Captured log call -------------------------------
WARNING  vllm_mlx.prefix_cache:prefix_cache.py:679 Failed to extract block tensor slice: Too many indices for array with 3 dimensions.
WARNING  vllm_mlx.prefix_cache:prefix_cache.py:679 Failed to extract block tensor slice: Too many indices for array with 3 dimensions.
=========================== short test summary info ============================
FAILED ../../../tmp/vllm-mlx-regression-XXXX.py::test_load_model_with_fallback_returns_successful_result
FAILED ../../../tmp/vllm-mlx-regression-XXXX.py::test_hybrid_boundary_snapshot_round_trip
2 failed, 2 warnings in 2.30s
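The second failure above asserts that per-block cache data holds None for the recurrent (ArraysCache) layer. A toy sketch of that split follows; the structure and names are illustrative assumptions, not the actual vllm-mlx block layout.

```python
BLOCK_SIZE = 4

def split_into_blocks(extracted_layers, num_tokens):
    """Per block: slice KV layers along the token axis; store None for
    recurrent layers, whose fixed-size state has no token axis to slice."""
    blocks = []
    for start in range(0, num_tokens, BLOCK_SIZE):
        per_layer = []
        for layer in extracted_layers:
            if layer["class_name"] == "KVCache":
                keys, values = layer["state"]
                per_layer.append((keys[start:start + BLOCK_SIZE],
                                  values[start:start + BLOCK_SIZE]))
            else:
                # e.g. ArraysCache: snapshotted at the sequence boundary
                # instead of being concatenated block-wise.
                per_layer.append(None)
        blocks.append(per_layer)
    return blocks

# KV state: one entry per token; recurrent state: fixed-size, token-free.
kv_keys = [f"k{i}" for i in range(8)]
kv_vals = [f"v{i}" for i in range(8)]
extracted = [
    {"class_name": "KVCache", "state": (kv_keys, kv_vals)},
    {"class_name": "ArraysCache", "state": ["conv-state", "ssm-state"]},
]
blocks = split_into_blocks(extracted, num_tokens=8)
assert blocks[0][0] == (["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"])
assert blocks[0][1] is None  # matches the regression's cache_data[1] check
# The full recurrent state is kept once, as a sequence-boundary snapshot:
boundary_snapshot = extracted[1]["state"]
```

Slicing the recurrent state per block is what produced the "Too many indices for array with 3 dimensions" warnings in the log above; the snapshot approach sidesteps that entirely.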

Test passes after fix

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-regression-XXXX.py -q
..                                                                       [100%]
2 passed, 2 warnings in 2.28s

Additional coverage

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-fork/tests/test_paged_cache.py -q
39 passed in 2.13s

$ PYTHONPATH=/tmp/vllm-mlx-fork "$HOME/Library/Application Support/tabura/llm/venv/bin/python" -m pytest /tmp/vllm-mlx-fork/tests/test_tokenizer_utils.py -q
2 passed in 2.04s

