Batch generation refactoring and various fixes #1072
Conversation
**Code Review**

Went through the full diff. The refactor is well-structured -- the separation into …

**Model fixes -- correct**

The qwen3_5.py conv state fix is right. The per-sequence SequenceStateMachine is a nice upgrade. The Aho-Corasick trie for multi-token stop sequence matching is a significant improvement over the flat …

**Potential bugs**

1. In the …

   ```python
   longer = tokens[:index] + best  # TypeError: can only concatenate list to list
   ```

   Suggest guarding: …

2. ```python
   def pop(self):
       i = 0
       while i + 1 < len(self._ordering):
           ...
       return lru_b.popleft()  # IndexError if all tiers empty
   ```

   Called from …

3. Server tokenizer variable scoping (low). In …

4. The …

**Design question -- thinking checkpoint detection**

…

**The BatchRotatingKVCache.merge() fix**

The switch from …

Overall this is solid. The right-pad prompt / left-pad decode split is a meaningful compute savings for mixed-length batches, and the StateMachine is the right abstraction for token sequence detection.
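The multi-token stop-sequence matching praised above can be sketched as a small Aho-Corasick automaton over token IDs. This is an illustrative sketch only; `StopTrie` and its method names are not taken from the PR:

```python
from collections import deque

class StopTrie:
    """Aho-Corasick automaton over token IDs (illustrative sketch).
    Feed tokens one at a time; step() returns the length of the longest
    stop sequence ending at the current token, or 0 if none matched."""

    def __init__(self, stop_sequences):
        self.next = [{}]   # goto: per-state {token_id: next_state}
        self.fail = [0]    # failure links back into the trie
        self.out = [0]     # length of a stop sequence ending at this state
        for seq in stop_sequences:
            state = 0
            for tok in seq:
                if tok not in self.next[state]:
                    self.next.append({})
                    self.fail.append(0)
                    self.out.append(0)
                    self.next[state][tok] = len(self.next) - 1
                state = self.next[state][tok]
            self.out[state] = len(seq)
        # BFS from the root fills failure links and propagates matches,
        # so overlapping stop sequences are still detected.
        queue = deque(self.next[0].values())
        while queue:
            s = queue.popleft()
            for tok, t in self.next[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and tok not in self.next[f]:
                    f = self.fail[f]
                self.fail[t] = self.next[f].get(tok, 0)
                self.out[t] = max(self.out[t], self.out[self.fail[t]])
        self.state = 0

    def step(self, tok):
        # Follow failure links until the token extends some suffix.
        while self.state and tok not in self.next[self.state]:
            self.state = self.fail[self.state]
        self.state = self.next[self.state].get(tok, 0)
        return self.out[self.state]
```

The advantage over re-scanning a flat list of stop sequences on every generated token is that each token is processed in amortized constant time regardless of how many stop sequences are registered.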
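For the concatenation in bug 1, a minimal guard could coerce the right-hand operand, assuming `best` may arrive as a tuple or other sequence rather than a list (the helper name here is hypothetical, not from the PR):

```python
def splice_tokens(tokens, index, best):
    # list(best) ensures list + non-list never raises
    # "TypeError: can only concatenate list (not 'tuple') to list"
    return tokens[:index] + list(best)
```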
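For the `pop` in bug 2, one way to avoid the IndexError when every tier is empty is to skip empty tiers and return a sentinel instead of unconditionally calling `popleft()` on the last one. A sketch under assumed names (the class and its fields are hypothetical, not the PR's):

```python
from collections import deque

class TieredLRU:
    """Hypothetical multi-tier queue illustrating a crash-free pop."""

    def __init__(self, num_tiers):
        self._tiers = [deque() for _ in range(num_tiers)]

    def push(self, tier, item):
        self._tiers[tier].append(item)

    def pop(self):
        # Walk tiers in priority order, skipping empty ones.
        for tier in self._tiers:
            if tier:
                return tier.popleft()
        return None  # all tiers empty: signal the caller rather than crash
```

Whether `None` or a clearer exception is the right signal depends on what the caller does with the result.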
*force-pushed from e943e7c to ee44cd2*
Reviewed the latest commits (efb30e2..ee44cd2). The Qwen3.5 and GDN batching fixes look correct: …
I run Qwen3.5-122B-A10B (12 attention + 36 GDN layers) in continuous batching on M2 Ultra. These fixes address the exact hybrid cache issues I've been patching around in vllm-mlx. Looking forward to rebasing on this when it lands.
This PR refactors the batch generator.
Important bug-fixes
I will link this PR from the open issues it fixes, rather than the other way around, as I think that will be simpler.
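As context for the refactor, the review above calls out the right-pad prompt / left-pad decode split. A toy sketch of the two layouts (an assumed illustration with a made-up `PAD` id, not code from this PR):

```python
PAD = 0  # assumed pad token id for illustration

def right_pad(prompts):
    # Prefill layout: real tokens left-aligned, padding on the right,
    # so each prompt occupies a contiguous prefix of its row.
    max_len = max(len(p) for p in prompts)
    return [list(p) + [PAD] * (max_len - len(p)) for p in prompts]

def left_pad(prompts):
    # Decode layout: padding on the left, so the newest token of every
    # sequence sits in the same last column and one decode step
    # serves the whole batch regardless of per-row length.
    max_len = max(len(p) for p in prompts)
    return [[PAD] * (max_len - len(p)) + list(p) for p in prompts]
```

With mixed-length batches, right-padded prefill avoids attending past each prompt's real tokens, while left-padded decode keeps every sequence's current position aligned, which is where the compute savings the reviewer mentions come from.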