[Bugfix][ptd_eagle] Fix buffer overflow in PTD EAGLE speculative decoding #1
Conversation
[Bugfix][ptd_eagle] Fix buffer overflow in PTD EAGLE speculative decoding

Parallel draft methods (PTD EAGLE) generate K draft tokens in a single forward pass using mask tokens, which requires larger buffers than sequential drafting. The inherited buffer allocation formula was insufficient, causing crashes under load.

Bug manifestation:
- Sequential EAGLE needs max_num_batched_tokens + max_num_seqs tokens
- Parallel draft needs max_num_batched_tokens + max_num_seqs * num_speculative_tokens tokens
- Error: "AssertionError: Shape: 8213 out of considered ranges: [(1, 8192)]"

This fix addresses two critical issues:

1. Buffer Allocation (ptd_eagle.py):
   - Corrects the max_num_tokens formula for the parallel draft generation pattern
   - Reallocates all buffers (input_ids, positions, hidden_states, slot_buffer)
   - Adds ~6MB of memory overhead (negligible for a 3-4x speedup)

2. Compilation Ranges (vllm.py):
   - Extends compile_ranges_split_points when parallel_draft=True
   - Ensures CUDA graph compilation handles the expanded token counts
   - Adds informative logging for parallel draft detection

The bug was caught during benchmarking with 100 prompts (1600 input, 600 output tokens), where the batch reached 8192 tokens + 7 requests * 3 masks = 8213 tokens, exceeding the compilation range of 8192.

Tested-by: Load testing with max batch size configurations
Signed-off-by: Li Zhang <lzhanga@amazon.com>

Simplify updates to eagle files

Signed-off-by: Li Zhang <lzhanga@amazon.com>

Minor format updates
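The two allocation formulas above can be sketched as a single helper. This is an illustrative sketch, not the actual vLLM code; the function and parameter names are assumptions chosen to match the terms in the commit message:

```python
def draft_buffer_tokens(max_num_batched_tokens: int,
                        max_num_seqs: int,
                        num_speculative_tokens: int,
                        parallel_draft: bool) -> int:
    """Upper bound on tokens a drafter may process in one forward pass."""
    if parallel_draft:
        # PTD EAGLE appends num_speculative_tokens mask tokens per sequence,
        # so every sequence can contribute that many extra positions.
        return max_num_batched_tokens + max_num_seqs * num_speculative_tokens
    # Sequential EAGLE only needs one extra token per sequence.
    return max_num_batched_tokens + max_num_seqs

# The failing configuration from the bug report: 8192-token batch,
# 7 in-flight requests, 3 speculative (mask) tokens each.
sequential = draft_buffer_tokens(8192, 7, 3, parallel_draft=False)
parallel = draft_buffer_tokens(8192, 7, 3, parallel_draft=True)
```

With the inherited (sequential) formula the buffer tops out at 8199 tokens, while the parallel draft path can legitimately produce 8213, which is what overflowed.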
Previous bug: when running
`vllm bench serve --backend vllm --served-model-name gpt-oss-120b --endpoint /v1/completions --dataset-name random --random-input-len 1600 --random-output-len 600 --num-prompts 100`, the server would crash and throw an error.

Parallel draft methods (PTD EAGLE) generate K draft tokens in a single forward pass using mask tokens, which requires larger buffers than sequential drafting. The inherited buffer allocation formula was insufficient, causing crashes under load.
Bug manifestation:

- Sequential EAGLE needs max_num_batched_tokens + max_num_seqs tokens
- Parallel draft needs max_num_batched_tokens + max_num_seqs * num_speculative_tokens tokens
- Error: "AssertionError: Shape: 8213 out of considered ranges: [(1, 8192)]"

This fix addresses two critical issues:

1. Buffer Allocation (ptd_eagle.py):
   - Corrects the max_num_tokens formula for the parallel draft generation pattern
   - Reallocates all buffers (input_ids, positions, hidden_states, slot_buffer)
   - Adds ~6MB of memory overhead (negligible for a 3-4x speedup)

2. Compilation Ranges (vllm.py):
   - Extends compile_ranges_split_points when parallel_draft=True
   - Ensures CUDA graph compilation handles the expanded token counts
   - Adds informative logging for parallel draft detection
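The idea behind extending compile_ranges_split_points can be sketched as follows. This is a simplified illustration with assumed names; the actual change in vllm.py derives these values from the engine configuration rather than taking them as plain arguments:

```python
def extend_split_points(split_points: list[int],
                        max_num_batched_tokens: int,
                        max_num_seqs: int,
                        num_speculative_tokens: int,
                        parallel_draft: bool) -> list[int]:
    """Add a split point covering the parallel-draft token count.

    Without this, compilation ranges stop at max_num_batched_tokens
    (e.g. 8192) and a parallel-draft batch of 8213 tokens falls
    outside every compiled range.
    """
    if not parallel_draft:
        return list(split_points)
    expanded = max_num_batched_tokens + max_num_seqs * num_speculative_tokens
    return sorted(set(split_points) | {expanded})
```

For the configuration in the bug report, `extend_split_points([8192], 8192, 7, 3, True)` yields `[8192, 8213]`, so shapes up to 8213 tokens get a compiled range.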
The bug was caught during benchmarking with 100 prompts (1600 input, 600 output tokens), where the batch reached 8192 tokens + 7 requests * 3 masks = 8213 tokens, exceeding the compilation range upper bound of 8192.
Tested-by: Load testing with max batch size configurations