Your current environment
- GPU driver: NVIDIA 550.127.05, CUDA version: 12.4
- vLLM versions: v0.10.2 and v0.11.0
- Model: GPT-OSS 120B
- Streaming: True
- --enforce-eager: Tested
🐛 Describe the bug
Issue:
When using Streaming=True, some tokens are missing or arrive in scrambled order.
- With Streaming=False, all tokens are generated correctly.
- Using --enforce-eager produces the correct token sequence but significantly slows down generation.
This issue occurs in both v0.10.2 and v0.11.0.
Expected behavior:
Streaming should produce all tokens in the correct order, as it does with --enforce-eager, but without the performance penalty.
Steps to reproduce:
- Run GPT-OSS 120B with vLLM v0.11.0 (or v0.10.2)
- Enable streaming (Streaming=True)
- Generate text and observe missing or scrambled tokens (a minimal client sketch follows below)
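A minimal client sketch for reproducing the streaming behaviour, assuming the server is started with the command in the Command Example section (OpenAI-compatible API on localhost:20003); the model name, prompt, and base_url are placeholders to adjust as needed:

# Sketch: stream a chat completion and print deltas as they arrive.
# Assumes the OpenAI-compatible endpoint on localhost:20003 (see Command Example).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:20003/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    stream=True,
)

pieces = []
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""  # delta.content can be None
    pieces.append(delta)
    print(delta, end="", flush=True)

print("\n--- streamed characters:", sum(len(p) for p in pieces))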
Additional notes:
The problem appears to be specific to asynchronous streaming: eager execution (--enforce-eager) ensures the correct token order, but at reduced generation speed.
Command Example
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser openai \
--reasoning-parser openai_gptoss \
--enable-auto-tool-choice \
--async-scheduling \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--max-num-batched-tokens 8192 \
--port 20003
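For reference, a sketch that compares the accumulated streamed text with the non-streamed completion for the same prompt (same assumptions as the sketch above; temperature=0 is used only to keep the two outputs comparable):

# Sketch: compare streamed vs. non-streamed output for the same prompt.
# Assumes the OpenAI-compatible endpoint on localhost:20003 (see Command Example).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:20003/v1", api_key="EMPTY")
params = dict(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "List the first ten prime numbers."}],
    temperature=0.0,
)

full = client.chat.completions.create(**params).choices[0].message.content

streamed = "".join(
    chunk.choices[0].delta.content or ""
    for chunk in client.chat.completions.create(stream=True, **params)
)

print("streamed == non-streamed:", streamed == full)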