Your current environment
The output of python collect_env.py
=== System Environment Information ===
Date: 2025-09-10 12:07:34.815684
Python version: 3.8.10 (default, Mar 18 2025, 20:04:55)
[GCC 9.4.0]
Platform: Linux-5.15.0-139-generic-x86_64-with-glibc2.29
Architecture: ('64bit', 'ELF')
Machine: x86_64
Processor: x86_64
System: Linux
Release: 5.15.0-139-generic
Version: #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025
PyTorch version: 2.4.1+cu121
CUDA available: False
nvidia-smi not available
nvcc not available
=== Environment Variables ===
CUDA_VISIBLE_DEVICES: Not set
CUDA_HOME: Not set
CUDA_PATH: Not set
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH: Not set
PYTHONPATH: Not set
=== Python Packages ===
torch: 2.4.1+cu121
transformers: Not installed
accelerate: Not installed
xformers: Not installed
flash-attn: Not installed
triton: 3.0.0
🐛 Describe the bug
Missing Backend Request Cancellation in AsyncLLMEngine.generate() Method
When asyncio.CancelledError is caught, the current implementation only calls self.abort(request_id), which handles client-side stream cleanup but does not cancel the actual backend processing. This leads to:
- Resource Waste: The backend continues processing requests that are no longer needed
- Memory Leaks: Accumulated abandoned requests can cause memory issues
- Performance Degradation: Resources tied up with cancelled requests slow down processing for active users
- Inefficient Resource Utilization: GPU/CPU cycles are wasted on computations that won't be consumed
The backend engine has its own abort_request() method, which should be called to stop processing immediately.
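A minimal sketch of the kind of handling described above, assuming the async engine keeps a reference to the synchronous engine as self.engine and that abort_request() takes the request id; the method and attribute names here are illustrative and may differ across vLLM versions:

```python
import asyncio

# Sketch only: shows where the extra backend abort would go inside a
# generate()-style coroutine; _stream_outputs() is a placeholder for the
# real internal iteration logic, not an actual vLLM method.
async def generate(self, prompt, sampling_params, request_id):
    try:
        async for output in self._stream_outputs(prompt, sampling_params, request_id):
            yield output
    except asyncio.CancelledError:
        # Current behavior: clean up the client-side stream only.
        self.abort(request_id)
        # Proposed addition: also stop backend processing right away
        # (assumes self.engine exposes the synchronous abort_request()).
        self.engine.abort_request(request_id)
        raise
```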
This issue is particularly critical in production environments where:
- Users frequently disconnect from long-running requests
- High request volumes can lead to resource exhaustion
- Cost optimization is important for GPU-based inference workloads
- Real-time applications require immediate resource cleanup
To reproduce:
- A vLLM inference server was created with the gemma-2b model in CPU mode.
- A request was sent and then cancelled by the client.
- After the cancellation, the predictor pod continues showing logs like: 05:37:55: Running: 1 reqs, Avg generation throughput: 7.8 tokens/s
- The request finally completes at 05:40:58 (more than 3 minutes later).
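For reference, a hypothetical client that reproduces the disconnect, assuming the server is running with the OpenAI-compatible API on localhost:8000 and serves the model under the name google/gemma-2b (adjust both to match the actual deployment):

```python
import requests

# Open a streaming completion, read a few chunks, then drop the connection
# to simulate a client that cancels mid-generation.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "google/gemma-2b",  # assumed model name
        "prompt": "Write a long story about a lighthouse keeper.",
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,
)

for i, line in enumerate(resp.iter_lines()):
    if i >= 5:
        resp.close()  # abandon the stream; the server should abort the request
        break
```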
Expected Behavior
When a client disconnects during streaming:
- The HTTP request context should be cancelled
- The vLLM runtime should immediately stop token generation
- The predictor pod should release resources and update metrics
- The request should be marked as cancelled in the logs
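As a sketch of how a serving layer could satisfy the expectations above, the following hypothetical FastAPI handler polls for client disconnects while streaming and aborts the engine-side request as soon as one is detected. AsyncLLMEngine.generate()/abort() are used as in recent vLLM versions; the endpoint, model name, and sampling parameters are assumptions, not the actual vLLM server code:

```python
import asyncio
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="google/gemma-2b"))


@app.post("/generate")
async def generate(request: Request) -> StreamingResponse:
    payload = await request.json()
    request_id = str(uuid.uuid4())
    params = SamplingParams(max_tokens=payload.get("max_tokens", 512))

    async def stream():
        try:
            async for output in engine.generate(payload["prompt"], params, request_id):
                if await request.is_disconnected():
                    # Client is gone: stop backend token generation immediately.
                    await engine.abort(request_id)
                    return
                # RequestOutput carries the cumulative text generated so far.
                yield output.outputs[0].text
        except asyncio.CancelledError:
            # The HTTP task itself was cancelled: abort the engine request too.
            await engine.abort(request_id)
            raise

    return StreamingResponse(stream(), media_type="text/plain")
```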
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.