
[Bug]: vLLM Runtime Fails to Honor Context Cancellation During Streaming #24584

@hardikmenger

Description

Your current environment

The output of python collect_env.py
=== System Environment Information ===
Date: 2025-09-10 12:07:34.815684
Python version: 3.8.10 (default, Mar 18 2025, 20:04:55) 
[GCC 9.4.0]
Platform: Linux-5.15.0-139-generic-x86_64-with-glibc2.29
Architecture: ('64bit', 'ELF')
Machine: x86_64
Processor: x86_64
System: Linux
Release: 5.15.0-139-generic
Version: #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025
PyTorch version: 2.4.1+cu121
CUDA available: False
nvidia-smi not available
nvcc not available

=== Environment Variables ===
CUDA_VISIBLE_DEVICES: Not set
CUDA_HOME: Not set
CUDA_PATH: Not set
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH: Not set
PYTHONPATH: Not set

=== Python Packages ===
torch: 2.4.1+cu121
transformers: Not installed
accelerate: Not installed
xformers: Not installed
flash-attn: Not installed
triton: 3.0.0

🐛 Describe the bug

Missing Backend Request Cancellation in AsyncLLMEngine.generate() Method
When asyncio.CancelledError is caught, the current implementation only calls self.abort(request_id), which handles client-side stream cleanup but does not cancel the actual backend processing. This leads to:

  • Resource Waste: The backend continues processing requests that are no longer needed
  • Memory Leaks: Accumulated abandoned requests can cause memory issues
  • Performance Degradation: Resources tied up with cancelled requests slow down processing for active users
  • Inefficient Resource Utilization: GPU/CPU cycles are wasted on computations that won't be consumed

The backend engine has its own abort_request() method, which should be called so that processing stops immediately; a sketch of that pattern follows.
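
A minimal sketch of the suggested pattern, assuming the method names quoted in this report (abort(), abort_request()) and a simplified stand-in engine rather than vLLM's actual internals:

```python
import asyncio


class _BackendEngineStub:
    """Hypothetical stand-in for the backend engine referenced above."""

    def abort_request(self, request_id: str) -> None:
        print(f"backend: aborted {request_id}")


class AsyncEngineSketch:
    """Illustrates the cancellation path; not vLLM's real AsyncLLMEngine."""

    def __init__(self) -> None:
        self.engine = _BackendEngineStub()

    def abort(self, request_id: str) -> None:
        # Client-side stream cleanup only, as described in this report.
        print(f"frontend: stream for {request_id} closed")

    async def generate(self, prompt: str, request_id: str):
        try:
            for i in range(1_000_000):              # stand-in for token generation
                await asyncio.sleep(0.01)
                yield f"token-{i}"
        except asyncio.CancelledError:
            self.abort(request_id)                  # current behaviour: frontend cleanup
            self.engine.abort_request(request_id)   # proposed: also stop backend work
            raise
```

Cancelling the task that is awaiting the next item from generate() raises CancelledError inside the generator, so both cleanup calls run before the cancellation propagates.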

This issue is particularly critical in production environments where:

  • Users frequently disconnect from long-running requests
  • High request volumes can lead to resource exhaustion
  • Cost optimization is important for GPU-based inference workloads
  • Real-time applications require immediate resource cleanup

To reproduce:

  • A vLLM inference server was started with the gemma-2b model in CPU mode
  • A streaming request was cancelled by the client
  • After the cancellation, the predictor pod continues showing logs like: 05:37:55: Running: 1 reqs, Avg generation throughput: 7.8 tokens/s
  • The request finally completes at 05:40:58 (3+ minutes later)
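
A client-side sketch of the disconnect scenario, assuming vLLM's OpenAI-compatible /v1/completions endpoint is exposed on localhost:8000; the URL, model name, and prompt are placeholders for whatever the deployment actually serves:

```python
import asyncio

import httpx


async def cancel_mid_stream() -> None:
    payload = {
        "model": "google/gemma-2b",   # placeholder; use the model name the server reports
        "prompt": "Write a very long story about a predictor pod",
        "max_tokens": 4096,
        "stream": True,
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", "http://localhost:8000/v1/completions", json=payload
        ) as response:
            chunks = 0
            async for _ in response.aiter_lines():
                chunks += 1
                if chunks >= 5:
                    break   # drop the connection after a few streamed chunks


if __name__ == "__main__":
    asyncio.run(cancel_mid_stream())
```

After the client drops the connection, the predictor pod should stop generating; per this report it instead keeps logging Running: 1 reqs for minutes.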

Expected Behavior
When a client disconnects during streaming:

  • The HTTP request context should be cancelled
  • The vLLM runtime should immediately stop token generation
  • The predictor pod should release resources and update metrics
  • The request should be marked as cancelled in the logs
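
One way this behaviour could be wired into a FastAPI-style streaming endpoint is sketched below; the engine handle, its abort_request() hook, and the fixed request id are hypothetical placeholders, not vLLM's serving code:

```python
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


class _EngineStub:
    """Hypothetical engine handle standing in for the real backend."""

    async def generate(self, prompt: str, request_id: str):
        while True:                      # endless token stream for illustration
            await asyncio.sleep(0.05)
            yield "token "

    def abort_request(self, request_id: str) -> None:
        print(f"aborted {request_id}")   # backend would free resources, update metrics


engine = _EngineStub()


@app.post("/generate")
async def generate_endpoint(request: Request):
    request_id = "req-0"                 # placeholder; real servers assign one per request

    async def stream():
        try:
            async for token in engine.generate("some prompt", request_id):
                if await request.is_disconnected():      # client gone: stop immediately
                    engine.abort_request(request_id)
                    break
                yield token
        except asyncio.CancelledError:
            # The server may also cancel the streaming task on disconnect;
            # propagate the abort to the backend in that path too.
            engine.abort_request(request_id)
            raise

    return StreamingResponse(stream(), media_type="text/plain")
```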

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
