Your current environment
The output of python collect_env.py
=== System Environment Information ===
Date: 2025-09-10 12:07:34.815684
Python version: 3.8.10 (default, Mar 18 2025, 20:04:55)
[GCC 9.4.0]
Platform: Linux-5.15.0-139-generic-x86_64-with-glibc2.29
Architecture: ('64bit', 'ELF')
Machine: x86_64
Processor: x86_64
System: Linux
Release: 5.15.0-139-generic
Version: #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025
PyTorch version: 2.4.1+cu121
CUDA available: False
nvidia-smi not available
nvcc not available
=== Environment Variables ===
CUDA_VISIBLE_DEVICES: Not set
CUDA_HOME: Not set
CUDA_PATH: Not set
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH: Not set
PYTHONPATH: Not set
=== Python Packages ===
torch: 2.4.1+cu121
transformers: Not installed
accelerate: Not installed
xformers: Not installed
flash-attn: Not installed
triton: 3.0.0
🐛 Describe the bug
Missing Backend Request Cancellation in AsyncLLMEngine.generate() Method
When asyncio.CancelledError is caught, the current implementation only calls self.abort(request_id), which handles client-side stream cleanup but does not cancel the actual backend processing. This leads to:
- Resource Waste: The backend continues processing requests that are no longer needed
- Memory Leaks: Accumulated abandoned requests can cause memory issues
- Performance Degradation: Resources tied up with cancelled requests slow down processing for active users
- Inefficient Resource Utilization: GPU/CPU cycles are wasted on computations that won't be consumed
The backend engine has its own abort_request() method, which should be called to stop processing immediately.
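A minimal sketch of the kind of handling described above, assuming the async engine keeps a reference to the synchronous engine as self.engine and that abort_request() takes the request id; the method and attribute names here are illustrative and may differ across vLLM versions:

```python
import asyncio

# Sketch only: shows where the extra backend abort would go inside a
# generate()-style coroutine; _stream_outputs() is a placeholder for the
# real internal iteration logic, not an actual vLLM method.
async def generate(self, prompt, sampling_params, request_id):
    try:
        async for output in self._stream_outputs(prompt, sampling_params, request_id):
            yield output
    except asyncio.CancelledError:
        # Current behavior: clean up the client-side stream only.
        self.abort(request_id)
        # Proposed addition: also stop backend processing right away
        # (assumes self.engine exposes the synchronous abort_request()).
        self.engine.abort_request(request_id)
        raise
```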
This issue is particularly critical in production environments where:
- Users frequently disconnect from long-running requests
- High request volumes can lead to resource exhaustion
- Cost optimization is important for GPU-based inference workloads
- Real-time applications require immediate resource cleanup
To reproduce:
- A vLLM inference server was created with the gemma-2b model in CPU mode.
- A request was sent and then cancelled by the client.
- After the cancellation, the predictor pod continues showing logs like: 05:37:55: Running: 1 reqs, Avg generation throughput: 7.8 tokens/s
- The request finally completes at 05:40:58 (more than 3 minutes later).
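For reference, a hypothetical client that reproduces the disconnect, assuming the server is running with the OpenAI-compatible API on localhost:8000 and serves the model under the name google/gemma-2b (adjust both to match the actual deployment):

```python
import requests

# Open a streaming completion, read a few chunks, then drop the connection
# to simulate a client that cancels mid-generation.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "google/gemma-2b",  # assumed model name
        "prompt": "Write a long story about a lighthouse keeper.",
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,
)

for i, line in enumerate(resp.iter_lines()):
    if i >= 5:
        resp.close()  # abandon the stream; the server should abort the request
        break
```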
Expected Behavior
When a client disconnects during streaming:
- The HTTP request context should be cancelled
- The vLLM runtime should immediately stop token generation
- The predictor pod should release resources and update metrics
- The request should be marked as cancelled in the logs
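As a sketch of how a serving layer could satisfy the expectations above, the following hypothetical FastAPI handler polls for client disconnects while streaming and aborts the engine-side request as soon as one is detected. AsyncLLMEngine.generate()/abort() are used as in recent vLLM versions; the endpoint, model name, and sampling parameters are assumptions, not the actual vLLM server code:

```python
import asyncio
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="google/gemma-2b"))


@app.post("/generate")
async def generate(request: Request) -> StreamingResponse:
    payload = await request.json()
    request_id = str(uuid.uuid4())
    params = SamplingParams(max_tokens=payload.get("max_tokens", 512))

    async def stream():
        try:
            async for output in engine.generate(payload["prompt"], params, request_id):
                if await request.is_disconnected():
                    # Client is gone: stop backend token generation immediately.
                    await engine.abort(request_id)
                    return
                # RequestOutput carries the cumulative text generated so far.
                yield output.outputs[0].text
        except asyncio.CancelledError:
            # The HTTP task itself was cancelled: abort the engine request too.
            await engine.abort(request_id)
            raise

    return StreamingResponse(stream(), media_type="text/plain")
```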
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.