
[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously#15326

Closed
hidva wants to merge 4 commits into vllm-project:main from hidva:main

Conversation

@hidva
Contributor

@hidva hidva commented Mar 22, 2025

Currently, vLLM allows users to send duplicate request IDs, while numerous modules in EngineCore use the request ID as a dictionary key (for example, KVCacheManager.req_to_blocks). This relies on the assumption that the Frontend always aborts a request before adding a new one with the same request ID:

# req1, req2 have the same request_id.
(EngineCoreRequestType.ADD, req1(request_id=RequestId))
(EngineCoreRequestType.ABORT, req1)
(EngineCoreRequestType.ADD, req2(request_id=RequestId))

Currently, AsyncLLM ensures that a duplicate request ID must first be aborted before it can be added again, via the call chain AsyncLLM._add_request -> OutputProcessor.add_request:

# OutputProcessor.add_request
request_id = request.request_id
if request_id in self.request_states:
    raise ValueError(f"Request id {request_id} already running.")

# AsyncLLM.abort
async def abort(self, request_id: str) -> None:
    """Abort RequestId in OutputProcessor and EngineCore."""

    request_ids = self.output_processor.abort_requests((request_id,))
    # BUG!
    # This operation is not atomic, and there might be a time window during which
    # the request has already been removed from OutputProcessor.request_states,
    # but the corresponding ABORT has not yet been issued to EngineCore.
    await self.engine_core.abort_requests_async(request_ids)

    if self.log_requests:
        logger.info("Aborted request %s.", request_id)

We can easily reproduce the potential bug by enlarging the time window with an await asyncio.sleep(13) inserted at the BUG point:
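The race can also be reproduced outside vLLM. Below is a minimal standalone asyncio sketch (the names are hypothetical stand-ins, not the actual vLLM classes) that models the frontend and engine views as two plain containers; the `sleep(0)` inside `abort()` plays the role of the non-atomic await before the engine RPC:

```python
import asyncio

request_states: dict[str, str] = {}   # frontend view (like OutputProcessor.request_states)
engine_requests: set[str] = set()     # engine view (like EngineCore's tracking)

async def abort(request_id: str) -> None:
    request_states.pop(request_id, None)   # frontend state is gone immediately
    await asyncio.sleep(0)                 # BUG window: yield before the engine RPC
    engine_requests.discard(request_id)    # the late ABORT finally reaches the engine

async def add_request(request_id: str) -> None:
    if request_id in request_states:
        raise ValueError(f"Request id {request_id} already running.")
    request_states[request_id] = "running"
    engine_requests.add(request_id)

async def main() -> None:
    await add_request("req-1")
    abort_task = asyncio.create_task(abort("req-1"))
    await asyncio.sleep(0)        # let abort() run up to its internal await
    await add_request("req-1")    # duplicate check passes: frontend state is gone
    await abort_task              # late ABORT removes the NEW request from the engine
    print("frontend:", "req-1" in request_states,
          "engine:", "req-1" in engine_requests)

asyncio.run(main())
```

After the run, the frontend still believes `req-1` is running while the engine has already dropped it, which is exactly the inconsistency described above.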

To fix this issue, we categorize completed requests into two types:

  • aborted requests, handled by handle_abort_reqs
  • finished requests, handled by _handle_finished_reqs

and ensure that the set of requests visible to the Frontend always includes the set of requests visible to EngineCore.
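One way to restore this visibility invariant can be sketched as follows (a simplified model with hypothetical names, not the PR's actual implementation): the abort path awaits the engine-side ABORT before dropping the frontend state, so any request the engine may still know about remains visible to the frontend's duplicate-ID check.

```python
import asyncio

request_states: dict[str, str] = {}   # frontend view
engine_requests: set[str] = set()     # engine view

async def send_abort_to_engine(request_id: str) -> None:
    # Stand-in for engine_core.abort_requests_async().
    await asyncio.sleep(0)
    engine_requests.discard(request_id)

async def abort(request_id: str) -> None:
    if request_id not in request_states:
        return
    # Issue the ABORT to EngineCore FIRST; only drop the frontend
    # state afterwards, so frontend visibility always covers engine
    # visibility and a duplicate ADD cannot slip into the window.
    await send_abort_to_engine(request_id)
    request_states.pop(request_id, None)

async def main() -> None:
    request_states["req-1"] = "running"
    engine_requests.add("req-1")
    await abort("req-1")
    print("req-1" in request_states, "req-1" in engine_requests)

asyncio.run(main())
```

With this ordering, a concurrent add_request for the same ID would fail the duplicate check until the engine abort has completed, rather than racing past it.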

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 22, 2025
@robertgshaw2-redhat
Collaborator

Thanks for your contribution! I agree that this is a race condition. Appreciate you digging in!

Collaborator

Can we call this something more descriptive? get_parent_and_children_reqs?

Member

It should probably also reflect the fact that the parent request is being removed.

Contributor Author

the fact that the parent request is being removed

Yes. Do you have any good suggestions? How about try_pop_parent?

@robertgshaw2-redhat
Collaborator

Thanks a ton! I reviewed the implementation in detail and you have fixed the problem! Just left some minor comments about naming the functions and comments. Ping me on slack when this is ready!

@njhill
Member

njhill commented Mar 26, 2025

Thanks for this @hidva, I agree with @robertgshaw2-redhat's comments.

However, I was already thinking it might be more robust to have the engine return finished notifications for all requests, including those whose abort is initiated from the front-end process. Currently it just stops sending any outputs for these but we could change it so that there will be a terminating RequestOutput with "aborted" finish_reason in these cases.

Then we can clean up the output processor request states based on these responses rather than the current logic that's a bit disjoint.

Another reason to do this is that in addition to the leak that you pointed out, there may still be a bug where such aborted requests aren't captured properly in the metrics, because _update_stats_from_finished never gets called for them.

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@hidva
Contributor Author

hidva commented Mar 27, 2025

Apologies for the delay; I was on vacation until now. I will continue to follow up on this PR.

@hidva
Contributor Author

hidva commented Mar 27, 2025

the engine return finished notifications for all requests,

However, there are indeed some scenarios where only the frontend can notify the engine to stop outputting, such as the presence of a stop string or when the client disconnects. If we let the engine return finished notifications for all requests, how should the engine be aware of such external conditions like client disconnection?

_update_stats_from_finished never gets called for them.

Yes, we should add a call to _update_stats_from_finished within handle_abort_reqs, and at the same time, ensure that _update_stats_from_finished is idempotent. This way, requests that are aborted due to client disconnection can also be captured properly in the metrics.

In other words, after introducing the concepts of aborted requests and finished requests, we also introduced two interfaces: finish_request() (renamed to free_finished_reqs) and handle_abort_reqs() (renamed to free_aborted_reqs). All finished requests must ultimately call free_finished_reqs() to complete resource cleanup, and similarly, all aborted requests must call free_aborted_reqs(). All resource cleanup should be idempotent. See commit: Unified the resource cleanup for aborted and finished requests
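The idempotency requirement can be sketched like this (the class and field names are hypothetical, loosely modeled on KVCacheManager.req_to_blocks): both termination paths funnel into one freeing routine, and a second call for the same request ID is a no-op.

```python
class RequestCleanup:
    """Sketch of idempotent per-request resource cleanup: every
    termination path may call a free_*() method, and repeated calls
    for the same request ID are no-ops."""

    def __init__(self) -> None:
        self.req_to_blocks: dict[str, list[int]] = {}  # e.g. KV-cache block IDs
        self._freed: set[str] = set()

    def free_finished_reqs(self, request_id: str) -> None:
        self._free(request_id)

    def free_aborted_reqs(self, request_id: str) -> None:
        self._free(request_id)

    def _free(self, request_id: str) -> None:
        if request_id in self._freed:
            return                       # already cleaned up: no-op
        self._freed.add(request_id)
        self.req_to_blocks.pop(request_id, None)

cleanup = RequestCleanup()
cleanup.req_to_blocks["req-1"] = [0, 1, 2]
cleanup.free_aborted_reqs("req-1")    # e.g. client-disconnect path
cleanup.free_finished_reqs("req-1")   # engine-finished path: safe no-op
print(cleanup.req_to_blocks)  # {}
```

Because both paths are safe to call in any order, a request that is aborted by the frontend and also reported finished by the engine is cleaned up exactly once.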

@njhill
Member

njhill commented Mar 27, 2025

Thanks @hidva just to be clear, I think this PR would be good to merge in its current form but that we should consider a follow-on to address the other things I mentioned.

the engine return finished notifications for all requests,

However, there are indeed some scenarios where only the frontend can notify the engine to stop outputting, such as the presence of a stop string or when the client disconnects. If we let the engine return finished notifications for all requests, how should the engine be aware of such external conditions like client disconnection?

The front-end would still initiate the aborts in the same way, i.e. for client disconnection and stop strings. It's just that the engine would now be guaranteed to subsequently return a final RequestOutput for these with aborted finish reason (this will require a change in the engine of course).

_update_stats_from_finished never gets called for them.

Yes, we should add a call to _update_stats_from_finished within handle_abort_reqs, and at the same time, ensure that _update_stats_from_finished is idempotent. This way, requests that are aborted due to client disconnection can also be captured properly in the metrics.

In other words, after introducing the concepts of aborted requests and finished requests, we also introduced two interfaces: finish_request() (renamed to free_finished_reqs) and handle_abort_reqs() (renamed to free_aborted_reqs). All finished requests must ultimately call free_finished_reqs() to complete resource cleanup, and similarly, all aborted requests must call free_aborted_reqs(). All resource cleanup should be idempotent. See commit: Unified the resource cleanup for aborted and finished requests

Regardless of the idempotence I think that it would be nice if we always do the cleanup when receiving the final response for a given request, irrespective of how it was terminated.

@mergify mergify bot removed the tpu Related to Google TPUs label Mar 28, 2025
@hidva
Contributor Author

hidva commented Apr 1, 2025

@njhill Is there anything else that needs to be done for this PR? Also, I'm not sure why the two tests are failing.

@njhill
Member

njhill commented Apr 2, 2025

@hidva it seems that the test is hanging. Could you try merging in the latest main again? It's possible that it's a side-effect of the changes.

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
@github-actions github-actions bot added the stale Over 90 days of inactivity label Jul 15, 2025
@mergify

mergify bot commented Jul 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hidva.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 15, 2025
@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jul 16, 2025
@njhill
Member

njhill commented Jul 16, 2025

@hidva apologies, could you merge in the latest main branch. Hopefully that should also resolve the test failure.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added stale Over 90 days of inactivity and removed unstale Received activity after being labelled stale labels Oct 15, 2025
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Nov 15, 2025
markmc added a commit to markmc/vllm that referenced this pull request Dec 6, 2025
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom
request ID. The motivation for this is that it can be very helpful
when you need to correlate vLLM logs with logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique
key, it obviously is problematic if we ever have multiple in-flight
requests using the same client-provided request ID.

We saw this happening recently when `vllm serve bench` started
including a request ID and the request IDs from multiple concurrent
instances caused collisions. See vllm-project#27723

We try to guard against request ID collisions currently in the
frontend in OutputProcessor:

```
    def add_request(...):
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
```

however, this is not always effective:

1) We can have abort race conditions where a request is no longer
   tracked by the frontend, but still not completed in the engine.
   See vllm-project#15326 for an attempt to fix this.
2) We can have async scheduling race conditions where a request
   ID is removed from the output processor and being scheduled
   while the older request with that ID is still being completed
   by the model runner. See vllm-project#29355
3) With P/D, a request will continue to be tracked by the prefill
   engine long after the prefill request has been completed in
   the frontend, while we wait for the decode side to fetch the
   KV blocks. See vllm-project#20139

Let's instead ensure we use a unique request ID internally, even
when a client provides a custom request ID. We can do this simply
by appending a short random suffix to any request ID provided
by the frontend.

A full 32-character random UUID would be overkill as a suffix,
so how many random characters would be sufficient? 8 hex characters
give us 32 bits of entropy, or 16^8 possible suffixes.

Using the collision probability approximation from
https://preshing.com/20110504/hash-collision-probabilities:

With N = 16^8 and k the number of generated suffixes, the
probability of collision is approximately k^2/(2N). So if a client
somehow caused vLLM to hold 10k requests that reuse the same
client-provided ID, there would be a 1.16% chance of collision:

```
>>> k, N = 10_000, 16**8
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).
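A minimal sketch of such suffixing (the helper name here is hypothetical, not the actual vLLM function):

```python
import uuid

def make_internal_request_id(external_req_id: str) -> str:
    # Append 8 random hex characters (32 bits of entropy) so duplicate
    # client-provided IDs remain unique inside the engine.
    return f"{external_req_id}-{uuid.uuid4().hex[:8]}"

a = make_internal_request_id("client-req")
b = make_internal_request_id("client-req")
print(a[:11], len(a))  # "client-req-" plus an 8-char suffix, length 19
```

Two calls with the same external ID yield distinct internal IDs (up to the negligible 2^-32 collision chance analyzed above).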

The key changes to support this are:

1. `InputProcessor.process_inputs()` - we add some randomness to the
request ID just before creating an `EngineCoreRequest`, and store both
the random "internal" request ID (as `request_id`) and the supplied
"external" request ID (as `external_req_id`) in the
`EngineCoreRequest`.
2. `RequestState.make_request_output()` - we ensure that
`RequestOutput.request_id` continues to be the external request ID
(for backwards compat) and add `internal_request_id`.
3. `OutputProcessor.abort_requests()` - we make `OutputProcessor`
track a mapping from external request ID to internal request IDs, so
`abort_requests()` can abort based on either ID.
4. `AsyncLLM` - we use `RequestOutputCollector` to track the internal
request ID, so we can use the internal ID to abort an in-progress
request. We also add an `internal` boolean flag to `abort()` so API
users can abort based on either ID.
5. `ParentRequest` - in the case of parallel sampling, we need to
track both the internal and external ID for the later creation of
`RequestOutput` aggregating the child outputs.

We need to ensure we track the external->internal request ID
mapping because abort() will be supplied an external request ID.
In the case where an external request ID maps to multiple running
requests, we assume the caller requires all of those requests
to be aborted. The caller can use EngineCoreRequest.request_id
as the request ID if they want to be more specific.
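The external-to-internal tracking described above might look like this sketch (class and method names are hypothetical, not the actual OutputProcessor code):

```python
from collections import defaultdict

class RequestIdMap:
    """Sketch: track which internal request IDs belong to each external
    ID, so abort(external_id) can fan out to every matching request."""

    def __init__(self) -> None:
        self._ext_to_int: dict[str, set[str]] = defaultdict(set)

    def add(self, external_id: str, internal_id: str) -> None:
        self._ext_to_int[external_id].add(internal_id)

    def pop_for_abort(self, external_id: str) -> set[str]:
        # The caller is assumed to want ALL requests with this external
        # ID aborted; a specific internal request ID can be used
        # instead when finer targeting is needed.
        return self._ext_to_int.pop(external_id, set())

m = RequestIdMap()
m.add("client-req", "client-req-1a2b3c4d")
m.add("client-req", "client-req-9e8f7a6b")
aborted = m.pop_for_abort("client-req")
print(sorted(aborted))                 # both internal IDs
print(m.pop_for_abort("client-req"))   # set(): already removed
```

The pop makes the abort path naturally idempotent at this layer: a second abort for the same external ID finds nothing left to cancel.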

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this pull request Dec 18, 2025
