Conversation

@zerofishnoodles
Collaborator

[Bugfix] Fix routing to deleted endpoint

  • Make the k8s watch stream a long-lasting connection.
  • Handle delete events immediately.

FIX #656 (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s when doing git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. Prefix the PR title appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

Contributor

@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @zerofishnoodles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the Kubernetes service discovery mechanism, specifically related to how the vllm_router handles pod lifecycle events. The primary goal is to ensure that Kubernetes watch streams maintain long-lasting connections and that pod deletion events are processed immediately, preventing stale entries in the list of available engines.

Highlights

  • Kubernetes Watcher Timeout Adjustment: The default timeout for Kubernetes watcher streams has been changed from 30 seconds to 0, effectively making the watch connections long-lasting.
  • Removed Timeout Validation: The validation that required "k8s-watcher-timeout-seconds" to be greater than 0 has been removed, allowing for a 0-second timeout.
  • Immediate Deletion Event Handling: Logic has been added to the Kubernetes engine watcher to immediately process "DELETED" events for pods, ensuring that terminated engines are promptly removed from the active pool (a minimal sketch of this pattern follows below).
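
For reference, here is a minimal sketch of the pattern these highlights describe, assuming the official kubernetes Python client; the function name, available_engines, and the treatment of 0 as "no timeout" are illustrative, not the actual vllm_router code:

```python
from kubernetes import client, watch

def watch_pods(namespace: str, watcher_timeout_seconds: int = 0):
    v1 = client.CoreV1Api()
    w = watch.Watch()
    available_engines = {}  # engine name -> engine URL (illustrative)

    # Treat 0 as "no timeout": omit timeout_seconds so the watch stream
    # becomes a long-lasting connection instead of expiring every 30s.
    kwargs = {"namespace": namespace}
    if watcher_timeout_seconds > 0:
        kwargs["timeout_seconds"] = watcher_timeout_seconds

    for event in w.stream(v1.list_namespaced_pod, **kwargs):
        pod = event["object"]
        name = pod.metadata.name

        # Handle DELETED immediately, before any health check that could
        # block or raise, so a terminated engine never lingers.
        if event["type"] == "DELETED":
            available_engines.pop(name, None)
            continue

        # ADDED / MODIFIED events fall through to the (potentially slow)
        # health check, e.g. querying the pod's /v1/models endpoint.
        ...
```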

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a bug in the Kubernetes service discovery by making the watch stream a long-lasting connection and ensuring DELETED events are handled immediately. The changes in K8sPodIPServiceDiscovery to handle DELETED events promptly and the adjustment of the default k8s-watcher-timeout-seconds to 0 are correct and effectively resolve the issue.

While the changes are good, I have one suggestion for improvement in the argument validation. Additionally, for full consistency across the Kubernetes service discovery mechanisms, you might consider two points for a follow-up:

  1. The default watcher_timeout_seconds in K8sServiceNameServiceDiscovery is still 30. It could be changed to 0 to match the behavior of K8sPodIPServiceDiscovery.
  2. A similar bug with handling DELETED events may exist in K8sServiceNameServiceDiscovery, where an exception could be raised before the service is removed from the list of available engines. Applying a similar fix there would improve robustness.

Overall, this is a good fix that improves the reliability of the router.

@jonoillar

jonoillar commented Aug 27, 2025

I'll test further tomorrow whether your fix works for my specific use case.

While this might resolve it, there is one use case that becomes problematic with your quick fix:
Situation: 2 model pods (pod_1 and pod_2)

  • pod_1 gets an ADDED event associated with it -> it triggers a blocking /v1/models call. The consumption of the event loop is blocked, i.e. the watcher doesn't watch
  • While the watcher doesn't watch, pod_2 is deleted -> its associated entry in self.available_engine won't be updated until the watcher resumes watching

For the entire time the watcher is blocked, requests directed to pod_2 will fail.

This is why I think we need to decouple the watch loop from event handling, which is the purpose of this MR: #666, and handle the events in an asynchronous manner.

Furthermore, I think that this MR still allows an inexperienced user to set k8s-watcher-timeout-seconds>0 and experience the issue I've been experiencing initially in #656.

@zerofishnoodles
Collaborator Author

zerofishnoodles commented Aug 27, 2025

I'll test further tomorrow whether your fix works for my specific use case.

While this might resolve it, there is one use case that becomes problematic with your quick fix: Situation: 2 model pods (pod_1 and pod_2)

  • pod_1 gets an ADDED event associated with it -> it triggers a blocking /v1/models call. The consumption of the event loop is blocked, i.e. the watcher doesn't watch
  • While the watcher doesn't watch, pod_2 is deleted -> its associated entry in self.available_engine won't be updated until the watcher resumes watching

For the entire time the watcher is blocked, requests directed to pod_2 will fail.

This is why I think we need to decouple the watch loop from event handling, which is the purpose of this MR: #666, and handle the events in an asynchronous manner.

Furthermore, I think that this MR still allows an inexperienced user to set k8s-watcher-timeout-seconds>0 and experience the issue I've been experiencing initially in #656.

So from another perspective, can we just handle /v1/models so that when it takes way longer than expected, we just abandon it and continue the loop? Because /v1/models is supposed to be fast and "non-blocking".
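
For illustration, a minimal sketch of this abandon-and-continue idea, assuming the health check uses requests; the function name and default timeout here are hypothetical:

```python
import requests

def check_engine_models(engine_url: str, timeout_seconds: float = 10.0):
    # Query the engine's /v1/models with a hard timeout so a slow or
    # frozen backend cannot block the watch loop indefinitely.
    try:
        resp = requests.get(f"{engine_url}/v1/models", timeout=timeout_seconds)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Timed out or failed: abandon this check and let the caller
        # continue the event loop instead of waiting on the engine.
        return None
```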

And I do believe that using a consumer process will make the design more efficient, but I think we may need more tests for such a big change.

@jonoillar

So from another perspective, can we just handle /v1/models so that when it takes way longer than expected, we just abandon it and continue the loop? Because /v1/models is supposed to be fast and "non-blocking".

We could, but I don't think we should assume anything about how fast /v1/models answers. I think we should just be prepared for any case (the backend being fast or slow). From my usage, I've been facing backends that displayed this exact behaviour, sometimes due to connectivity slowdowns or other issues.

In the end, I don't think we should assume anything about the healthiness of the backend: it can be fast or slow, some endpoints can be fast or slow, there could be connectivity errors. But the vllm router should be agnostic to this and be prepared to handle different backends.

@zerofishnoodles
Collaborator Author

So from another perspective, can we just handle /v1/models so that when it takes way longer than expected, we just abandon it and continue the loop? Because /v1/models is supposed to be fast and "non-blocking".

We could, but I don't think we should assume anything about how fast /v1/models answers. I think we should just be prepared for any case (the backend being fast or slow). From my usage, I've been facing backends that displayed this exact behaviour, sometimes due to connectivity slowdowns or other issues.

In the end, I don't think we should assume anything about the healthiness of the backend: it can be fast or slow, some endpoints can be fast or slow, there could be connectivity errors. But the vllm router should be agnostic to this and be prepared to handle different backends.

I understand it and I think this statement is legit, but I am still a little confused about the event consumer process: if the endpoint is slow, it will block the event consumer process and not delete the endpoint in time, in the same manner as you said before.

@jonoillar

if the endpoint is slow, it will block the event consumer process and not delete the endpoint in time, in the same manner as you said before.

True, it will block the event consumer thread if the call to the endpoint is done within this thread. Hence, making this call in another thread (by decoupling the event consumer and the event handler) asynchronously would work (see the sketch after this list):

  • call the blocking endpoint for pod_1
  • the event loop processes another event in the meantime -> pod_2 is successfully deleted
  • the event loop goes back to checking whether the blocking endpoint call for pod_1 has finished
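
A rough sketch of this decoupling, assuming a plain queue.Queue feeding worker threads; all names here are illustrative:

```python
import queue
import threading

events = queue.Queue()  # shared event queue between watcher and consumer

def watch_loop(w, list_func, **kwargs):
    # Producer: only reads from the watch stream and enqueues events,
    # so a slow /v1/models call can never stall the watcher itself.
    for event in w.stream(list_func, **kwargs):
        events.put(event)

def handle_event(event: dict) -> None:
    ...  # e.g. health-check /v1/models on ADDED, remove engine on DELETED

def consumer_loop():
    # Consumer: each event is handled in its own thread, so a blocking
    # ADDED health check for pod_1 does not delay a DELETED for pod_2.
    while True:
        event = events.get()
        threading.Thread(target=handle_event, args=(event,), daemon=True).start()
```

With handlers running concurrently, shared state such as available_engines would of course need a lock to stay consistent.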

I have to say I'm a bit confused about your statement though, could you explain your point a bit more in detail? Thanks!! 🙏

By the way, the kubernetes watcher is synchronous by default. There is an asynchronous watcher available at https://github.com/tomplus/kubernetes_asyncio, but I didn't want to use it since I don't think it receives as much support as the official kubernetes Python API 😄

@zerofishnoodles
Collaborator Author

if the endpoint is slow, it will block the event consumer process and not delete the endpoint in time, in the same manner as you said before.

True, it will block the event consumer thread if the call to the endpoint is done within this thread. Hence, making this call in another thread (by decoupling the event consumer and the event handler) asynchronously would work:

  • call the blocking endpoint for pod_1
  • the event loop processes another event in the meantime -> pod_2 is successfully deleted
  • the event loop goes back to checking whether the blocking endpoint call for pod_1 has finished

I have to say I'm a bit confused about your statement though, could you explain your point a bit more in detail? Thanks!! 🙏

By the way, the kubernetes watcher is synchronous by default. There is an asynchronous watcher available at https://github.com/tomplus/kubernetes_asyncio, but I didn't want to use it since I don't think it receives as much support as the official kubernetes Python API 😄

I see what you mean. Could you do some tests and stress tests to verify the consistency of shared resources like available_engines?

Yeah, I don't think the asyncio k8s client would be a good choice at this moment either lolll.

@jonoillar

I see what you mean. Could you do some tests and stress tests to verify the consistency of shared resources like available_engines?

For my PR #666? Yep, I can do it. Do you have specific test and stress test scenarios in mind?

@zerofishnoodles
Collaborator Author

zerofishnoodles commented Aug 27, 2025

I see what you mean. Could you do some tests and stress tests to verify the consistency of shared resources like available_engines?

For my PR #666? Yep, I can do it. Do you have specific test and stress test scenarios in mind?

Yep, for PR #666 .

There is a simple stress test script under tests/e2e, but it is request-level, not cluster operations.
Other than that, we might need to simulate some scenarios involving multiple operations, like adding/deleting pods in different orders.

@YuhanLiu11
Collaborator

Hey @jonoillar, thanks for submitting PR #666 to address the zombie IP issue in the production stack. As discussed in this PR, I think this change can be a temporary fix for #656 with minimal changes to the current codebase. For this reason, I'm merging this PR first as a quick fix for issue #656.

That said, I’ll be happy to review PR #666 once you have some testing numbers and have verified that it doesn’t break any functionalities in the router.

@YuhanLiu11 merged commit 8367d43 into vllm-project:main on Aug 27, 2025
14 checks passed
@jonoillar

@YuhanLiu11 @zerofishnoodles I tested your MR and it does fix the issue I was facing.

Now it also creates 2 problematic scenarios:

Scenario 1: unhealthy pod removed forever

Set up:

  • 1 router pod
  • 1 model pod
  • Add logic within the router so that the /v1/models endpoint fails at the 2nd iteration

What happens:

  1. After a round of ADDED - MODIFIED MODIFIED MODIFIED ..., the pod finally gets added to the service discovery -> /v1/models has been called once
  2. Here, make a "dummy" modification to the pod -> for example, add a label to the pod
    • a new MODIFIED event related to the pod will be triggered. This event will trigger a faulty /v1/models call
    • The call fails -> the pod gets deleted

And the pod will never again be added to the available engines.

REAL LIFE SCENARIO: with an unstable backend that has these characteristics:

  • /v1/models fails a few times (for any reason - connectivity, pod freeze, etc.)
  • the pod never changes, so a MODIFIED event related to the pod is never triggered

With this setup, the first time the pod is deleted, it will never be discovered again.

DAMAGE: service discovery won't discover a pod. The pod will use up GPUs for nothing, and other pods will suffer higher load.

Scenario 2: Pod deleted, some request still routed to it

Set up:

  • 1 router pod
  • 1 old pod
  • 9 new pods with blocking /v1/models calls

What happens:
events_queue = []

  • The first 9 pods are created and ready at the same time. There are 9 "ADDED" events lined up, each leading to a call to /v1/models -> events_queue = ["ADDED", "ADDED", ..., "ADDED"]
  • The old pod gets deleted: one DELETED event is added to the list of events to process: events_queue = ["ADDED", "ADDED", ..., "ADDED", "DELETED"]

Each of the "ADDED" events takes self.health_check_timeout_seconds=10s to be processed. So in total it takes 9*self.health_check_timeout_seconds=90s until the "DELETED" event is processed.

During this time, some requests will be routed to the old pod -> they will all fail.

DAMAGE: Within a window of time, requests will be routed to a deleted pod. And this window of time can be proportional to the number of pods being served.

POSSIBLE SOLUTION: the events related to the pods are tackled successively, whereas they could be tackled concurrently. The events related to pod_1 should not have to wait for the events related to pod_2 to be processed before being processed themselves.
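
A rough sketch of this per-pod handling, assuming one worker thread per pod so that events for different pods proceed concurrently while events for the same pod stay ordered; all names are illustrative:

```python
import queue
import threading
from collections import defaultdict

pod_queues = defaultdict(queue.Queue)  # pod name -> that pod's event queue

def handle_event(event: dict) -> None:
    ...  # health check on ADDED/MODIFIED, removal on DELETED

def pod_worker(pod_name: str) -> None:
    # One worker per pod drains that pod's queue in order, so a slow
    # ADDED health check for pod_1 never delays a DELETED for pod_2.
    q = pod_queues[pod_name]
    while True:
        handle_event(q.get())

def dispatch(event: dict) -> None:
    pod_name = event["object"].metadata.name
    first_event = pod_name not in pod_queues
    pod_queues[pod_name].put(event)
    if first_event:
        threading.Thread(target=pod_worker, args=(pod_name,), daemon=True).start()
```

Under this scheme, the DELETED event for the old pod would be handled immediately in its own worker instead of waiting 90s behind the 9 ADDED health checks.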

Conclusion

Scenario 1 is quite likely to happen and is harmful, as one model pod will be silently ignored.

Scenario 2 is more likely to happen, with a varying impact (a window within which requests are routed to a deleted pod). I'd say most of the time this won't be critical, but if this code is deployed in a prod environment where a very high SLA (availability) is required, this can matter a lot.

Please tell me what your thoughts are!! If I'm not clear, please ask 😄

ikaadil pushed a commit to ikaadil/production-stack that referenced this pull request Sep 3, 2025
* [CI] remove docker image before building router image

Signed-off-by: Rui Zhang <[email protected]>

* [Bugfix] Fix routing to deleted endpoint

Signed-off-by: Rui Zhang <[email protected]>

* Revert "[CI] remove docker image before building router image"

This reverts commit 8319e01.

Signed-off-by: Rui Zhang <[email protected]>

* make health check timeout configurable

Signed-off-by: Rui Zhang <[email protected]>

---------

Signed-off-by: Rui Zhang <[email protected]>
Signed-off-by: Ifta Khairul Alam Adil <[email protected]>
Senne-Mennes pushed a commit to Senne-Mennes/production-stack that referenced this pull request Oct 22, 2025

Development

Successfully merging this pull request may close these issues.

[Bug][Router] ZombieIP: router routes to model pod that doesn't exist anymore
