[Bugfix] Fix routing to delete endpoint #668
Conversation
Signed-off-by: Rui Zhang <[email protected]>
Signed-off-by: Rui Zhang <[email protected]>
Summary of Changes
Hello @zerofishnoodles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a bug in the Kubernetes service discovery mechanism, specifically related to how the vllm_router handles pod lifecycle events. The primary goal is to ensure that Kubernetes watch streams maintain long-lasting connections and that pod deletion events are processed immediately, preventing stale entries in the list of available engines.
Highlights
- Kubernetes Watcher Timeout Adjustment: The default timeout for Kubernetes watcher streams has been changed from 30 seconds to 0, effectively making the watch connections long-lasting.
- Removed Timeout Validation: The validation that required "k8s-watcher-timeout-seconds" to be greater than 0 has been removed, allowing for a 0-second timeout.
- Immediate Deletion Event Handling: Logic has been added to the Kubernetes engine watcher to immediately process "DELETED" events for pods, ensuring that terminated engines are promptly removed from the active pool.
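For readers unfamiliar with the watcher, here is a minimal sketch of the behavior described in the highlights above, using the official `kubernetes` Python client. The function and variable names (`watch_pods`, `available_engines`) are illustrative assumptions, not the actual `K8sPodIPServiceDiscovery` implementation:

```python
from kubernetes import client, config, watch

available_engines: dict[str, str] = {}  # hypothetical registry: pod name -> pod IP


def watch_pods(namespace: str, label_selector: str, timeout_seconds: int = 0) -> None:
    """Watch pod events; a timeout of 0 keeps the watch connection long-lasting."""
    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    v1 = client.CoreV1Api()
    kwargs = {"namespace": namespace, "label_selector": label_selector}
    if timeout_seconds > 0:  # only request a server-side timeout when one is set
        kwargs["timeout_seconds"] = timeout_seconds
    for event in watch.Watch().stream(v1.list_namespaced_pod, **kwargs):
        pod = event["object"]
        name = pod.metadata.name
        if event["type"] == "DELETED":
            # Process deletions immediately, before any potentially slow
            # readiness probe, so terminated engines leave the pool promptly.
            available_engines.pop(name, None)
            continue
        if pod.status.pod_ip:  # ADDED / MODIFIED event with an assigned IP
            available_engines[name] = pod.status.pod_ip
```

In this sketch a deleted pod is dropped from the registry before any further processing, which is the behavior this PR introduces for `K8sPodIPServiceDiscovery`.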
Code Review
This pull request addresses a bug in the Kubernetes service discovery by making the watch stream a long-lasting connection and ensuring DELETED events are handled immediately. The changes in K8sPodIPServiceDiscovery to handle DELETED events promptly and the adjustment of the default k8s-watcher-timeout-seconds to 0 are correct and effectively resolve the issue.
While the changes are good, I have one suggestion for improvement in the argument validation. Additionally, for full consistency across the Kubernetes service discovery mechanisms, you might consider two points for a follow-up:
- The default `watcher_timeout_seconds` in `K8sServiceNameServiceDiscovery` is still `30`. It could be changed to `0` to match the behavior of `K8sPodIPServiceDiscovery`.
- A similar bug with handling `DELETED` events may exist in `K8sServiceNameServiceDiscovery`, where an exception could be raised before the service is removed from the list of available engines. Applying a similar fix there would improve robustness.
Overall, this is a good fix that improves the reliability of the router.
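As a rough sketch of the argument-validation point, here is one way a validator could accept `0` (long-lasting watch) while still rejecting negative values; only the flag name comes from the PR, and the parser wiring is an assumption:

```python
import argparse


def non_negative_int(value: str) -> int:
    """Reject negative timeouts but allow 0, which now means a long-lasting watch."""
    parsed = int(value)
    if parsed < 0:
        raise argparse.ArgumentTypeError(
            "k8s-watcher-timeout-seconds must be >= 0 (0 keeps the watch open)"
        )
    return parsed


parser = argparse.ArgumentParser()
parser.add_argument(
    "--k8s-watcher-timeout-seconds",
    type=non_negative_int,
    default=0,  # long-lasting watch by default, matching this PR
)
print(parser.parse_args(["--k8s-watcher-timeout-seconds", "0"]))
```

Accepting 0 while rejecting negatives keeps a guard against nonsensical values without forbidding the new long-lasting default.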
This reverts commit 8319e01. Signed-off-by: Rui Zhang <[email protected]>
I'll test further tomorrow whether your fix works for my specific use case. While this might resolve it, there is one use case that becomes problematic with your quick fix:

For the entire time the watcher is blocked, the request directed to … This is why I think we need to decouple the watch loop from handling the events, which is the purpose of this MR: #666, and handle the events in an asynchronous manner. Furthermore, I think that this MR still allows an inexperienced user to set …
So from another perspective, can we just handle `v1/models` so that when it takes way longer than expected, we just abandon it and continue the loop? The `v1/models` call is supposed to be fast and "non-blocking". I do believe that using a consumer process will make the design more efficient, but I think we may need more tests for such a big change.
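To make the "abandon it and continue the loop" idea concrete, here is a hedged sketch of a bounded `/v1/models` probe; the helper name, port `8000`, and the 5-second default are assumptions rather than the router's actual health check:

```python
import requests


def engine_is_ready(pod_ip: str, timeout_seconds: float = 5.0) -> bool:
    """Probe /v1/models with a bounded timeout so a slow engine cannot
    stall the watch loop; the caller simply skips the pod and moves on."""
    try:
        resp = requests.get(
            f"http://{pod_ip}:8000/v1/models", timeout=timeout_seconds
        )
        return resp.status_code == 200
    except requests.RequestException:
        # Slow or unreachable engine: abandon the check and continue the loop;
        # the pod can be retried on a later event.
        return False
```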
We could, but I don't think we should assume anything about how fast the … In the end, I don't think we should assume anything about the healthiness of the backend: it can be fast or slow, some endpoints can be fast or slow, and there could be connectivity errors. The vllm router should be agnostic to this and be prepared to handle different backends.
I understand, and I think this statement is legit, but I am still a little confused about the event consumer process: if the endpoint is slow, it will block the event consumer process and not delete the endpoint in time, in the same manner as you said before.
True, it will block the event consumer thread if the call to the endpoint is done within this thread. Hence making this call asynchronously in another thread (by decoupling the event consumer and the event handler) would do.

I have to say I'm a bit confused about your statement though, could you explain your point a bit more in detail? Thanks !! 🙏 By the way, the kubernetes watcher is synchronous by default. There is an asynchronous watcher available at https://github.com/tomplus/kubernetes_asyncio , but I didn't want to use it since I don't think it receives as much support as the official kubernetes python api 😄
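A minimal sketch of the decoupling being discussed: the watch loop only produces events, a consumer thread drains the queue, and potentially slow readiness checks are dispatched to a worker pool so neither the watch stream nor deletions are blocked. All names here are illustrative assumptions and this is not the implementation in #666:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

event_queue: "queue.Queue[dict]" = queue.Queue()
health_check_pool = ThreadPoolExecutor(max_workers=4)


def watch_producer(stream) -> None:
    """Producer: push raw watch events onto the queue as they arrive."""
    for event in stream:
        event_queue.put(event)


def event_consumer(on_deleted, on_added) -> None:
    """Consumer: handle cheap deletions inline; dispatch slow readiness
    checks for added pods to the worker pool so they never stall the queue."""
    while True:
        event = event_queue.get()
        try:
            if event["type"] == "DELETED":
                on_deleted(event["object"])
            else:
                health_check_pool.submit(on_added, event["object"])
        finally:
            event_queue.task_done()


# Hypothetical wiring:
# threading.Thread(target=event_consumer,
#                  args=(remove_engine, probe_and_add_engine),
#                  daemon=True).start()
# watch_producer(watch.Watch().stream(v1.list_namespaced_pod, namespace="default"))
```

Whether the handler runs in a pool or in per-event threads is a design choice; the key point is that the queue itself is only ever touched by cheap operations.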
I see what you mean. Could you do some tests and stress tests to make sure of the consistency of shared resources like `available_engines`? Yeah, I don't think the asyncio k8s client would be a good choice at this moment either lolll.
For my PR #666? Yep, I can do it. Do you have specific test and stress test scenarios in mind?
Yep, for PR #666. There is a simple stress test script under tests/e2e, but it is request-level, not cluster operations.
Signed-off-by: Rui Zhang <[email protected]>
Hey @jonoillar, thanks for submitting PR #666 to address the zombie IP issue in the production stack. As discussed in this PR, I think this change can be a temporary fix for #656 with minimal changes to the current codebase. For this reason, I'm merging this PR first as a quick fix for issue #656. That said, I'll be happy to review PR #666 once you have some testing numbers and have verified that it doesn't break any functionality in the router.
@YuhanLiu11 @zerofishnoodles I tested your MR and it does fix the issue I was facing. However, it also creates 2 problematic scenarios:

Scenario 1: unhealthy pod removed forever

Set up:

What happens:

The pod will never again be added to the available engines.

REAL LIFE SCENARIO: with an unstable backend that has these characteristics:

With this setup, the first time the pod is deleted, it will never be discovered again.

DAMAGE: service discovery won't discover a pod. The pod will use up GPUs for nothing, and other pods will suffer higher load.

Scenario 2: pod deleted, some requests still routed to it

Set up:

What happens:

Each of the "ADDED" events takes time to process. During this time, some requests will be routed to the old pod -> they will all fail.

DAMAGE: within a window of time, requests will be routed to a deleted pod, and this window can be proportional to the number of pods to be served.

POSSIBLE SOLUTION: the events related to the pods are handled successively, whereas they could be handled concurrently. The events related to …

Conclusion

Scenario 1 is quite likely to happen and is harmful, as one model pod will be silently ignored. Scenario 2 is more likely to happen, with an impact (a window within which requests are routed to a deleted pod) that varies. I'd say most of the time this won't be critical, but if this code is deployed in a prod environment where very high SLAs (availability) are required, this can matter a lot.

Please tell me your thoughts !! If I'm not clear, please ask 😄
* [CI] remove docker image before building router image (Signed-off-by: Rui Zhang <[email protected]>)
* [Bugfix] Fix routing to deleted endpoint (Signed-off-by: Rui Zhang <[email protected]>)
* Revert "[CI] remove docker image before building router image": this reverts commit 8319e01 (Signed-off-by: Rui Zhang <[email protected]>)
* make health check timeout configurable (Signed-off-by: Rui Zhang <[email protected]>)

Signed-off-by: Rui Zhang <[email protected]>
Signed-off-by: Ifta Khairul Alam Adil <[email protected]>
[Bugfix] Fix routing to delete endpoint
FIX #656 (link existing issues this PR will resolve)
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
- Sign off your commit by using `-s` when doing `git commit`.
- Classify the PR with a prefix such as `[Bugfix]`, `[Feat]`, and `[CI]`.

Detailed Checklist
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Please use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO. Using `-s` with `git commit` will automatically add this header.

What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.