Add a polling fallback for gpu label #7779
Conversation
# Fall back to polling instead of watch API
# The watch API is unreliable when resource versions are changing
# rapidly or when there are multiple API server instances with
# different cache states
return _poll_jobs_completion(jobs_to_node_names, namespace, context,
Can we just retry the watch on this error, given that this only happens when the client retries against another server?
I tried retrying the watch within seconds, but it didn't work. I think if we insist on using retry, we might need a longer timeout for the watch to work with Nebius K8s.
Then is there any reason other than the one in the issue description? If the cause is multi-replica API server failover, as the description analyzes, then retrying the list and watch should address the resource version issue, since we've rebuilt the connection to a new server replica.
Will look into this again.
Resolves #7762
Problem: GPU Labeler Fails with "Too large resource version" Error on Nebius
Root Cause
The Kubernetes Watch API fails with a 504 error when there's a resource version mismatch between the client and the k8s API server's watch cache.
Why the Watch API Fails
Watch API relies on sequential resource versions: The Kubernetes watch mechanism requires the client's requested resource version to be within the range that the API server's watch cache holds.
Multiple API server instances cause version skew: In managed Kubernetes clusters (GKE, EKS, Nebius, etc.), multiple API server instances exist for high availability. Each instance maintains its own watch cache that may be at different resource versions.
The watch stream automatically tracks resource versions: When watch.stream() is called, the Kubernetes Python client manages the resource version on the caller's behalf.
Resource versions increment during the watch: As the watch processes events, it automatically updates its internal resource_version field from each event's metadata. When the watch fails and is retried, even creating a new Watch object doesn't help, because the new LIST call returns an even newer resource version that a lagging replica's watch cache may still not have.
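For illustration only, here is a minimal sketch of the watch pattern that runs into this failure. The function name and printout are hypothetical, not the code in this repo:

```python
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def watch_job_completion(namespace: str) -> None:
    """Watch Job events until the watch stream ends or fails.

    The client records the resource version from the initial LIST and from
    each event; if a reconnect lands on an API server replica whose watch
    cache is behind that version, the server answers with
    "Too large resource version" (surfaced as an ApiException, HTTP 504).
    """
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    w = watch.Watch()
    try:
        for event in w.stream(batch_v1.list_namespaced_job,
                              namespace=namespace,
                              timeout_seconds=300):
            job = event['object']
            print(event['type'], job.metadata.name, job.status.succeeded)
    except ApiException as e:
        # This is the failure mode described above; the PR handles it by
        # falling back to polling.
        print(f'Watch failed: status={e.status} reason={e.reason}')
    finally:
        w.stop()
```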
Why Polling Works
No resource version dependency: Polling uses simple list_namespaced_job() calls without the watch mechanism; they always return the current state without requiring sequential resource versions.
Each poll is independent: Every polling request is a fresh LIST operation that doesn't depend on previous state or resource versions.
No watch cache involved: The LIST API endpoint reads from etcd or the API server's cache directly, not from the watch cache that may be stale.
Tolerant of API server switching: Even if requests are load-balanced across different API server instances with different cache states, each LIST call returns the current state from that instance.
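A small sketch of one such independent poll; the helper name jobs_succeeded and the success check are illustrative assumptions, not the PR's _poll_jobs_completion:

```python
from kubernetes import client, config


def jobs_succeeded(job_names, namespace):
    """One independent LIST call: no watch, no resourceVersion parameter,
    so a stale watch cache on any API server replica doesn't matter."""
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    jobs = batch_v1.list_namespaced_job(namespace=namespace)
    done = {j.metadata.name for j in jobs.items
            if (j.status.succeeded or 0) > 0}
    return set(job_names) <= done
```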
The Fix
This PR implements a fallback mechanism: the GPU labeler tries the watch API first, and when the watch fails with the resource version error, it falls back to polling job status with plain LIST calls (_poll_jobs_completion in the diff above).
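Roughly, the fallback has the shape sketched below. The wrapper name, arguments, and completion check are assumptions, the error check is simplified, and the PR's actual _poll_jobs_completion helper takes more arguments than shown in the quoted diff:

```python
import time

from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def wait_for_jobs(job_names, namespace, poll_interval=5):
    """Watch for Job completion; fall back to polling if the watch fails."""
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    remaining = set(job_names)

    def _mark_done(job):
        if job.metadata.name in remaining and (job.status.succeeded or 0) > 0:
            remaining.discard(job.metadata.name)

    try:
        # Preferred path: the watch API.
        for event in watch.Watch().stream(batch_v1.list_namespaced_job,
                                          namespace=namespace,
                                          timeout_seconds=300):
            _mark_done(event['object'])
            if not remaining:
                return
    except ApiException as e:
        if e.status != 504 and 'Too large resource version' not in str(e):
            raise
        # The watch cache and the client's resource version disagree; fall
        # back to polling, which does not depend on resource versions.
    while remaining:
        for job in batch_v1.list_namespaced_job(namespace=namespace).items:
            _mark_done(job)
        if remaining:
            time.sleep(poll_interval)
```

The key property is that nothing after the except clause carries a resource version, so a lagging watch cache on any replica cannot affect the polling loop.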
Test
Tested (run the relevant ones):
bash format.sh
/smoke-test (CI) or pytest tests/test_smoke.py (local)
/smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
/quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)