@zpoint (Collaborator) commented Oct 29, 2025

Resolve #7762

Problem: GPU Labeler Fails with "Too large resource version" Error on Nebius

Root Cause

The Kubernetes Watch API fails with a 504 error when there's a resource version mismatch between the client and the k8s API server's watch cache:

python -m sky.utils.kubernetes.gpu_labeler

Found 5 unlabeled GPU nodes in the cluster
Using nvidia RuntimeClass for GPU labeling.
Created GPU labeler job for node computeinstance-e00c2pvvejrgxgp35g
Created GPU labeler job for node computeinstance-e00n7hs0fjmqqhk0y3
Created GPU labeler job for node computeinstance-e00sfp1hy7zkhv83r4
Created GPU labeler job for node computeinstance-e00t1t95cx9drjzr6g
Created GPU labeler job for node computeinstance-e00zpg9ntzyyy5w2qg
Traceback (most recent call last):
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 256, in <module>
    main()
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 252, in main
    label(context=context, wait_for_completion=not args.async_completion)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 147, in label
    success = wait_for_jobs_completion(jobs_to_node_names,
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 186, in wait_for_jobs_completion
    for event in w.stream(func=batch_v1.list_namespaced_job,
  File "/home/buildkite/miniconda3/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 202, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (504)
Reason: Timeout: Timeout: Too large resource version: 4400, current: 4395

Why the Watch API Fails

  • Watch API relies on sequential resource versions: The Kubernetes watch mechanism requires the client's requested resource version to be within the range that the API server's watch cache holds.

  • Multiple API server instances cause version skew: In managed Kubernetes clusters (GKE, EKS, Nebius, etc.), multiple API server instances exist for high availability. Each instance maintains its own watch cache that may be at different resource versions.

  • The watch stream automatically tracks resource versions: When watch.stream() is called, the Kubernetes Python client:

    • first performs a LIST of the jobs to get their current state,
    • pins the resourceVersion returned by that LIST, which comes from whichever API server instance served it,
    • then opens a watch that streams events starting from that version, and
    • gets a 504 error if the watch request is routed to a different API server instance whose watch cache hasn't reached that version yet.
  • Resource versions increment during the watch: As the watch processes events, it automatically updates its internal resource_version field from each event's metadata. When the watch fails and is retried, even creating a new Watch object doesn't help: the fresh LIST returns an even newer resource version, which a lagging replica can reject again. A minimal sketch of this failure mode follows the list.
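To make this concrete, here is a minimal sketch of the list-then-watch flow with the kubernetes Python client and the 504 it can raise. This is not the labeler's exact code; the kube-system namespace and the 300-second timeout are illustrative assumptions.

from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

config.load_kube_config()
batch_v1 = client.BatchV1Api()
namespace = 'kube-system'  # assumption: namespace holding the labeler jobs

w = watch.Watch()
try:
    # stream() first LISTs the jobs, pins the resourceVersion from that
    # response, then opens a watch starting at that version.
    for event in w.stream(func=batch_v1.list_namespaced_job,
                          namespace=namespace,
                          timeout_seconds=300):
        job = event['object']
        if job.status.succeeded:
            print(f'Job {job.metadata.name} completed')
except ApiException as e:
    # A replica whose watch cache is behind the pinned version rejects the
    # request with 504 "Too large resource version: ..., current: ...".
    if e.status == 504:
        print(f'Watch failed with 504: {e.reason}')
    else:
        raise
finally:
    w.stop()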

Why Polling Works

  • No resource version dependency: Polling uses simple list_namespaced_job() calls without the watch mechanism, which always returns the current state without requiring sequential resource versions.

  • Each poll is independent: Every polling request is a fresh LIST operation that doesn't depend on previous state or resource versions.

  • No watch cache involved: The LIST API endpoint reads from etcd or the API server's cache directly, not from the watch cache that may be stale.

  • Tolerant of API server switching: Even if requests are load-balanced across API server instances with different cache states, each LIST call simply returns the current state from the instance that answers. A minimal polling sketch follows this list.
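For contrast, a minimal polling sketch under the same assumptions (kube-system namespace, hypothetical job names). Each iteration issues an independent LIST and carries no resource version between calls.

import time

from kubernetes import client, config

config.load_kube_config()
batch_v1 = client.BatchV1Api()
namespace = 'kube-system'              # assumption: namespace of the labeler jobs
pending = {'sky-gpu-labeler-example'}  # hypothetical job names to wait for

while pending:
    # A fresh LIST on every iteration: no resourceVersion is carried over,
    # so it does not matter which API server replica answers.
    jobs = batch_v1.list_namespaced_job(namespace=namespace)
    for job in jobs.items:
        if job.metadata.name in pending and job.status.succeeded:
            pending.discard(job.metadata.name)
    if pending:
        time.sleep(5)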

The Fix

This PR implements a fallback mechanism:

  • Primary method: Try the Watch API first (most efficient, real-time updates)
  • Fallback method: If the watch fails with a 504 "Too large resource version" error, fall back to polling
  • Polling strategy: Check job status every 5 seconds by listing all jobs in the namespace (a condensed sketch of this flow follows the list)
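A condensed sketch of the fallback flow, not the PR's exact implementation: wait_for_jobs and poll_jobs are illustrative stand-ins for the labeler's helpers (the PR's polling helper, _poll_jobs_completion, appears in the diff excerpt below), jobs_to_node_names mirrors the dict seen in the traceback above, and failure handling and logging are omitted.

import time

from kubernetes import watch
from kubernetes.client.exceptions import ApiException


def poll_jobs(batch_v1, jobs_to_node_names, namespace, interval=5):
    # Fallback: independent LIST calls every few seconds, no resource versions.
    while jobs_to_node_names:
        jobs = batch_v1.list_namespaced_job(namespace=namespace)
        for job in jobs.items:
            if job.metadata.name in jobs_to_node_names and job.status.succeeded:
                jobs_to_node_names.pop(job.metadata.name)
        if jobs_to_node_names:
            time.sleep(interval)
    return True


def wait_for_jobs(batch_v1, jobs_to_node_names, namespace):
    # Primary path: watch for job events in real time.
    w = watch.Watch()
    try:
        for event in w.stream(func=batch_v1.list_namespaced_job,
                              namespace=namespace,
                              timeout_seconds=600):
            job = event['object']
            if job.metadata.name in jobs_to_node_names and job.status.succeeded:
                jobs_to_node_names.pop(job.metadata.name)
            if not jobs_to_node_names:
                return True
        return False
    except ApiException as e:
        if e.status == 504 and 'Too large resource version' in str(e):
            # The watch cache rejected the pinned resource version; finish
            # the wait with plain polling instead.
            return poll_jobs(batch_v1, jobs_to_node_names, namespace)
        raise
    finally:
        w.stop()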

Test

[screenshot of the test run]

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@zpoint changed the title from Retry for gpu label watch to Add a fallback to polling for gpu label watch on Oct 29, 2025
@zpoint changed the title from Add a fallback to polling for gpu label watch to Add a polling fallback for gpu label on Oct 29, 2025
@zpoint requested a review from aylei on October 30, 2025 09:07
Comment on lines +297 to +301
# Fall back to polling instead of watch API
# The watch API is unreliable when resource versions are changing
# rapidly or when there are multiple API server instances with
# different cache states
return _poll_jobs_completion(jobs_to_node_names, namespace, context,
Collaborator

Can we just retry the watch on this error, given that it only happens when the client's request is routed to another API server?

Collaborator Author

I tried retrying the watch after a few seconds, but it didn't work. If we insist on retrying, I think we'd need a longer timeout for the watch to work with Nebius K8s.

Collaborator

Then is there any reason other than the one in the issue description? If the cause is multi-replica API server failover, as the description analyzes, then retrying the list and watch should address the resource version issue, since we've rebuilt the connection to a new server replica.

Collaborator Author

Will look into this again.
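For reference, "retry list and watch" here would mean rebuilding the Watch and re-issuing the LIST on each 504, so the pinned resourceVersion comes from whichever replica serves the new request. Below is a minimal sketch of that alternative, assuming a BatchV1Api client; as noted above, a quick retry did not resolve it on Nebius, so the backoff and timeout values would likely need tuning.

import time

from kubernetes import watch
from kubernetes.client.exceptions import ApiException


def watch_jobs_with_retry(batch_v1, namespace, max_attempts=3, backoff=10):
    # Each attempt creates a new Watch and re-LISTs, so the resourceVersion
    # is pinned against whichever API server replica answers this time.
    for attempt in range(max_attempts):
        w = watch.Watch()
        try:
            for event in w.stream(func=batch_v1.list_namespaced_job,
                                  namespace=namespace,
                                  timeout_seconds=600):
                yield event
            return
        except ApiException as e:
            if e.status != 504 or attempt == max_attempts - 1:
                raise
            time.sleep(backoff)  # give the lagging replica time to catch up
        finally:
            w.stop()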


Successfully merging this pull request may close this issue: GPU label fail on nebius L40S clusters