@zpoint (Collaborator) commented Oct 29, 2025

Resolve #7762

Problem: GPU Labeler Fails with "Too large resource version" Error on Nebius

Root Cause

The Kubernetes Watch API fails with a 504 error when there's a resource version mismatch between the client and the k8s API server's watch cache:

python -m sky.utils.kubernetes.gpu_labeler

Found 5 unlabeled GPU nodes in the cluster
Using nvidia RuntimeClass for GPU labeling.
Created GPU labeler job for node computeinstance-e00c2pvvejrgxgp35g
Created GPU labeler job for node computeinstance-e00n7hs0fjmqqhk0y3
Created GPU labeler job for node computeinstance-e00sfp1hy7zkhv83r4
Created GPU labeler job for node computeinstance-e00t1t95cx9drjzr6g
Created GPU labeler job for node computeinstance-e00zpg9ntzyyy5w2qg
Traceback (most recent call last):
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 256, in <module>
    main()
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 252, in main
    label(context=context, wait_for_completion=not args.async_completion)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 147, in label
    success = wait_for_jobs_completion(jobs_to_node_names,
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 186, in wait_for_jobs_completion
    for event in w.stream(func=batch_v1.list_namespaced_job,
  File "/home/buildkite/miniconda3/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 202, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (504)
Reason: Timeout: Timeout: Too large resource version: 4400, current: 4395

Why the Watch API Fails

  • Watch API relies on sequential resource versions: The Kubernetes watch mechanism requires the client's requested resource version to be within the range that the API server's watch cache holds.

  • Multiple API server instances cause version skew: In managed Kubernetes clusters (GKE, EKS, Nebius, etc.), multiple API server instances exist for high availability. Each instance maintains its own watch cache that may be at different resource versions.

  • The watch stream automatically tracks resource versions: When watch.stream() is called, the Kubernetes Python client:

    • first performs a LIST of the jobs to get their current state,
    • pins the resourceVersion returned by that LIST, which comes from whichever API server instance served it,
    • then opens a watch that streams events starting from that version, and
    • gets a 504 error if the watch request is routed to a different API server instance whose watch cache hasn't reached that version yet.
  • Resource versions increment during the watch: As the watch processes events, it automatically updates its internal resource_version field from each event's metadata. When the watch fails and is retried, even creating a new Watch object doesn't help: the fresh LIST returns an even newer resource version, which a lagging replica can reject again. A minimal sketch of this failure mode follows the list.
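To make this concrete, here is a minimal sketch of the list-then-watch flow with the kubernetes Python client and the 504 it can raise. This is not the labeler's exact code; the kube-system namespace and the 300-second timeout are illustrative assumptions.

from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

config.load_kube_config()
batch_v1 = client.BatchV1Api()
namespace = 'kube-system'  # assumption: namespace holding the labeler jobs

w = watch.Watch()
try:
    # stream() first LISTs the jobs, pins the resourceVersion from that
    # response, then opens a watch starting at that version.
    for event in w.stream(func=batch_v1.list_namespaced_job,
                          namespace=namespace,
                          timeout_seconds=300):
        job = event['object']
        if job.status.succeeded:
            print(f'Job {job.metadata.name} completed')
except ApiException as e:
    # A replica whose watch cache is behind the pinned version rejects the
    # request with 504 "Too large resource version: ..., current: ...".
    if e.status == 504:
        print(f'Watch failed with 504: {e.reason}')
    else:
        raise
finally:
    w.stop()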

Why Polling Works

  • No resource version dependency: Polling uses simple list_namespaced_job() calls without the watch mechanism, which always returns the current state without requiring sequential resource versions.

  • Each poll is independent: Every polling request is a fresh LIST operation that doesn't depend on previous state or resource versions.

  • No watch cache involved: The LIST API endpoint reads from etcd or the API server's cache directly, not from the watch cache that may be stale.

  • Tolerant of API server switching: Even if requests are load-balanced across API server instances with different cache states, each LIST call simply returns the current state from the instance that answers. A minimal polling sketch follows this list.
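For contrast, a minimal polling sketch under the same assumptions (kube-system namespace, hypothetical job names). Each iteration issues an independent LIST and carries no resource version between calls.

import time

from kubernetes import client, config

config.load_kube_config()
batch_v1 = client.BatchV1Api()
namespace = 'kube-system'              # assumption: namespace of the labeler jobs
pending = {'sky-gpu-labeler-example'}  # hypothetical job names to wait for

while pending:
    # A fresh LIST on every iteration: no resourceVersion is carried over,
    # so it does not matter which API server replica answers.
    jobs = batch_v1.list_namespaced_job(namespace=namespace)
    for job in jobs.items:
        if job.metadata.name in pending and job.status.succeeded:
            pending.discard(job.metadata.name)
    if pending:
        time.sleep(5)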

The Fix

This PR implements a fallback mechanism:

  • Primary method: Try the Watch API first (most efficient, real-time updates)
  • Fallback method: If the watch fails with a 504 "Too large resource version" error, fall back to polling
  • Polling strategy: Check job status every 5 seconds by listing all jobs in the namespace (a condensed sketch of this flow follows the list)
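A condensed sketch of the fallback flow, not the PR's exact implementation: wait_for_jobs and poll_jobs are illustrative stand-ins for the labeler's helpers (the PR's polling helper, _poll_jobs_completion, appears in the diff excerpt below), jobs_to_node_names mirrors the dict seen in the traceback above, and failure handling and logging are omitted.

import time

from kubernetes import watch
from kubernetes.client.exceptions import ApiException


def poll_jobs(batch_v1, jobs_to_node_names, namespace, interval=5):
    # Fallback: independent LIST calls every few seconds, no resource versions.
    while jobs_to_node_names:
        jobs = batch_v1.list_namespaced_job(namespace=namespace)
        for job in jobs.items:
            if job.metadata.name in jobs_to_node_names and job.status.succeeded:
                jobs_to_node_names.pop(job.metadata.name)
        if jobs_to_node_names:
            time.sleep(interval)
    return True


def wait_for_jobs(batch_v1, jobs_to_node_names, namespace):
    # Primary path: watch for job events in real time.
    w = watch.Watch()
    try:
        for event in w.stream(func=batch_v1.list_namespaced_job,
                              namespace=namespace,
                              timeout_seconds=600):
            job = event['object']
            if job.metadata.name in jobs_to_node_names and job.status.succeeded:
                jobs_to_node_names.pop(job.metadata.name)
            if not jobs_to_node_names:
                return True
        return False
    except ApiException as e:
        if e.status == 504 and 'Too large resource version' in str(e):
            # The watch cache rejected the pinned resource version; finish
            # the wait with plain polling instead.
            return poll_jobs(batch_v1, jobs_to_node_names, namespace)
        raise
    finally:
        w.stop()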

Test

[screenshot of the test run]

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@zpoint changed the title from Retry for gpu label watch to Add a fallback to polling for gpu label watch on Oct 29, 2025
@zpoint changed the title from Add a fallback to polling for gpu label watch to Add a polling fallback for gpu label on Oct 29, 2025
@zpoint requested a review from aylei on October 30, 2025 09:07
Comment on lines +297 to +301
# Fall back to polling instead of watch API
# The watch API is unreliable when resource versions are changing
# rapidly or when there are multiple API server instances with
# different cache states
return _poll_jobs_completion(jobs_to_node_names, namespace, context,
Collaborator

Can we just retry the watch on this error, given that it only happens when the client's request is routed to another API server?

Collaborator Author

I tried retrying the watch after a few seconds, but it didn't work. If we insist on retrying, I think we'd need a longer timeout for the watch to work with Nebius K8s.

Collaborator

Then is there any reason other than the one in the issue description? If the cause is multi-replica API server failover, as the description analyzes, then retrying the list and watch should address the resource version issue, since we've rebuilt the connection to a new server replica.

Collaborator Author

Will look into this again.
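For reference, "retry list and watch" here would mean rebuilding the Watch and re-issuing the LIST on each 504, so the pinned resourceVersion comes from whichever replica serves the new request. Below is a minimal sketch of that alternative, assuming a BatchV1Api client; as noted above, a quick retry did not resolve it on Nebius, so the backoff and timeout values would likely need tuning.

import time

from kubernetes import watch
from kubernetes.client.exceptions import ApiException


def watch_jobs_with_retry(batch_v1, namespace, max_attempts=3, backoff=10):
    # Each attempt creates a new Watch and re-LISTs, so the resourceVersion
    # is pinned against whichever API server replica answers this time.
    for attempt in range(max_attempts):
        w = watch.Watch()
        try:
            for event in w.stream(func=batch_v1.list_namespaced_job,
                                  namespace=namespace,
                                  timeout_seconds=600):
                yield event
            return
        except ApiException as e:
            if e.status != 504 or attempt == max_attempts - 1:
                raise
            time.sleep(backoff)  # give the lagging replica time to catch up
        finally:
            w.stop()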


Successfully merging this pull request may close this issue: GPU label fail on nebius L40S clusters