[FIX] raise error if job does not terminate in tail_job_logs() #57037
Conversation
Signed-off-by: machichima <[email protected]>
Code Review
This pull request fixes an issue where tail_job_logs() would not raise an error if the connection to the Ray head node was lost while a job was still running. The change introduces a check for the job's status when the log-tailing websocket is closed.
My review focuses on improving the performance and correctness of this new logic. The current implementation introduces a blocking, synchronous call inside an async function, and it inefficiently queries the job status on every log message. I've provided a suggestion to make the call non-blocking and to only query the status when necessary, which improves both performance and robustness.
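A hedged sketch of the optimization the bot describes, decoupled from the SDK class for illustration. The function name and the injected get_job_info callable are hypothetical; JobDetails and the is_terminal() helper come from ray.job_submission. This is not the code in the PR:

```python
import asyncio
from typing import Callable

import aiohttp

from ray.job_submission import JobDetails


async def tail_logs_once_check(
    ws: aiohttp.ClientWebSocketResponse,
    job_id: str,
    get_job_info: Callable[[str], JobDetails],
):
    """Yield log lines; check the job status only once, when the socket ends."""
    while True:
        msg = await ws.receive()
        if msg.type == aiohttp.WSMsgType.TEXT:
            yield msg.data
        elif msg.type in (aiohttp.WSMsgType.CLOSED, aiohttp.WSMsgType.ERROR):
            # Run the synchronous SDK call off the event loop so it does not block it.
            job_info = await asyncio.to_thread(get_job_info, job_id)
            if not job_info.status.is_terminal():
                raise RuntimeError(
                    f"Log tailing for job {job_id} stopped before the job "
                    f"reached a terminal state (status: {job_info.status})."
                )
            break
```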
Signed-off-by: machichima <[email protected]>
Force-pushed 31cde67 to aea0717
# Query job status after receiving each message to track state
try:
    job_info = self.get_job_info(job_id)
    job_status = job_info.status
except Exception as e:
    raise RuntimeError(f"Failed to get job status for {job_id}.") from e
Should we query the job info and job status outside of the loop? In that case we would only have to query once.
We need the up-to-date status from right before the connection closed, which is why we do it inside the while loop.
The job info query only runs each time we receive a new message, which is not that frequent.
is it possible to check the msg to detect loss of connection?
I tried checking msg and ws, but I cannot really tell the difference between a normal close and an abnormal one.
This is the output for a normal close:
❯ python reproduce_tail_logs_issue.py
Job submitted with ID: raysubmit_SVbekdrqykf3mz7c
Starting to tail job logs...
2025-10-02 20:49:50,188 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=2025-10-02 20:49:47,066 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
, extra=
2025-10-02 20:49:50,188 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 2025-10-02 20:49:47,066 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
job status: RUNNING
2025-10-02 20:49:51,190 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=3
, extra=
2025-10-02 20:49:51,191 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 3
job status: RUNNING
2025-10-02 20:49:52,190 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=4
, extra=
2025-10-02 20:49:52,190 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 4
job status: RUNNING
2025-10-02 20:49:53,192 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=5
, extra=
2025-10-02 20:49:53,192 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 5
job status: SUCCEEDED
2025-10-02 20:49:54,193 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=Job completed
, extra=
2025-10-02 20:49:54,193 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: Job completed
job status: SUCCEEDED
2025-10-02 20:49:57,197 INFO sdk.py:502 -- [DEBUG] msg attributes: type=8, data=1000, extra=
2025-10-02 20:49:57,197 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
2025-10-02 20:49:57,197 INFO sdk.py:502 -- [DEBUG] msg attributes: type=257, data=None, extra=None
2025-10-02 20:49:57,197 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
tail_job_logs() returned normally (no exception)
This is the output when the Ray head is terminated before the job finishes:
❯ python reproduce_tail_logs_issue.py
Job submitted with ID: raysubmit_DFgSjCpApJcswQFT
Starting to tail job logs...
2025-10-02 20:49:28,838 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=2025-10-02 20:49:25,729 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
, extra=
2025-10-02 20:49:28,838 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 2025-10-02 20:49:25,729 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
job status: RUNNING
2025-10-02 20:49:29,847 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=3
, extra=
2025-10-02 20:49:29,847 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 3
job status: RUNNING
2025-10-02 20:49:30,076 INFO sdk.py:502 -- [DEBUG] msg attributes: type=8, data=1000, extra=
2025-10-02 20:49:30,076 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
2025-10-02 20:49:30,076 INFO sdk.py:502 -- [DEBUG] msg attributes: type=257, data=None, extra=None
2025-10-02 20:49:30,076 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
tail_job_logs() returned normally (no exception)

while True:
    msg = await ws.receive()
Won't this ws.receive raise an exception when the connection is broken? Isn't that enough? I actually think there is no need to query the job status.
It actually does not raise an error when the connection is lost. I think in older versions it did (that's what we relied on in python/ray/dashboard/modules/job/tests/backwards_compatibility_scripts/test_backwards_compatibility.sh), but in newer versions it just closes without raising.
Even the close_code is no different from a normal close.
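For context, a stripped-down receive loop (a sketch, not the SDK code) illustrates what the debug output above shows: in both runs the socket simply delivers CLOSE/CLOSED messages with close_code 1000, so there is no exception or close code that distinguishes a lost head from a normal finish:

```python
import aiohttp


async def drain(ws: aiohttp.ClientWebSocketResponse) -> None:
    while True:
        msg = await ws.receive()
        if msg.type == aiohttp.WSMsgType.TEXT:
            print("LOG:", msg.data)
        elif msg.type in (aiohttp.WSMsgType.CLOSE, aiohttp.WSMsgType.CLOSED):
            # Per the logs above, close_code is 1000 whether the job finished
            # or the head was killed mid-run, so the code alone can't tell.
            print("Close code:", ws.close_code)
            break
```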
Signed-off-by: machichima <[email protected]>
| print(f"Close code: {ws.close_code}") | ||
| if ws.close_code == aiohttp.WSCloseCode.ABNORMAL_CLOSURE: | ||
| raise RuntimeError( | ||
| f"WebSocket connection closed unexpectedly while job with close code {ws.close_code}" | ||
| ) | ||
| break | ||
| elif msg.type == aiohttp.WSMsgType.ERROR: | ||
| pass | ||
| # Old Ray versions may send ERROR on connection close | ||
| raise RuntimeError( | ||
| f"WebSocket error while tailing logs for job {job_id}. Err: {ws.exception()}" | ||
| ) | ||
| break |
I LOVE THIS!
if msg.type == aiohttp.WSMsgType.TEXT:
    yield msg.data
elif msg.type == aiohttp.WSMsgType.CLOSED:
    print(f"Close code: {ws.close_code}")
Delete this print or use the logger.
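For example (sdk.py already logs through a module-level logger, as the `INFO sdk.py:502` lines above show):

```python
logger.debug("Websocket close code: %s", ws.close_code)
```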
# Old Ray versions may send ERROR on connection close
raise RuntimeError(
    f"WebSocket error while tailing logs for job {job_id}. Err: {ws.exception()}"
)
Why do we need to handle old Ray versions here? I thought we only support the job client and job server being on the same version?
The client only uses the HTTP/websocket protocol, so the compatibility requirements are looser than that. We don't give an exact guarantee, though.
Signed-off-by: machichima <[email protected]>
Raises:
    RuntimeError: If the job does not exist or if the request to the
        job server fails.
    RuntimeError: If the job does not exist, if the request to the
Is it easy to write a test for it?
Let me have a try!
Added in 899c05d
Signed-off-by: machichima <[email protected]>
…og-error-handle Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
finally:
    # Ensure Ray is stopped even if test fails
    subprocess.check_output(["ray", "stop", "--force"])
Please use existing fixtures for this; I believe the equivalent to what you've written is ray_start_regular.
For example, in ray/python/ray/dashboard/tests/test_dashboard.py (line 1110 at f6f14aa):

    def test_dashboard_requests_fail_on_missing_deps(ray_start_regular):

You might need to import it on this line:

    from ray.tests.conftest import _ray_start
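A rough structural sketch of such a fixture-based test; the entrypoint, dashboard address, and the way the connection loss is triggered are placeholders, not the test added in this PR:

```python
import asyncio

import pytest

from ray.job_submission import JobSubmissionClient


def test_tail_job_logs_raises_on_lost_connection(ray_start_regular):
    # ray_start_regular (from ray.tests.conftest) starts and tears down Ray,
    # replacing the manual `ray stop --force` cleanup in the finally block.
    client = JobSubmissionClient("http://127.0.0.1:8265")
    job_id = client.submit_job(
        entrypoint="python -c 'import time; time.sleep(60)'"
    )

    async def tail():
        async for _ in client.tail_job_logs(job_id):
            # A real test would kill the dashboard/head process here to break
            # the connection while the job is still running.
            pass

    # With the connection broken before the job terminates, tailing should
    # raise instead of returning silently.
    with pytest.raises(RuntimeError):
        asyncio.run(tail())
```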
Updated in 5ed2be4
Thanks!
    break
elif msg.type == aiohttp.WSMsgType.ERROR:
    # Old Ray versions may send ERROR on connection close
when did this behavior change?
I encountered the error in the test python/ray/dashboard/modules/job/tests/test_backwards_compatibility.py::TestBackwardsCompatibility::test_cli, which hits this branch. That test runs test_backwards_compatibility.sh, which uses Ray version 2.0.1.
Signed-off-by: machichima <[email protected]>
Force-pushed a4fb045 to 1406cd3
Signed-off-by: machichima <[email protected]>
edoakes left a comment:
Looks good, a few minor nits. Kicked off premerge CI tests here: https://buildkite.com/ray-project/premerge/builds/51666
# Kill the dashboard after receiving a few log lines
if i == 3:
    from ray._private import ray_constants
no need for lazy import here, move it to the top of the file
from ray.runtime_env.runtime_env import RuntimeEnv, RuntimeEnvConfig
from ray.tests.conftest import _ray_start

import psutil
This should go before pytest (import ordering is builtins, then third-party, then Ray internal).
I think we have a linter that usually enforces this 🤔
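For reference, the ordering the reviewer describes would look roughly like this (psutil and pytest are just the imports discussed in this thread):

```python
import subprocess  # builtins / stdlib first

import psutil  # third-party next
import pytest

from ray.tests.conftest import _ray_start  # Ray-internal last
```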
By running pre-commit run --all, psutil gets placed at the end of the imports.
Signed-off-by: machichima <[email protected]>
Moved the test to the file […]. When I use […], I get:

RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
Why are these changes needed?
During the execution of tail_job_logs() after job submission, if the connection to the Ray head breaks, tail_job_logs() does not raise any error. An error should be raised.
This change queries the job status when receiving each message, and raises an error if the connection closes while the job is not in a terminal state.
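A condensed sketch of that flow, pieced together from the diff snippets in the conversation above. The function name and wiring are illustrative, not the exact sdk.py code, and this is the per-message variant that the automated review above suggests optimizing:

```python
import aiohttp

from ray.job_submission import JobSubmissionClient


async def tail_with_status_tracking(
    client: JobSubmissionClient, ws: aiohttp.ClientWebSocketResponse, job_id: str
):
    """Yield log lines while tracking the latest job status on every message."""
    job_status = None
    while True:
        msg = await ws.receive()
        if msg.type == aiohttp.WSMsgType.TEXT:
            try:
                # Refresh the status alongside each received log message.
                job_status = client.get_job_info(job_id).status
            except Exception as e:
                raise RuntimeError(f"Failed to get job status for {job_id}.") from e
            yield msg.data
        elif msg.type in (aiohttp.WSMsgType.CLOSED, aiohttp.WSMsgType.ERROR):
            # A close/error before the job is terminal means the head
            # connection was likely lost; raise instead of returning silently.
            if job_status is None or not job_status.is_terminal():
                raise RuntimeError(
                    f"Log tailing for job {job_id} ended unexpectedly "
                    f"(last known status: {job_status})."
                )
            break
```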
Related issue number
Closes: #57002
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.

Note
Enhances tail_job_logs to track job status and raise errors if the WebSocket closes or errors before the job reaches a terminal state.
- In python/ray/dashboard/modules/job/sdk.py, call get_job_info on each received message and store job_status.
- On CLOSED or ERROR, raise RuntimeError if job_status is not terminal; otherwise exit cleanly.
- Update the Raises docstring to cover unexpected connection closure before a terminal state.

Written by Cursor Bugbot for commit c011722.