Merged
Changes from 4 commits
Commits
24 commits
efa79d6
fix: raise err if job not terminate or has err
machichima Sep 30, 2025
aea0717
refactor: fmt
machichima Sep 30, 2025
c011722
fix: not raise err if job terminated
machichima Oct 1, 2025
d4b73e6
fix: get abnormal close from ws
machichima Oct 5, 2025
83ad810
refactor: remove print and use logger
machichima Oct 8, 2025
21e1ecc
fix: log instead of raise error when ws Error
machichima Oct 9, 2025
0dcb9aa
Merge branch 'master' of github.com:ray-project/ray into 57002-tail-l…
machichima Oct 9, 2025
f1bfdd5
refactor: format
machichima Oct 9, 2025
e62dac9
Merge branch 'master' into 57002-tail-log-error-handle
machichima Oct 10, 2025
fc81dee
Merge branch 'master' into 57002-tail-log-error-handle
machichima Oct 11, 2025
d1c0c29
refactor: fix error message
machichima Oct 13, 2025
899c05d
test: add test for abnormal close for tail job log
machichima Oct 13, 2025
a21145f
fix: remove unused tmp dir
machichima Oct 13, 2025
a2e76fd
fix: dashboard-port to port
machichima Oct 13, 2025
2df9d50
refactor: lint
machichima Oct 13, 2025
b51feb2
Merge branch 'master' into 57002-tail-log-error-handle
machichima Oct 13, 2025
8035a6b
test: update test to use existing ray cluster
machichima Oct 14, 2025
681a536
refactor: lint
machichima Oct 14, 2025
5ed2be4
test: use ray_start_regular fixture
machichima Oct 15, 2025
1406cd3
test: kill dashboard instead
machichima Oct 15, 2025
c1e93cd
refactor: lint
machichima Oct 15, 2025
905d742
fix: move test to test_sdk to prevent ray cluster collision
machichima Oct 16, 2025
238e0fb
Merge branch 'master' into 57002-tail-log-error-handle
machichima Oct 16, 2025
5186f4c
Merge branch 'master' into 57002-tail-log-error-handle
machichima Oct 18, 2025
16 changes: 13 additions & 3 deletions python/ray/dashboard/modules/job/sdk.py
@@ -482,8 +482,9 @@ async def tail_job_logs(self, job_id: str) -> AsyncIterator[str]:
             The iterator.

         Raises:
-            RuntimeError: If the job does not exist or if the request to the
-                job server fails.
+            RuntimeError: If the job does not exist, if the request to the
Collaborator

Is it easy to write a test for it?

Contributor Author

Let me give it a try!

Contributor Author

Added in 899c05d

+                job server fails, or if the connection closes unexpectedly
+                before the job reaches a terminal state.
         """
         async with aiohttp.ClientSession(
             cookies=self._cookies, headers=self._headers
@@ -498,6 +499,15 @@ async def tail_job_logs(self, job_id: str) -> AsyncIterator[str]:
                 if msg.type == aiohttp.WSMsgType.TEXT:
                     yield msg.data
                 elif msg.type == aiohttp.WSMsgType.CLOSED:
+                    print(f"Close code: {ws.close_code}")
Collaborator

Delete this or use the logger.

+                    if ws.close_code == aiohttp.WSCloseCode.ABNORMAL_CLOSURE:
+                        raise RuntimeError(
+                            f"WebSocket connection closed unexpectedly while job with close code {ws.close_code}"
+                        )

Bug: WebSocket Error Message Formatting Issue

The RuntimeError message for abnormal WebSocket closures is malformed, missing the job_id and containing grammatical errors.
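
A minimal sketch of a message that names the job and reads cleanly. The helper `format_abnormal_close_message` is hypothetical and not part of the PR; only `job_id` and the close code come from the diff above.

```python
def format_abnormal_close_message(job_id: str, close_code: int) -> str:
    """Build an error message that identifies the job and the close code."""
    # Hypothetical helper for illustration; the PR inlines its message instead.
    return (
        f"WebSocket connection closed unexpectedly while tailing logs for "
        f"job {job_id} (close code {close_code})."
    )
```

Calling it with an example job ID and close code 1006 yields a single sentence that includes both values.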

Bug: WebSocket Closure Triggers Unintended Job Errors

The tail_job_logs function raises a RuntimeError on any abnormal WebSocket closure without checking the job's status. The original intent was to only raise if the job isn't in a terminal state. This can lead to false positive errors, reporting successful jobs as failed due to unrelated network issues.
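
One way to express the intended check, as a sketch under assumptions: raise only when the close was abnormal and the job has not reached a terminal state. The function name `should_raise_on_close` and the plain-string statuses are hypothetical; the terminal set of `SUCCEEDED`, `FAILED`, and `STOPPED` is assumed here, and real code would query the job status from the server rather than receive it as an argument.

```python
from typing import Optional

# RFC 6455 close code for a connection dropped without a close frame;
# aiohttp exposes the same value as aiohttp.WSCloseCode.ABNORMAL_CLOSURE.
ABNORMAL_CLOSURE = 1006

# Assumed set of terminal job states for this sketch.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "STOPPED"})


def should_raise_on_close(close_code: Optional[int], job_status: str) -> bool:
    """Return True only for an abnormal close while the job is still active."""
    if close_code != ABNORMAL_CLOSURE:
        # A normal close (or no recorded code) is never treated as an error.
        return False
    # An abnormal close after the job finished is not a log-tailing failure.
    return job_status not in TERMINAL_STATES
```

With this shape, a dropped connection after the job has already succeeded ends the iterator quietly, while the same drop during a running job surfaces as an error.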

                     break
                 elif msg.type == aiohttp.WSMsgType.ERROR:
-                    pass
+                    # Old Ray versions may send ERROR on connection close
Collaborator

When did this behavior change?

Contributor Author

machichima Oct 15, 2025

I hit the error in the test python/ray/dashboard/modules/job/tests/test_backwards_compatibility.py::TestBackwardsCompatibility::test_cli, which reaches this branch. That test runs test_backwards_compatibility.sh, which uses Ray 2.0.1.

+                    raise RuntimeError(
+                        f"WebSocket error while tailing logs for job {job_id}. Err: {ws.exception()}"
+                    )
Comment on lines 511 to 514
Collaborator

Why do we need to handle old Ray versions here? I think we only support a job client and job server of the same version.

Collaborator

The client only uses the HTTP/WebSocket protocol, so the compatibility requirements are looser than that. We don't give an exact guarantee, though.
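
The looser-compatibility point can be illustrated with a sketch in which an ERROR message ends the stream with a logged warning instead of an exception, so an older server that sends ERROR on close does not fail the client. `MsgType` and `drain_log_messages` are hypothetical stand-ins (the real code uses `aiohttp.WSMsgType` inside an async loop), and this is not the PR's implementation.

```python
import enum
import logging
from typing import Iterable, Iterator, Optional, Tuple

logger = logging.getLogger(__name__)


class MsgType(enum.Enum):
    """Hypothetical stand-in for the aiohttp.WSMsgType members used in the diff."""

    TEXT = "text"
    CLOSED = "closed"
    ERROR = "error"


def drain_log_messages(
    messages: Iterable[Tuple[MsgType, Optional[str]]], job_id: str
) -> Iterator[str]:
    """Yield log text until the stream closes; log, rather than raise, on ERROR."""
    for msg_type, data in messages:
        if msg_type is MsgType.TEXT:
            yield data
        elif msg_type is MsgType.CLOSED:
            break
        elif msg_type is MsgType.ERROR:
            # Assumption: older servers may send ERROR when the connection
            # closes, so the stream is ended without failing the caller.
            logger.warning("WebSocket error while tailing logs for job %s", job_id)
            break
```

Anything that arrives after the ERROR message is ignored, which matches the `break` in the diff.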

+                    break
Member

I LOVE THIS!