@terrykong
Contributor

Closes: #1019

Example output:

```
+ echo '[ERROR] Background srun '\''ray-worker-1'\'' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.'
[ERROR] Background srun 'ray-worker-1' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.
+ touch /logdir/5336464-logs/ENDED
+ exit 1
```
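
For readers who want a feel for how a trace like the one above could be produced, here is a minimal sketch. It is not the actual change to `ray.sub`: the function name `check_background_job` and its arguments are hypothetical, and it assumes that liveness is probed with `kill -0`, which only tests whether a pid still exists and sends no signal.

```shell
# Hypothetical helper, not taken from ray.sub.
# Arguments: $1 = job name, $2 = pid of the background srun, $3 = log directory.
# Returns 0 if the pid is still alive. Otherwise prints the error message,
# creates the ENDED marker file in the log directory, and returns 1.
check_background_job() {
  # kill -0 delivers no signal; it fails if the pid no longer exists.
  if ! kill -0 "$2" 2>/dev/null; then
    echo "[ERROR] Background srun '$1' died (pid=$2). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit."
    touch "$3/ENDED"
    return 1
  fi
  return 0
}
```

A caller would typically poll this from its main wait loop, for example `check_background_job ray-worker-1 "$worker_pid" "$LOG_DIR" || exit 1`, where `worker_pid` and `LOG_DIR` are likewise assumed names. Failing fast this way is what prevents the job from hanging when one node never starts its `srun`.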

Signed-off-by: Terry Kong <[email protected]>

This reverts commit 63e2f6a.

Signed-off-by: Terry Kong <[email protected]>

This reverts commit c7ad390.

Signed-off-by: Terry Kong <[email protected]>

Example output:
```
+ echo '[ERROR] Background srun '\''ray-worker-1'\'' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.'
[ERROR] Background srun 'ray-worker-1' died (pid=123456). Could be a failure in startup or an issue with the node preventing the srun to start. Attempting to exit.
+ touch /logdir/5336464-logs/ENDED
+ exit 1
```

Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong requested a review from hemildesai August 29, 2025 05:40
@terrykong terrykong enabled auto-merge August 29, 2025 05:40
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
github-merge-queue bot pushed a commit that referenced this pull request Sep 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 3, 2025
Merged via the queue into main with commit acabc79 Sep 3, 2025
21 checks passed
@terrykong terrykong deleted the tk/error-on-meta-failure branch September 3, 2025 22:19
wangshangsam pushed a commit that referenced this pull request Sep 4, 2025
terrykong added a commit that referenced this pull request Sep 6, 2025
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Development

Successfully merging this pull request may close these issues.

ray.sub hangs unnecessarily if one node is busted

3 participants