Commit 96bc3b6
[train] Exit actor and log appropriately when poll_workers is in terminal state (#58287)
1) Add a traceback to every `ControllerError` and log it when making a
failure decision, so we can see where `Worker group is not active. Call
WorkerGroup.create() to create a new worker group.` is coming from. **I
also sanity-checked that this does not cause
`UserExceptionWithTraceback` to print the traceback twice, because the
change only applies to `ControllerError`.**
2) `_poll_workers` has the only `asyncio.sleep` in the Ray Train
controller. After waking up, it exits from the foreground asyncio task
if its state is terminal, which can happen due to the issue mentioned in
5).
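
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this commit: the `State` enum, the `Controller` class, its `poll_workers` and `run` methods, and the `traceback_str` attribute are all hypothetical stand-ins, and this `ControllerError` is a toy class rather than Ray Train's. It shows an error wrapper that formats the traceback eagerly so it can be logged later, and a poll loop that re-checks its state after `asyncio.sleep` and exits if the state turned terminal while it slept.

```python
import asyncio
import traceback
from enum import Enum


class State(Enum):
    # Hypothetical controller states; the real controller has its own state types.
    RUNNING = "RUNNING"
    ERRORED = "ERRORED"
    FINISHED = "FINISHED"


TERMINAL_STATES = {State.ERRORED, State.FINISHED}


class ControllerError(Exception):
    """Toy stand-in: wraps a controller-side failure and keeps its traceback."""

    def __init__(self, cause: BaseException):
        super().__init__(str(cause))
        self.cause = cause
        # Format the traceback at wrap time so it is still available
        # when a failure decision is logged later.
        self.traceback_str = "".join(
            traceback.format_exception(type(cause), cause, cause.__traceback__)
        )


class Controller:
    def __init__(self, poll_interval_s: float = 0.05):
        self.state = State.RUNNING
        self.poll_interval_s = poll_interval_s
        self.polls = 0  # number of polls that actually ran

    async def poll_workers(self):
        # The sleep is the point where other tasks can run and change the state.
        await asyncio.sleep(self.poll_interval_s)
        if self.state in TERMINAL_STATES:
            # State became terminal while sleeping: skip the poll entirely.
            return None
        self.polls += 1
        return "ok"

    async def run(self):
        while self.state not in TERMINAL_STATES:
            status = await self.poll_workers()
            if status is None:
                break  # exit the foreground task instead of polling further
```

In this sketch, if another task sets `controller.state = State.ERRORED` while `poll_workers` is asleep, `run` returns without touching the workers again, and a wrapped error's `traceback_str` can be passed to a logger when the failure is handled.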
---------
Signed-off-by: Timothy Seah <[email protected]>

1 parent 957568d commit 96bc3b6
File tree
3 files changed: +13 -2 lines changed
- python/ray/train/v2
  - _internal/execution
    - controller
    - failure_handling
  - api
Lines changed: 6 additions & 0 deletions
@@ -274,6 +274,12 @@ (diff body not captured; 6 lines added at new lines 277-282)
Lines changed: 6 additions & 2 deletions
@@ -44,8 +44,12 @@ (diff body not captured; original lines 47-48 removed, 6 lines added at new lines 47-52)
@@ -43,6 +43,7 @@ (diff body not captured; 1 line added at new line 46)