Commit f5abbb8
[train] after_worker_group_poll_status errors result in ControllerError (#57869)
# Summary
We observed that whenever `after_worker_group_poll_status` raised an
exception, the Train Run would fail ungracefully and show up as
`ABORTED` in the dashboard. This happened in the following situations:
1) Different workers report remote checkpoints with different paths ->
`(TrainController pid=46993) RuntimeError: The storage path of the
checkpoints in the training results is not the same. This means the
checkpoints are not consistent. Got a mix of the following checkpoint
paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'}` -> `ABORTED` Train Run
2) `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in
`train_func` -> `TypeError: Object of type 'ellipsis' is not JSON
serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run
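Both failure modes can be reproduced outside Ray with the standard library alone. The sketch below is not Ray Train code: `check_consistent_storage_paths` and `try_call` are hypothetical helpers written for illustration, and `json.dumps` stands in for whatever serialization `CheckpointManager._save_state` performs.

```python
import json


def check_consistent_storage_paths(paths):
    """Hypothetical stand-in: reject checkpoints reported under different paths."""
    unique_paths = set(paths)
    if len(unique_paths) > 1:
        raise RuntimeError(
            f"Got a mix of the following checkpoint paths: {unique_paths}"
        )


def try_call(fn, *args):
    """Return the exception raised by fn(*args), or None if it succeeds."""
    try:
        fn(*args)
    except Exception as exc:
        return exc
    return None


# Situation 1: two workers report checkpoints under different directories.
path_error = try_call(
    check_consistent_storage_paths, ["/tmp/tmpl95kv7ax", "/tmp/tmp__8e6etk"]
)

# Situation 2: a literal `...` (Ellipsis) left in the reported metrics.
json_error = try_call(json.dumps, {"loss": ...})

print(type(path_error).__name__, "/", type(json_error).__name__)
```

In both cases the exception originates from user-supplied input rather than from a defect in the controller, which is the distinction the rest of this description relies on.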
This PR catches these exceptions, wraps them in a `ControllerError`, and
goes through the `FailurePolicy`, ultimately resulting in an `ERRORED`
Train Run, which is more intuitive because it happened due to an error
in the training workers (`The Train run failed due to an error in the
training workers.` is the comment associated with `RunStatus.ERRORED`).
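The control flow described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this commit: `ControllerError` here is a local stand-in class, and `run_poll_callbacks`, `decide`, `BadCheckpointCallback`, and the status strings are hypothetical names. A real failure policy may also choose to retry rather than fail the run.

```python
class ControllerError(Exception):
    """Local stand-in for Ray Train's ControllerError (shape is assumed)."""

    def __init__(self, controller_failure):
        super().__init__(str(controller_failure))
        self.controller_failure = controller_failure


def run_poll_callbacks(callbacks, poll_status):
    """Run each callback hook; return a ControllerError instead of raising."""
    for callback in callbacks:
        try:
            callback.after_worker_group_poll_status(poll_status)
        except Exception as exc:
            # Wrap the original exception so a failure policy can inspect it.
            return ControllerError(exc)
    return None


def decide(error):
    """Hypothetical failure-policy step: any wrapped error fails the run."""
    return "ERRORED" if error is not None else "RUNNING"


class BadCheckpointCallback:
    """Hypothetical callback that fails the way situation 1 does."""

    def after_worker_group_poll_status(self, poll_status):
        raise RuntimeError("checkpoints are not consistent")


error = run_poll_callbacks([BadCheckpointCallback()], poll_status=None)
status = decide(error)
print(status, "caused by", type(error.controller_failure).__name__)
```

Returning the wrapped error instead of letting it propagate is what keeps the run on the handled path (`ERRORED`) rather than the unhandled one (`ABORTED`).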
I considered implementing a more general solution that caught all
`WorkerGroupCallback` errors and resurfaced them as `ControllerError`s,
but decided against it because:
* Callbacks occur in many different places and we might want to add
custom try/catch logic in each case.
* `after_worker_group_poll_status` is the only offender so far and most
of its errors are from user mistakes; other callback errors could be
legitimate bugs that should result in `ABORTED`
# Testing
Unit tests
---------
Signed-off-by: Timothy Seah <[email protected]>
1 parent 91685a7 commit f5abbb8
3 files changed (+27 -7 lines changed) under python/ray/train/v2:
- _internal/execution/controller
- tests
Lines changed (first file in the diff): 10 additions & 1 deletion