Commit 96bc3b6

[train] Exit actor and log appropriately when poll_workers is in terminal state (#58287)
1) Add traceback to all `ControllerError`s and log it when making a failure decision so we can see where `Worker group is not active. Call WorkerGroup.create() to create a new worker group.` is coming from. **I also sanity checked that this does not cause `UserExceptionWithTraceback` to double print the traceback because this only applies to `ControllerError`.**

2) `_poll_workers` has the only `asyncio.sleep` in the Ray Train controller. After waking up, it exits from the foreground asyncio task if its state is terminal, which can happen due to the issue mentioned in 5).

Signed-off-by: Timothy Seah <[email protected]>
1 parent 957568d commit 96bc3b6

3 files changed: +13 -2 lines changed

python/ray/train/v2/_internal/execution/controller/controller.py

Lines changed: 6 additions & 0 deletions
@@ -274,6 +274,12 @@ async def _poll_workers(self) -> WorkerGroupPollStatus:
                 self._health_check_interval_s - time_since_last_poll, 0
             )
             await asyncio.sleep(remaining_time)
+            if self.get_state().is_terminal():
+                logger.debug(
+                    f"Controller is unexpectedly in terminal state {self.get_state()} after "
+                    "sleeping and before polling workers. Exiting actor."
+                )
+                ray.actor.exit_actor()

         status = self._worker_group.poll_status(timeout=self._health_check_interval_s)
         self._latest_poll_time = time_monotonic()
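
The re-check after `asyncio.sleep` matters because other tasks can run while the coroutine is suspended. Below is a minimal sketch of that pattern in plain asyncio, without Ray. `State`, `ToyController`, and `poll_workers` are hypothetical names invented for this illustration, and returning `False` stands in for the `ray.actor.exit_actor()` call in the diff.

```python
import asyncio
import enum


class State(enum.Enum):
    RUNNING = "running"
    ERRORED = "errored"

    def is_terminal(self) -> bool:
        return self is State.ERRORED


class ToyController:
    """Hypothetical stand-in for the controller; not Ray Train's API."""

    def __init__(self, interval_s: float = 0.01):
        self.state = State.RUNNING
        self.interval_s = interval_s
        self.polls = 0

    async def poll_workers(self) -> bool:
        """Return True if a poll ran, False if the state became terminal during the sleep."""
        await asyncio.sleep(self.interval_s)
        # Another task may have changed the state while this one was suspended.
        if self.state.is_terminal():
            return False  # the commit calls ray.actor.exit_actor() at this point
        self.polls += 1
        return True


async def demo():
    controller = ToyController()
    first = await controller.poll_workers()

    async def abort() -> None:
        controller.state = State.ERRORED

    # Run abort() concurrently so it lands while poll_workers() is sleeping.
    second, _ = await asyncio.gather(controller.poll_workers(), abort())
    return first, second, controller.polls


result = asyncio.run(demo())
```

In this sketch the second call skips its poll because the state flipped during the sleep, which is the window the commit guards against.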

python/ray/train/v2/_internal/execution/failure_handling/default.py

Lines changed: 6 additions & 2 deletions
@@ -44,8 +44,12 @@ def _log_decision(
         logger.info(
             f"[FailurePolicy] {decision.value}\n"
             f" Source: {error_source}\n"
-            f" Error count: {error_count} (max allowed: {retry_limit})\n\n"
-            f"{training_failed_error}"
+            f" Error count: {error_count} (max allowed: {retry_limit})\n\n",
+            exc_info=(
+                type(training_failed_error),
+                training_failed_error,
+                training_failed_error.__traceback__,
+            ),
         )

     def _is_retryable_error(self, training_failed_error: TrainingFailedError) -> bool:
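
Passing an explicit `(type, value, traceback)` tuple as `exc_info` lets the standard `logging` module print a traceback for an exception object that is not currently being handled. A small standalone sketch of that behavior follows; the logger name, `make_error`, and the message text are illustrative assumptions, not Ray code.

```python
import io
import logging

logger = logging.getLogger("failure_policy_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))


def make_error() -> Exception:
    try:
        raise RuntimeError("Worker group is not active.")
    except RuntimeError as exc:
        # The traceback stays attached to the exception object after it is returned.
        return exc


error = make_error()

# No exception is being handled at this point, so exc_info=True would find
# nothing to report. The explicit tuple points logging at the stored traceback.
logger.info(
    "[FailurePolicy] RAISE\n",
    exc_info=(type(error), error, error.__traceback__),
)
logged = stream.getvalue()
```

The captured output contains the message followed by the formatted traceback, including the frame where the error was originally raised.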

python/ray/train/v2/api/exceptions.py

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ def __init__(self, controller_failure: Exception):
             "Training failed due to controller error:\n" + str(controller_failure)
         )
         self.controller_failure = controller_failure
+        self.with_traceback(controller_failure.__traceback__)

     def __reduce__(self):
         return (self.__class__, (self.controller_failure,))
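
To illustrate what the added `with_traceback` call achieves, here is a minimal sketch with a hypothetical `WrapperError` class that mirrors the shape of the constructor in the diff. `WrapperError` and `failing_step` are invented for this example. The wrapper is never raised itself, so without the call its `__traceback__` would be `None`.

```python
import traceback


class WrapperError(Exception):
    """Hypothetical analogue of the wrapper exception in the diff."""

    def __init__(self, inner: Exception):
        super().__init__("Training failed due to controller error:\n" + str(inner))
        self.inner = inner
        # Borrow the inner exception's traceback so that formatting the wrapper
        # shows where the inner error was raised.
        self.with_traceback(inner.__traceback__)

    def __reduce__(self):
        return (self.__class__, (self.inner,))


def failing_step() -> None:
    raise ValueError("worker group is not active")


try:
    failing_step()
except ValueError as exc:
    wrapped = WrapperError(exc)

# Format the wrapper with its borrowed traceback, as a logging handler would.
formatted = "".join(
    traceback.format_exception(type(wrapped), wrapped, wrapped.__traceback__)
)
```

One caveat, stated for the standard `pickle` module only: `__traceback__` is not serialized, so an instance rebuilt through `__reduce__` in another process carries only whatever traceback the unpickled inner exception has.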
