[Request] Give warning/error when job ends in 'Stopped' rather than 'Completed' #1937

@athewsey

Describe the feature you'd like

Jobs that end as Stopped (when they could have Completed) should raise some kind of warning, or even an error, rather than passing silently as they do today.

How would this feature be used? Please describe.

Common reasons a job might be Stopped rather than Completed include:

  • The job timed out (training may or may not still have exported a checkpoint or final model)
  • A custom detective control in the environment specifically terminated the job via a Stop*Job call (e.g. out of budget, security policy violation, etc.)

In many such cases the job did not terminate healthily, and where it did, the developer must have taken explicit steps to make that possible (e.g. implementing checkpointing).
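As a rough illustration of telling these cases apart today, the sketch below inspects a job description dict of the shape returned by the DescribeTrainingJob API. The helper name `describe_stop_reason` is hypothetical, and the mapping of secondary statuses to causes is an assumption based on the documented `SecondaryStatus` values, not something the SDK does for you.

```python
from typing import Mapping

# Secondary statuses that indicate a time limit was hit (assumed from the
# documented SecondaryStatus values for DescribeTrainingJob).
TIMEOUT_SECONDARY_STATUSES = {"MaxRuntimeExceeded", "MaxWaitTimeExceeded"}


def describe_stop_reason(desc: Mapping[str, object]) -> str:
    """Summarise how a training job ended, given a DescribeTrainingJob-style dict.

    Hypothetical helper: it only reads the 'TrainingJobStatus' and
    'SecondaryStatus' keys and makes no AWS calls itself.
    """
    status = desc.get("TrainingJobStatus")
    secondary = desc.get("SecondaryStatus")
    if status != "Stopped":
        return f"not stopped (status={status})"
    if secondary in TIMEOUT_SECONDARY_STATUSES:
        return f"stopped by timeout ({secondary})"
    # Anything else is treated as a stop request, e.g. a Stop*Job call.
    return "stopped for another reason, e.g. an explicit stop request"
```

The input dict could come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`; whether a stopped job still produced usable artifacts has to be checked separately.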

Therefore the SDK's current pattern of treating Stopped as a success is misleading, both to inexperienced users ("The .fit() cell ran with no errors right? Everything must be fine") and to experienced users who might not realise they're working in an environment with detective controls implemented ("Why does it keep not saving the model!? I do it right there in the script!").

Describe alternatives you've considered

  1. Current behaviour (Stopped == Completed)
    • Not ideal for the reasons described above, but backward-compatible
  2. print() a warning message on 'Stopped'
    • Still easy to ignore, particularly if the job generated a lot of logs already before stopping and the warning is just added on below. Still doesn't interrupt code execution.
    • ...but simple and not breaking
  3. Raise a Python warning on 'Stopped'
    • Nice and visible in display: IPython will render in a red box much like uncaught errors. Doesn't break existing code flows.
    • ...but default warnings settings are a bit weird in notebook kernels: Easy for users to have the warning set to "once", in which case it will only display the first time it's triggered - which could be even more confusing. Still doesn't interrupt code execution.
  4. Raise a specific error on 'Stopped'
    • Breaking change in the (unusual?) case of code flows that use job timeout as standard (rather than other stopping conditions)
    • ...but sets a nice intuitive behaviour that your notebook cell will terminate nicely if your model/processing runs successfully, and error otherwise.
    • Would also not pollute logs/warnings when the condition is expected, since users who anticipate it can easily catch and handle it explicitly.
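For comparison, option (3) could look roughly like the sketch below. `JobStoppedWarning` and `warn_if_stopped` are hypothetical names, not part of the SDK; the filter line shows one way a user could sidestep the "once" behaviour mentioned above.

```python
import warnings


class JobStoppedWarning(UserWarning):
    """Hypothetical warning category for jobs that end as 'Stopped'."""


def warn_if_stopped(job_name: str, status: str) -> None:
    """Emit a JobStoppedWarning when the final job status is 'Stopped'."""
    if status == "Stopped":
        warnings.warn(
            f"Job {job_name} ended as 'Stopped', not 'Completed'; "
            "outputs may be missing or incomplete.",
            JobStoppedWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )


# Opt in to seeing this category every time it is triggered, regardless of
# any "once" or "default" filter already configured in the kernel.
warnings.simplefilter("always", JobStoppedWarning)
```

A dedicated category also lets users who want stricter behaviour escalate it themselves with `warnings.simplefilter("error", JobStoppedWarning)`, which turns the warning into a raised exception.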

(4) seems like a nice solution, so long as the logic to catch that specific error (and not Failed) is reasonably intuitive.
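A minimal sketch of what option (4) might look like from the user's side, assuming a dedicated exception type for Stopped that is distinct from the one for Failed. All class and function names here are hypothetical stand-ins, not the SDK's actual exceptions or its status-checking logic.

```python
class JobEndedAbnormallyError(RuntimeError):
    """Hypothetical base class for jobs that did not end as 'Completed'."""

    def __init__(self, job_name: str, status: str):
        super().__init__(f"Job {job_name} ended with status {status!r}")
        self.job_name = job_name
        self.status = status


class JobFailedError(JobEndedAbnormallyError):
    """Hypothetical error for jobs that end as 'Failed'."""


class JobStoppedError(JobEndedAbnormallyError):
    """Hypothetical error for jobs that end as 'Stopped'."""


def check_final_status(job_name: str, status: str) -> None:
    """Return silently on 'Completed'; raise a specific error otherwise."""
    if status == "Completed":
        return
    if status == "Stopped":
        raise JobStoppedError(job_name, status)
    if status == "Failed":
        raise JobFailedError(job_name, status)
    raise JobEndedAbnormallyError(job_name, status)


# A user who expects timeouts catches only the Stopped case;
# a Failed job would still propagate as an error.
try:
    check_final_status("my-job", "Stopped")
except JobStoppedError as err:
    print(f"Expected stop, continuing from checkpoint: {err}")
```

Sharing a base class keeps a broad `except JobEndedAbnormallyError` possible, while the subclass lets code flows that rely on job timeouts handle only that case.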

Additional context

It seems like the relevant implementation is in Session._check_job_status().
