
Saving checkpoints at interrupt #3

Open
Ridhamz-nd opened this issue Apr 8, 2024 · 5 comments

Comments

@Ridhamz-nd

Thank you for providing example implementations!

I was wondering what signal is sent to the Docker container when a spot training job is interrupted. Is it SIGKILL, or SIGTERM with some grace period (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopTrainingJob.html)?

I was looking to implement a signal handler that, on SIGTERM, saves the latest checkpoint to S3 so that training can resume from the exact point of interruption.
Is this possible? Do we need to account for the time it takes the uploader service to upload the contents of /opt/ml/checkpoints to the checkpoint_s3_uri?

Any guidance on how to resume from the latest stopping point would be much appreciated.
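
A minimal sketch of the handler I have in mind, assuming the checkpoint is written to /opt/ml/checkpoints and the uploader service syncs it to checkpoint_s3_uri (save_checkpoint here is a placeholder helper, not a SageMaker API):

```python
import signal

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # synced by SageMaker to checkpoint_s3_uri

stop_requested = False

def handle_sigterm(signum, frame):
    # Keep the handler minimal: just record that an interruption was
    # requested; the training loop performs the actual (slow) save.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Inside the training loop, after each step:
#     if stop_requested:
#         save_checkpoint(f"{CHECKPOINT_DIR}/latest.pt")  # placeholder helper
#         break
```

The open question is whether the grace period is long enough for the save plus the upload of /opt/ml/checkpoints to finish.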

@Ridhamz-nd
Author

@eitansela please let me know if this is not the correct place to ask this question, and I can close this issue :)

@rst0git

rst0git commented Apr 10, 2024

I was looking to implement a signal handler that, on SIGTERM, saves the latest checkpoint to S3 so that training can resume from the exact point of interruption. Is this possible?

@Ridhamz-nd What would happen if SIGKILL is used instead? You would also need to make sure that a checkpoint is created only when it is necessary, not every time SIGTERM is received, as this may introduce significant performance overhead.

@Ridhamz-nd
Author

@rst0git I don't think a signal handler can be attached to SIGKILL (https://man7.org/linux/man-pages/man7/signal.7.html); once SIGKILL is sent, the process is terminated immediately. Based on the SageMaker docs, SIGTERM is sent only once, with a grace period of 120 s.
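
A quick Python check illustrates the difference: a handler can be installed for SIGTERM, but attempting the same for SIGKILL fails:

```python
import signal

def handler(signum, frame):
    print(f"caught signal {signum}")

# SIGTERM can be trapped and handled gracefully.
signal.signal(signal.SIGTERM, handler)

# SIGKILL cannot be caught, blocked, or ignored; installing a handler fails.
try:
    signal.signal(signal.SIGKILL, handler)
except OSError as exc:
    print(f"cannot handle SIGKILL: {exc}")
```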

@eitansela
Contributor

You should save a checkpoint to /opt/ml/checkpoints after each epoch, and SageMaker takes care of copying it to checkpoint_s3_uri for you. It is not a matter of speed: for a long training job of a few hours or a few days, why would a SIGTERM help here? If a Spot instance goes down, you lose a few minutes of training and resume from the last checkpoint once a new Spot instance is available.
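
A rough sketch of that per-epoch pattern in PyTorch (the tiny linear model and training step here are just stand-ins for a real training loop):

```python
import os
import torch
import torch.nn as nn

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # SageMaker syncs this dir to checkpoint_s3_uri
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "latest.pt")

model = nn.Linear(10, 1)                # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_epochs = 10

start_epoch = 0
# On restart after a Spot interruption, SageMaker downloads the previously
# uploaded checkpoints back into /opt/ml/checkpoints, so resume from there.
if os.path.isfile(CHECKPOINT_PATH):
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, num_epochs):
    # ... one epoch of training on the stand-in model ...
    loss = model(torch.randn(32, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Save after each completed epoch; SageMaker uploads it in the background.
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )
```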

@Ridhamz-nd
Author

You're right that I would only lose a few minutes of training if I were training on a single node.
However, if I'm training on p4d/p5 instances, which have a > 20% interruption rate in most regions, and doing multi-node training, then the whole job has to pause whenever any one node is reclaimed.
In that case the interruptions can add up quickly.

Also, in general, it's preferable not to lose training progress a job has already made.
Currently we counter the lost-progress issue by checkpointing frequently, but that also has a cost (especially for large models), so it would be much more convenient to get some kind of signal telling us that the job is about to be interrupted. If SIGTERM is that signal, as the docs suggest, then we can save on SIGTERM and resume from the same point.
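
For reference, a sketch of how the estimator side can be configured for Spot training with checkpointing via the SageMaker Python SDK (the role ARN, bucket names, instance type/count, and framework versions below are placeholders, not our actual job configuration):

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,                       # multi-node training
    instance_type="ml.p4d.24xlarge",
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,                # request Spot capacity
    max_run=24 * 3600,                      # training time limit (seconds)
    max_wait=48 * 3600,                     # must be >= max_run; includes waiting for Spot
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",       # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",            # default local checkpoint dir
)
estimator.fit({"training": "s3://my-bucket/data/"})
```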
