Training successful but submitting function throws exception #1412

@dav009

Describe the bug

  • I am training a simple NN on SageMaker via the TensorFlow estimator. Everything works if I don't use callbacks to create checkpoints.
  • If I use callbacks to create checkpoints, training still completes and the log ends with:
2020-04-15 05:51:54,719 sagemaker-containers INFO Reporting training SUCCESS

However, the script that submits the SageMaker training job raises an exception:

Traceback (most recent call last):
  File "train.py", line 5, in <module>
    fire.Fire(train)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/dav009/code/idio/deep2/deepground/deepground/sagemaker.py", line 212, in train
    estimator.fit({"train": train_data, "validation": eval_data})
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py", line 479, in fit
    fit_super()
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py", line 458, in fit_super
    super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/estimator.py", line 477, in fit
    self.latest_training_job.wait(logs=logs)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/estimator.py", line 1089, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/session.py", line 3044, in logs_for_job
    self._check_job_status(job_name, description, "TrainingJobStatus")
  File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/session.py", line 2638, in _check_job_status
    actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Training job iris-2020-04-15-05-45-10-698: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.

If I remove the callback, the training job succeeds and the submitting script exits without an exception (as expected).
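Since the container log reports SUCCESS while the job itself ends as Failed, it may help to look at the job's status history as returned by the DescribeTrainingJob API. The following is a minimal sketch, not part of the original report: `summarize_training_job` is a hypothetical helper, and the `sample` dictionary is made-up illustrative input that only mimics the response fields (`TrainingJobStatus`, `SecondaryStatus`, `FailureReason`, `SecondaryStatusTransitions`). A real response would come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`.

```python
from typing import Any, Dict, List


def summarize_training_job(description: Dict[str, Any]) -> List[str]:
    """Return readable lines for a DescribeTrainingJob-style response dict."""
    lines = [
        f"status: {description.get('TrainingJobStatus', 'unknown')}",
        f"secondary status: {description.get('SecondaryStatus', 'unknown')}",
    ]
    # FailureReason is only present when the job has failed.
    if "FailureReason" in description:
        lines.append(f"failure reason: {description['FailureReason']}")
    # Each transition records one phase the job passed through.
    for transition in description.get("SecondaryStatusTransitions", []):
        status = transition.get("Status", "?")
        message = transition.get("StatusMessage", "")
        lines.append(f"  {status}: {message}")
    return lines


# Hypothetical sample input for illustration only (not real job output).
sample = {
    "TrainingJobStatus": "Failed",
    "SecondaryStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
    "SecondaryStatusTransitions": [
        {"Status": "Training", "StatusMessage": "(placeholder message)"},
        {"Status": "Failed", "StatusMessage": "(placeholder message)"},
    ],
}

for line in summarize_training_job(sample):
    print(line)
```

The list of transitions shows which phase the job was in when it failed, which could narrow down whether the error happens during training or afterwards.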

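The callback I am adding is:

import tensorflow as tf

# Checkpoints are written locally under /opt/ml/checkpoints, one file set per epoch.
checkpoint_path = "/opt/ml/checkpoints/checkpoint-{epoch:04d}.ckpt"
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, save_freq="epoch", save_weights_only=True, verbose=1
)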

The checkpoints are successfully written during execution and also uploaded to checkpoint_s3_uri:

 32/100 [========>.....................] - ETA: 0s - loss: 0.5513 - accuracy: 0.7188
Epoch 00019: saving model to /opt/ml/checkpoints/checkpoint-0019.ckpt
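The file name in the log line above follows from the `{epoch:04d}` placeholder in the checkpoint path. As a small illustration, assuming the placeholder is filled in with standard Python string formatting (`checkpoint_path_for` is a hypothetical helper, not part of Keras or the SageMaker SDK):

```python
CHECKPOINT_TEMPLATE = "/opt/ml/checkpoints/checkpoint-{epoch:04d}.ckpt"


def checkpoint_path_for(epoch: int) -> str:
    """Expand the template: the epoch number is zero-padded to four digits."""
    return CHECKPOINT_TEMPLATE.format(epoch=epoch)


print(checkpoint_path_for(19))  # /opt/ml/checkpoints/checkpoint-0019.ckpt
```

Because each epoch gets its own file name, every epoch adds new files to the checkpoint directory rather than overwriting the previous ones.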

I tried using the latest release as well as the master branch from this repository.

Expected behavior
Since the checkpoint files were synchronized and training completed successfully, I expect the submit script to exit successfully.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.55.4.dev0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): tensorflow
  • Framework version: 2.1.0
  • Python version: 3.6
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): Y
