Describe the bug
- Training a simple NN on SageMaker via the TensorFlow estimator. Things go fine if I don't use callbacks to create checkpoints.
- If I use callbacks to create checkpoints, the training itself completes successfully, ending in:
2020-04-15 05:51:54,719 sagemaker-containers INFO Reporting training SUCCESS
However, the script that submits the SageMaker training job raises an exception:
Traceback (most recent call last):
File "train.py", line 5, in <module>
fire.Fire(train)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
target=component.__name__)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/dav009/code/idio/deep2/deepground/deepground/sagemaker.py", line 212, in train
estimator.fit({"train": train_data, "validation": eval_data})
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py", line 479, in fit
fit_super()
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py", line 458, in fit_super
super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/estimator.py", line 477, in fit
self.latest_training_job.wait(logs=logs)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/estimator.py", line 1089, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/session.py", line 3044, in logs_for_job
self._check_job_status(job_name, description, "TrainingJobStatus")
File "/Users/dav009/.virtualenvs/twitter_py36/lib/python3.6/site-packages/sagemaker/session.py", line 2638, in _check_job_status
actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Training job iris-2020-04-15-05-45-10-698: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
If I remove the callback, the training job is successful and the submitting script exits with no exception (as expected).
The callback I am adding is:
checkpoint_path = "/opt/ml/checkpoints/checkpoint-{epoch:04d}.ckpt"
tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, save_freq='epoch', save_weights_only=True, verbose=1
)
Checkpoints are actually successfully written during execution and also uploaded to checkpoint_s3_uri:
#015 32/100 [========>.....................] - ETA: 0s - loss: 0.5513 - accuracy: 0.7188
Epoch 00019: saving model to /opt/ml/checkpoints/checkpoint-0019.ckpt
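For context, the callback is wired into the training script roughly like this. This is a minimal sketch, not the actual script; the model and the dummy data are placeholders standing in for the real training code, and /opt/ml/checkpoints is the default checkpoint_local_path that SageMaker syncs to checkpoint_s3_uri:

```python
import numpy as np
import tensorflow as tf

# Path inside the container; SageMaker syncs this directory to checkpoint_s3_uri.
checkpoint_path = "/opt/ml/checkpoints/checkpoint-{epoch:04d}.ckpt"

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, save_freq="epoch", save_weights_only=True, verbose=1
)

# Placeholder data and model standing in for the real training code.
x_train = np.random.rand(100, 4).astype("float32")
y_train = np.random.randint(0, 3, size=(100,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Checkpoints are written at the end of every epoch, as shown in the log above.
model.fit(x_train, y_train, epochs=20, callbacks=[checkpoint_cb])
```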
I tried using the latest release as well as the master branch from this repository.
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
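A minimal sketch of the setup described above (SageMaker Python SDK 1.x). The entry point, role, bucket, instance type, and image names are placeholders, not the real values:

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train_script.py",  # training script containing the ModelCheckpoint callback
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    image_name="<account>.dkr.ecr.<region>.amazonaws.com/<custom-image>:latest",
    framework_version="2.1.0",
    py_version="py3",
    script_mode=True,
    # /opt/ml/checkpoints (the default checkpoint_local_path) is synced here.
    checkpoint_s3_uri="s3://<bucket>/iris/checkpoints/",
)

# Submitting the job; this is the call that raises UnexpectedStatusException
# when the checkpoint callback is enabled.
estimator.fit({
    "train": "s3://<bucket>/iris/train",
    "validation": "s3://<bucket>/iris/validation",
})
```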
Expected behavior
Since the checkpoint files were synchronized and the training itself completed successfully, I expect the submitting script to exit without an exception.
Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.55.4.dev0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
- Framework version: 2.1.0
- Python version: 3.6
- CPU or GPU: GPU
- Custom Docker image (Y/N): Y
Additional context
Add any other context about the problem here.