`2024-03-16 01:08:51 Training - Training image download completed. Training in progress........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-03-16 01:09:58,071 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2024-03-16 01:09:58,124 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-03-16 01:09:58,134 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2024-03-16 01:09:58,136 sagemaker_pytorch_container.training INFO Invoking TorchDistributed...
2024-03-16 01:09:58,136 sagemaker_pytorch_container.training INFO Invoking user training script.
2024-03-16 01:12:13,835 sagemaker-training-toolkit INFO Installing module with the following command:
/opt/conda/bin/python3.10 -m pip install . -r requirements.txt
Processing /opt/ml/code
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cython in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (3.0.8)
Requirement already satisfied: submitit in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 2)) (1.5.1)
Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 5)) (1.12.0)
Requirement already satisfied: onnx in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 6)) (1.15.0)
Requirement already satisfied: onnxruntime in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.17.1)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (1.26.4)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2.2.1)
Requirement already satisfied: tabulate in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (0.9.0)
Requirement already satisfied: SQLAlchemy in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (2.0.28)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (3.8.3)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (1.4.1.post1)
Requirement already satisfied: sagemaker in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 15)) (2.210.0)
Requirement already satisfied: sagemaker-training in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 16)) (4.7.4)
Requirement already satisfied: cloudpickle>=1.2.1 in /opt/conda/lib/python3.10/site-packages (from submitit->-r requirements.txt (line 2)) (2.2.1)
Requirement already satisfied: typing_extensions>=3.7.4.2 in /opt/conda/lib/python3.10/site-packages (from submitit->-r requirements.txt (line 2)) (4.10.0)
Requirement already satisfied: protobuf>=3.20.2 in /opt/conda/lib/python3.10/site-packages (from onnx->-r requirements.txt (line 6)) (3.20.3)
Requirement already satisfied: coloredlogs in /opt/conda/lib/python3.10/site-packages (from onnxruntime->-r requirements.txt (line 7)) (15.0.1)
Requirement already satisfied: flatbuffers in /opt/conda/lib/python3.10/site-packages (from onnxruntime->-r requirements.txt (line 7)) (24.3.7)
Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from onnxruntime->-r requirements.txt (line 7)) (23.1)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from onnxruntime->-r requirements.txt (line 7)) (1.12)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 10)) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 10)) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 10)) (2024.1)
Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.10/site-packages (from SQLAlchemy->-r requirements.txt (line 12)) (3.0.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (4.49.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (1.4.5)
Requirement already satisfied: pillow>=8 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 13)) (3.1.1)
Requirement already satisfied: joblib>=1.2.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->-r requirements.txt (line 14)) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->-r requirements.txt (line 14)) (3.3.0)
Requirement already satisfied: attrs<24,>=23.1.0 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (23.2.0)
Requirement already satisfied: boto3<2.0,>=1.33.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (1.34.52)
Requirement already satisfied: google-pasta in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (0.2.0)
Requirement already satisfied: smdebug-rulesconfig==1.0.1 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (1.0.1)
Requirement already satisfied: importlib-metadata<7.0,>=1.4.0 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (6.11.0)
Requirement already satisfied: pathos in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (0.3.2)
Requirement already satisfied: schema in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (0.7.5)
Requirement already satisfied: PyYAML~=6.0 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (6.0.1)
Requirement already satisfied: jsonschema in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (4.21.1)
Requirement already satisfied: platformdirs in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (4.2.0)
Requirement already satisfied: tblib<3,>=1.7.0 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (2.0.0)
Requirement already satisfied: urllib3<3.0.0,>=1.26.8 in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (1.26.18)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (2.31.0)
Requirement already satisfied: docker in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (7.0.0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (4.66.1)
Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from sagemaker->-r requirements.txt (line 15)) (5.9.8)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (1.16.0)
Requirement already satisfied: pip in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (24.0)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (1.3.4)
Requirement already satisfied: gevent in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (24.2.1)
Requirement already satisfied: inotify-simple==1.2.1 in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (1.2.1)
Requirement already satisfied: werkzeug>=0.15.5 in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (3.0.1)
Requirement already satisfied: paramiko>=2.4.2 in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (3.4.0)
Requirement already satisfied: botocore>=1.31.57 in /opt/conda/lib/python3.10/site-packages (from sagemaker-training->-r requirements.txt (line 16)) (1.34.52)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from boto3<2.0,>=1.33.3->sagemaker->-r requirements.txt (line 15)) (1.0.1)
Requirement already satisfied: s3transfer<0.11.0,>=0.10.0 in /opt/conda/lib/python3.10/site-packages (from boto3<2.0,>=1.33.3->sagemaker->-r requirements.txt (line 15)) (0.10.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.10/site-packages (from importlib-metadata<7.0,>=1.4.0->sagemaker->-r requirements.txt (line 15)) (3.17.0)
Requirement already satisfied: bcrypt>=3.2 in /opt/conda/lib/python3.10/site-packages (from paramiko>=2.4.2->sagemaker-training->-r requirements.txt (line 16)) (4.1.2)
Requirement already satisfied: cryptography>=3.3 in /opt/conda/lib/python3.10/site-packages (from paramiko>=2.4.2->sagemaker-training->-r requirements.txt (line 16)) (42.0.5)
Requirement already satisfied: pynacl>=1.5 in /opt/conda/lib/python3.10/site-packages (from paramiko>=2.4.2->sagemaker-training->-r requirements.txt (line 16)) (1.5.0)
Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/conda/lib/python3.10/site-packages (from werkzeug>=0.15.5->sagemaker-training->-r requirements.txt (line 16)) (2.1.5)
Requirement already satisfied: humanfriendly>=9.1 in /opt/conda/lib/python3.10/site-packages (from coloredlogs->onnxruntime->-r requirements.txt (line 7)) (10.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->sagemaker->-r requirements.txt (line 15)) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->sagemaker->-r requirements.txt (line 15)) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->sagemaker->-r requirements.txt (line 15)) (2024.2.2)
Requirement already satisfied: zope.event in /opt/conda/lib/python3.10/site-packages (from gevent->sagemaker-training->-r requirements.txt (line 16)) (5.0)
Requirement already satisfied: zope.interface in /opt/conda/lib/python3.10/site-packages (from gevent->sagemaker-training->-r requirements.txt (line 16)) (6.2)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/conda/lib/python3.10/site-packages (from jsonschema->sagemaker->-r requirements.txt (line 15)) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /opt/conda/lib/python3.10/site-packages (from jsonschema->sagemaker->-r requirements.txt (line 15)) (0.33.0)
Requirement already satisfied: rpds-py>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from jsonschema->sagemaker->-r requirements.txt (line 15)) (0.18.0)
Requirement already satisfied: ppft>=1.7.6.8 in /opt/conda/lib/python3.10/site-packages (from pathos->sagemaker->-r requirements.txt (line 15)) (1.7.6.8)
Requirement already satisfied: dill>=0.3.8 in /opt/conda/lib/python3.10/site-packages (from pathos->sagemaker->-r requirements.txt (line 15)) (0.3.8)
Requirement already satisfied: pox>=0.3.4 in /opt/conda/lib/python3.10/site-packages (from pathos->sagemaker->-r requirements.txt (line 15)) (0.3.4)
Requirement already satisfied: multiprocess>=0.70.16 in /opt/conda/lib/python3.10/site-packages (from pathos->sagemaker->-r requirements.txt (line 15)) (0.70.16)
Requirement already satisfied: contextlib2>=0.5.5 in /opt/conda/lib/python3.10/site-packages (from schema->sagemaker->-r requirements.txt (line 15)) (21.6.0)
Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.10/site-packages (from sympy->onnxruntime->-r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.10/site-packages (from cryptography>=3.3->paramiko>=2.4.2->sagemaker-training->-r requirements.txt (line 16)) (1.15.1)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from zope.event->gevent->sagemaker-training->-r requirements.txt (line 16)) (68.1.2)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.10/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4.2->sagemaker-training->-r requirements.txt (line 16)) (2.21)
Building wheels for collected packages: DET
Building wheel for DET (setup.py): started
Building wheel for DET (setup.py): finished with status 'done'
Created wheel for DET: filename=DET-0.1-py3-none-any.whl size=58109 sha256=cc1e4d8bf3a6a5dce23f244ad105697e6bf3154b74819103fc6608825d5216da
Stored in directory: /tmp/pip-ephem-wheel-cache-qhc9z9qg/wheels/ee/79/1e/3fb168dd34359b627e23b53045c3eb498188294150b39e2fb0
Successfully built DET
Installing collected packages: DET
Successfully installed DET-0.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
2024-03-16 01:12:16,538 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-03-16 01:12:16,538 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2024-03-16 01:12:16,618 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-03-16 01:12:16,681 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-03-16 01:12:16,691 sagemaker-training-toolkit INFO Starting distributed training through torchrun
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Reporting training FAILURE
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Python packages are not supported for torch_distributed. Please use a python script as the entry-point
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Encountered exit_code 1
2024-03-16 01:13:23 Uploading - Uploading generated training model
2024-03-16 01:13:52 Failed - Training job failed
Traceback (most recent call last):
  File "/home/sky/Desktop/det/./sagemaker_train.py", line 79, in <module>
    main(args)
  File "/home/sky/Desktop/det/./sagemaker_train.py", line 59, in main
    estimator.fit()
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 5737, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 7880, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 7933, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job det-2024-03-16-00-47-26-876: Failed. Reason: AlgorithmError: Python packages are not supported for torch_distributed. Please use a python script as the entry-point, exit code: 1
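
One reading of the log above: the toolkit ran `pip install . -r requirements.txt` and built a wheel for `DET`, so it appears to have treated `/opt/ml/code` as a Python package rather than a script, which is what the `torch_distributed` error complains about. The sketch below is a local pre-flight check under that assumption. The helper name `preflight_entry_point` is hypothetical, and the idea that a `setup.py` in the source directory triggers the package path is my guess from the log, not something the toolkit documentation has confirmed for me.

```python
import os
import tempfile


def preflight_entry_point(source_dir, entry_point):
    """Return a list of problems that might lead to the 'Python packages' error.

    Hypothetical helper: it only inspects the local source directory and
    does not call SageMaker.
    """
    problems = []
    # torch_distributed asks for a python script as the entry point.
    if not entry_point.endswith(".py"):
        problems.append(f"entry point {entry_point!r} is not a .py script")
    # The script should exist inside the directory that gets uploaded.
    if not os.path.isfile(os.path.join(source_dir, entry_point)):
        problems.append(f"{entry_point!r} not found in {source_dir!r}")
    # Assumption: a setup.py here makes the container install the code as a package.
    if os.path.isfile(os.path.join(source_dir, "setup.py")):
        problems.append("setup.py found: the code may be installed as a package")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Recreate the suspected layout: a script next to a setup.py.
        for name in ("setup.py", "train.py"):
            open(os.path.join(tmp, name), "w").close()
        for problem in preflight_entry_point(tmp, "train.py"):
            print(problem)
```

If the check reports a `setup.py`, one thing to try (again, an assumption) is pointing `source_dir` at a directory that contains only the training script and its modules, without `setup.py`.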
My Dockerfile (using a base image with some extensions):
```
# Set base image
FROM 442386744353.dkr.ecr.us-gov-west-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310
COPY ./requirements.txt /home/requirements.txt
RUN pip install -r /home/requirements.txt
# RUN pip3 install sagemaker-pytorch-training
######
# Labels as key-value pair
LABEL Maintainer="sg.alton"
# Update packages
RUN apt-get update && apt-get install gcc -y
# Install git
RUN ["apt-get", "install", "-y", "git"]
```
Finally, the code where the SageMaker estimator is called:
```
import argparse
import json
import boto3
import time
from sagemaker.pytorch.estimator import PyTorch
import sagemaker
from sagemaker.estimator import Estimator
from util import logging_utils
from aws.aws_instance_pricing import instance_dict
import os
import shutil
import subprocess
from subprocess import check_output, STDOUT


def main(args):
    # Get AWS account and region (used to build the ECR image URI)
    client = boto3.client('sts')
    account = client.get_caller_identity()['Account']
    my_session = boto3.session.Session()
    region = my_session.region_name
    ecr_image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region,
                                                                   args.algorithm_name)

    # Load the training config and copy the AWS URLs file into place
    with open(args.json_config_file_name) as f:
        config_params = json.load(f)
    src = config_params['aws_urls_paths']
    shutil.copy(src, "aws/aws_urls.json")

    # Get subnet and security_groups from AWS SSM (only the role is used here)
    cmd = """aws ssm get-parameter --name sky-param | jq '.Parameter.Value' | jq -r '.'"""
    ssm_credentials = check_output(cmd, shell=True, universal_newlines=True, stderr=STDOUT).rstrip('\n')
    ssm_credentials = json.loads(ssm_credentials)
    role = ssm_credentials['role']

    estimator = PyTorch(
        image_uri=args.custom_image_uri,  # our custom PyTorch image URI
        entry_point="main.py",  # training script
        instance_count=1,  # number of EC2 instances needed for training
        instance_type=args.instance_type,  # type of EC2 instance(s) needed for training
        disable_profiler=True,  # disable profiler, as it's not needed
        role=role,  # execution role used by the training job
        source_dir="./",
        distribution={"torch_distributed": {"enabled": True}},
    )
    print("Running estimator.fit...")
    estimator.fit()
    print("Done with estimator.fit")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--json_config_file_name', action='store', type=str, help="train_config.json")
    parser.add_argument('--algorithm_name', action='store', type=str, help="name of training algorithm/container")
    parser.add_argument('--instance_type', action='store', type=str, default='ml.g4dn.4xlarge', help='specifies the type of instance to train on')
    parser.add_argument('--custom_image_uri', action='store', type=str, help="Name of the ECR URI with custom image",
                        default="834447890.dkr.ecr.us-gov-west-1.amazonaws.com/blahblah")
    args = parser.parse_args()
    main(args)
```
Note that inside main.py various functions are called, and the primary dataloader pulls data from S3 via a boto3 client (all of this happens within the source code in the directory "./" that we launch from). This is why estimator.fit() is called with no arguments: we do not explicitly pass training data as an input.
Is anyone else running into this problem? main.py is certainly a script, not a package, and I cannot track down why the error surfaces in sagemaker/session.py line 7933. Perhaps this is a bug with a misleading error message?
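One thing the log does show: the container ran `pip install . -r requirements.txt` against `/opt/ml/code` and built a `DET-0.1` wheel from a `setup.py`, so the uploaded `source_dir` ("./") appears to contain a `setup.py`. A plausible explanation (an assumption, not confirmed anywhere in this output) is that the training toolkit treats any source directory with a top-level `setup.py` as a Python package, whatever `entry_point` says, and that the `torch_distributed` runner then rejects it. If that is what is happening, a local pre-flight check before `estimator.fit()` could look like the sketch below. The helper names `has_top_level_setup_py` and `stage_without_setup_py` are made up for illustration and are not part of the SageMaker SDK.

```python
import shutil
import tempfile
from pathlib import Path


def has_top_level_setup_py(source_dir: str) -> bool:
    """Return True if source_dir contains a setup.py at its top level."""
    return (Path(source_dir) / "setup.py").is_file()


def stage_without_setup_py(source_dir: str) -> str:
    """Copy source_dir into a fresh temp directory, omitting the top-level setup.py.

    Returns the path of the staged copy. (Hypothetical helper, not an SDK call.)
    """
    src = Path(source_dir).resolve()
    staged = Path(tempfile.mkdtemp(prefix="sm_src_")) / "code"

    def skip_top_level_setup(directory, names):
        # Only drop setup.py in the root of the tree; nested ones are kept.
        if Path(directory).resolve() == src:
            return [n for n in names if n == "setup.py"]
        return []

    shutil.copytree(src, staged, ignore=skip_top_level_setup)
    return str(staged)
```

If the check returns True, the staged copy could be passed as `source_dir` instead of "./". Whether that clears the error depends on the assumption above holding, and the project would then no longer be pip-installed inside the container, so imports in main.py would have to resolve relative to the script.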
2024-03-16 01:12:16,538 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-03-16 01:12:16,538 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2024-03-16 01:12:16,618 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-03-16 01:12:16,681 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2024-03-16 01:12:16,691 sagemaker-training-toolkit INFO Starting distributed training through torchrun
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Reporting training FAILURE
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Python packages are not supported for torch_distributed. Please use a python script as the entry-point
2024-03-16 01:12:16,691 sagemaker-training-toolkit ERROR Encountered exit_code 1
2024-03-16 01:13:23 Uploading - Uploading generated training model
2024-03-16 01:13:52 Failed - Training job failed
Traceback (most recent call last):
  File "/home/sky/Desktop/det/./sagemaker_train.py", line 79, in <module>
    main(args)
  File "/home/sky/Desktop/det/./sagemaker_train.py", line 59, in main
    estimator.fit()
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 5737, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 7880, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/sgai/lib/python3.10/site-packages/sagemaker/session.py", line 7933, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job det-2024-03-16-00-47-26-876: Failed. Reason: AlgorithmError: Python packages are not supported for torch_distributed. Please use a python script as the entry-point, exit code: 1
My Dockerfile (using a base image with some extensions):
`
`
Finally, the code where the SageMaker estimator is called:
`
It is important to note that main.py calls various functions, and the primary dataloader pulls data from S3 via a boto3 client (all of this happens within the source code in the directory "./" that we call from). This explains why estimator.fit() takes no arguments: we do not explicitly pass training data as the input.
Has anyone else run into this problem? main.py is certainly a script, not a package, and I cannot track down why the error is raised in sagemaker/session.py at line 7933. Perhaps this is a bug with a misleading error message?
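One possible explanation, offered as an assumption rather than a confirmed diagnosis: the log above shows `pip install . -r requirements.txt` and a wheel being built for DET via setup.py, so the uploaded source directory appears to contain a setup.py. If the training toolkit decides between "package" and "script" by looking for setup.py in the code directory rather than at main.py itself, the error would follow even though main.py is a plain script. The sketch below only illustrates that assumed rule; `classify_entry_point` and its return labels are hypothetical names, not part of the SageMaker SDK or the toolkit.

```python
from pathlib import Path


def classify_entry_point(source_dir: str, entry_point: str) -> str:
    """Hypothetical helper illustrating the ASSUMED classification rule.

    Assumption (not confirmed by the logs): a setup.py in the source
    directory takes priority over the entry point's own file extension.
    """
    if (Path(source_dir) / "setup.py").is_file():
        # The whole directory is treated as an installable package.
        return "python_package"
    if entry_point.endswith(".py"):
        # A plain script, which torch_distributed reportedly requires.
        return "python_script"
    # Anything else would be run as a shell command.
    return "command"


if __name__ == "__main__":
    # Check the directory that is passed to the estimator as the source.
    print(classify_entry_point(".", "main.py"))
```

If this prints `python_package` for the directory being uploaded, one thing to try is keeping setup.py out of that directory (for example, by pointing the estimator at a subdirectory that holds only main.py, the source modules, and requirements.txt) and checking whether the error persists.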