-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Closed
Labels
Description
Describe the bug
Upon testing for local-session [sagemaker], single instance, multi-gpu distributed training
It fails at
Input
training_instance_type = 'local_gpu', distributions = {'mpi': {'enabled': True, 'processes_per_host': 4}}
Stack Trace
def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
"""Warn the user that training will not fully leverage all the GPU
cores if parameter server is enabled and a multi-GPU instance is selected.
Distributed training with the default parameter server setup doesn't
support multi-GPU instances.
Args:
training_instance_type (str): A string representing the type of training instance selected.
distributions (dict): A dictionary with information to enable distributed training.
(Defaults to None if distributed training is not enabled.) For example:
.. code:: python
{
'parameter_server':
{
'enabled': True
}
}
"""
if training_instance_type == "local" or distributions is None:
return
is_multi_gpu_instance = (
> training_instance_type.split(".")[1].startswith("p")
and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
)
E IndexError: list index out of range
.tox/py37/lib/python3.7/site-packages/sagemaker/fw_utils.py:620: IndexError
To reproduce
tox -e py37 -- tests/integ/test_horovod_mx.py
- Build custom docker image :
https://github.com/ChaiBapchya/sagemaker-mxnet-training-toolkit/tree/mx_hvd_mpi - custom Sagemaker python SDK build from source : https://github.com/ChaiBapchya/sagemaker-python-sdk/tree/mx_estimator_horovod_mpi
Expected behavior
For running Distributed training on single instance multi-gpu for mpi-based horovod, I encounter this error.
Since I'm using horovod [mpi] this warning isn't relevant.
I suggest we should also add local_gpu here
if training_instance_type in ["local","local_gpu"] or distributions is None:
return
System information
A description of your system. Please provide:
- SageMaker Python SDK version: Build from source [1.62.1.dev0]
- Framework name (eg. PyTorch) or algorithm (eg. KMeans):MXNet
- Framework version:1.6.0
- Python version:3
- CPU or GPU:GPU
- Custom Docker image (Y/N):Y