
local_gpu with distributed training for single instance multi-gpu distributed training #1582

@ChaiBapchya

Describe the bug
When testing single-instance, multi-GPU distributed training with a SageMaker local session, the run fails with an IndexError in warn_if_parameter_server_with_multi_gpu.

Input
training_instance_type = 'local_gpu', distributions = {'mpi': {'enabled': True, 'processes_per_host': 4}}

Stack Trace

    def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
        """Warn the user that training will not fully leverage all the GPU
        cores if parameter server is enabled and a multi-GPU instance is selected.
        Distributed training with the default parameter server setup doesn't
        support multi-GPU instances.

        Args:
            training_instance_type (str): A string representing the type of training instance selected.
            distributions (dict): A dictionary with information to enable distributed training.
                (Defaults to None if distributed training is not enabled.) For example:

                .. code:: python

                    {
                        'parameter_server':
                        {
                            'enabled': True
                        }
                    }


        """
        if training_instance_type == "local" or distributions is None:
            return

        is_multi_gpu_instance = (
>           training_instance_type.split(".")[1].startswith("p")
            and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
        )
E       IndexError: list index out of range

.tox/py37/lib/python3.7/site-packages/sagemaker/fw_utils.py:620: IndexError
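
The failure can be reproduced without the SDK. The snippet below is a minimal sketch of what the traced line does with different instance-type strings; `instance_family` is a hypothetical helper written for this illustration and is not part of sagemaker.

```python
def instance_family(instance_type):
    """Mimic the traced expression: take the second dot-separated field.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    """
    return instance_type.split(".")[1]


# "ml.p3.8xlarge" has three dot-separated fields, so index 1 exists.
print(instance_family("ml.p3.8xlarge"))

# "local_gpu" contains no dot, so split(".") returns a one-element list
# and index 1 is out of range.
try:
    instance_family("local_gpu")
except IndexError as err:
    print("local_gpu ->", type(err).__name__)
```

The string "local" would fail the same way, which is presumably why the function already returns early for it; "local_gpu" is simply not covered by that check.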

To reproduce

tox -e py37 -- tests/integ/test_horovod_mx.py

Expected behavior
Distributed training on a single multi-GPU instance with MPI-based Horovod should run in local mode without hitting this error.
Since I'm using Horovod (MPI) rather than a parameter server, this warning isn't relevant.
I suggest the early return should also cover local_gpu:

    if training_instance_type in ["local", "local_gpu"] or distributions is None:
        return
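
For reference, here is a self-contained sketch of how the function might look with that change applied. It is not the SDK's actual code. The values in `SINGLE_GPU_INSTANCE_TYPES`, the `LOCAL_INSTANCE_TYPES` name, the warning text, and the boolean return value (added only so the behavior is easy to check) are all assumptions. Everything after the `is_multi_gpu_instance` expression is inferred from the docstring, since the stack trace stops there.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: stand-in values; the real constant is defined in sagemaker/fw_utils.py.
SINGLE_GPU_INSTANCE_TYPES = ("ml.p2.xlarge", "ml.p3.2xlarge")
# Hypothetical name for the local-mode instance types that have no "ml.<family>.<size>" form.
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
    """Warn if parameter server is enabled on a multi-GPU instance.

    Returns True if a warning was logged (the return value is an addition
    for this sketch; the traced code does not show one).
    """
    # Proposed fix: return early for both local-mode instance types.
    if training_instance_type in LOCAL_INSTANCE_TYPES or distributions is None:
        return False

    is_multi_gpu_instance = (
        training_instance_type.split(".")[1].startswith("p")
        and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
    )

    # Assumption: the remainder is reconstructed from the docstring only.
    ps_enabled = distributions.get("parameter_server", {}).get("enabled", False)
    if is_multi_gpu_instance and ps_enabled:
        logger.warning(
            "Parameter server training will not use all GPUs on %s.",
            training_instance_type,
        )
        return True
    return False


# The input from this issue no longer raises IndexError.
mpi = {"mpi": {"enabled": True, "processes_per_host": 4}}
print(warn_if_parameter_server_with_multi_gpu("local_gpu", mpi))
```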

System information

  • SageMaker Python SDK version: Built from source [1.62.1.dev0]
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MXNet
  • Framework version: 1.6.0
  • Python version: 3
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): Y
