Enable distributed training over multiple nodes in compliant detonation chamber #93

Open
nifarn opened this issue May 25, 2022 · 1 comment

nifarn commented May 25, 2022

Distributed training currently does not work across multiple nodes in compliant detonation chamber. The cause is a bug in the set_environment_variables_for_nccl_backend function in pymarlin.utils.distributed: in the multi-node case, the master node's address is read from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. This works for signed builds, but AZ_BATCH_MASTER_NODE is disabled in detonation chamber. Can we modify the codebase so that multi-node training works there as well? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.
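
To check which of the two variables is actually available in a given environment, something like the following could be run on each node. This is only a diagnostic sketch: the helper name report_master_node_variables and its environ parameter are hypothetical and not part of pymarlin.

```python
import os

# The two environment variables discussed in this issue.
CANDIDATES = ("AZ_BATCHAI_MPI_MASTER_NODE", "AZ_BATCH_MASTER_NODE")


def report_master_node_variables(environ=os.environ):
    """Map each candidate variable to its value, or None if it is unset.

    Hypothetical helper for debugging; `environ` is injectable so the
    function can be exercised without touching the real environment.
    """
    return {name: environ.get(name) for name in CANDIDATES}


if __name__ == "__main__":
    for name, value in report_master_node_variables().items():
        print(f"{name} = {value if value is not None else '<not set>'}")
```

If AZ_BATCH_MASTER_NODE shows up as unset while AZ_BATCHAI_MPI_MASTER_NODE is present, that would match the behavior described above.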

Given the current implementation of set_environment_variables_for_nccl_backend (shown below), this should be a straightforward change: remove the if statement on single_node so that AZ_BATCHAI_MPI_MASTER_NODE is always used. I have already verified this change in compliant detonation chamber.

def set_environment_variables_for_nccl_backend():
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""

    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )

    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")

        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
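
For illustration, here is a minimal sketch of what the function could look like with the single_node branch removed. It is not necessarily the code in the PR. It assumes AZ_BATCHAI_MPI_MASTER_NODE holds a bare host name or IP address (the single-node branch above already uses it that way), and the port handling is a guess: it keeps the MASTER_PORT override and the "6105" default from the multi-node branch. The environ parameter and the return value are hypothetical additions to make the sketch easy to try out.

```python
import os

# Assumption: reuse the multi-node default port from the snippet above.
DEFAULT_MASTER_PORT = "6105"


def set_environment_variables_for_nccl_backend(environ=os.environ):
    """Sets distributed training environments for azureml openmpi runs with NCCL backend.

    Sketch only: always reads the master address from
    AZ_BATCHAI_MPI_MASTER_NODE, for single-node and multi-node runs alike.
    """
    # NCCL environment, unchanged from the current implementation.
    environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    environ["NCCL_IB_DISABLE"] = "0"  # for IB

    # No single_node check: the same variable is used in every case.
    master_node = environ["AZ_BATCHAI_MPI_MASTER_NODE"]
    # Assumption: honor an existing MASTER_PORT, else fall back to the default.
    master_port = environ.get("MASTER_PORT", DEFAULT_MASTER_PORT)

    # set env variables
    environ["MASTER_ADDR"] = master_node
    environ["MASTER_PORT"] = master_port
    return master_node, master_port
```

Passing a plain dict as environ makes it possible to try the logic outside of an AzureML run, for example set_environment_variables_for_nccl_backend({"AZ_BATCHAI_MPI_MASTER_NODE": "10.0.0.4"}).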

nifarn commented May 25, 2022

Created a PR with the changes mentioned: #94
