Distributed training currently does not work across multiple nodes in the compliant detonation chamber. This is due to a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed: on multi-node runs, the master node's address is taken from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. While this works for signed builds, AZ_BATCH_MASTER_NODE is disabled in the detonation chamber. Can we modify the codebase so that multi-node training works there as well? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.
Given the current implementation of set_environment_variables_for_nccl_backend (below), this should be a straightforward change: remove the if statement on single_node so that the master address is always read from AZ_BATCHAI_MPI_MASTER_NODE. I have already verified this change in the compliant detonation chamber.
import os


def set_environment_variables_for_nccl_backend():
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""
    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB
    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )
    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )
    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
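For discussion, here is a minimal sketch of what the function might look like with the single_node branch removed, so that AZ_BATCH_MASTER_NODE is never read. Two things here are assumptions and not part of the current code: the `env` parameter is a hypothetical addition that only exists so the sketch can be exercised without touching the real process environment, and the port handling (honor MASTER_PORT if set, otherwise fall back to "6105" from the multi-node branch) is one possible choice, since the issue does not say which of the two existing ports should survive.

```python
import os
from typing import MutableMapping


def set_environment_variables_for_nccl_backend(
    env: MutableMapping[str, str] = os.environ,  # hypothetical parameter, for testing only
) -> None:
    """Sketch: always resolve the master address from AZ_BATCHAI_MPI_MASTER_NODE."""
    # NCCL environment, unchanged from the current implementation.
    env["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    env["NCCL_IB_DISABLE"] = "0"  # for IB

    # No single_node branch: AZ_BATCH_MASTER_NODE is never consulted.
    master_node = env["AZ_BATCHAI_MPI_MASTER_NODE"]

    # Assumption: keep the MASTER_PORT override and the "6105" default
    # that the multi-node branch used; the issue does not specify this.
    master_port = env.get("MASTER_PORT", "6105")

    # set env variables
    env["MASTER_ADDR"] = master_node
    env["MASTER_PORT"] = master_port
```

Passing a plain dict shows the intended behavior without a real cluster, e.g. `env = {"AZ_BATCHAI_MPI_MASTER_NODE": "10.0.0.4"}` followed by `set_environment_variables_for_nccl_backend(env)` leaves `env["MASTER_ADDR"]` set to that address. Called with no argument, it writes to os.environ as the current function does.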