[RFC] Separate init_distributed_env from the Trainer.__init__ #2003
This allows people to customize the distributed environment, including ParallelDims and the distributed backend.
parallel_dims: TorchCommsParallelDims
def _create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
def init_distributed_env(self, job_config: JobConfig) -> ParallelDims:
This PR doesn't do anything new with the distributed backend. Are you saying we will modify this code for torchcomms?
Asking because I'm a bit concerned about passing the whole job_config to this function, which seems like a regression in terms of config usage hygiene.
tianyu-l
left a comment
stamp to unblock
This is to unblock the case where people need to subclass the trainer to customize the distributed environment, which includes both TorchCOMM and TorchFT. Unfortunately, initializing the distributed environment requires settings from several different sub-configurations, so no single sub-configuration is enough. That said, I shouldn't pass job_config: this is a method, so it can access job_config through self anyway.
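For illustration, here is a minimal sketch of the pattern discussed above: the trainer calls an overridable init_distributed_env hook from __init__, the hook reads the settings it needs from self.job_config rather than taking job_config as a parameter, and a subclass overrides it. All classes and field names here (JobConfig, ParallelismConfig, CommConfig, ParallelDims, CustomBackendTrainer) are hypothetical stand-ins written for this example; they are not the project's actual definitions, and no real distributed initialization takes place.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the real config and ParallelDims classes.
@dataclass
class ParallelismConfig:
    data_parallel_shard_degree: int = 1
    tensor_parallel_degree: int = 1


@dataclass
class CommConfig:
    backend: str = "nccl"


@dataclass
class JobConfig:
    parallelism: ParallelismConfig = field(default_factory=ParallelismConfig)
    comm: CommConfig = field(default_factory=CommConfig)


@dataclass
class ParallelDims:
    dp_shard: int
    tp: int
    world_size: int
    backend: str


class Trainer:
    def __init__(self, job_config: JobConfig, world_size: int):
        self.job_config = job_config
        self.world_size = world_size
        # The distributed setup lives in its own method so that
        # subclasses can replace it without rewriting __init__.
        self.parallel_dims = self.init_distributed_env()

    def init_distributed_env(self) -> ParallelDims:
        # Reads from several sub-configurations via self.job_config,
        # so nothing needs to be passed in as an argument.
        cfg = self.job_config
        return ParallelDims(
            dp_shard=cfg.parallelism.data_parallel_shard_degree,
            tp=cfg.parallelism.tensor_parallel_degree,
            world_size=self.world_size,
            backend=cfg.comm.backend,
        )


class CustomBackendTrainer(Trainer):
    def init_distributed_env(self) -> ParallelDims:
        # Reuse the default setup, then swap in a different backend name.
        dims = super().init_distributed_env()
        dims.backend = "custom"
        return dims
```

With this shape, a subclass changes only the hook; the rest of Trainer.__init__ is inherited unchanged.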