@fegin fegin commented Nov 7, 2025

This allows people to customize the distributed environment, including ParallelDims and the distributed backend.

@meta-cla meta-cla bot added the CLA Signed label Nov 7, 2025
@fegin fegin marked this pull request as ready for review November 7, 2025 21:25
parallel_dims: TorchCommsParallelDims

def _create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
def init_distributed_env(self, job_config: JobConfig) -> ParallelDims:

This PR doesn't do anything new with the distributed backend. Are you saying we will modify this code for torchcomms?

Asking because I'm a bit concerned about passing the whole job_config to this function, which seems like a regression in terms of config usage hygiene.
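
To illustrate the hygiene concern, here is a minimal sketch with hypothetical names (`ParallelismConfig`, `TrainingConfig`, `build_dims_narrow`, `build_dims_wide`), not torchtitan's actual API. A function that accepts only the sub-configuration it needs declares its dependencies in its signature. One that accepts the whole job config can read any section without the caller knowing.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelismConfig:
    # Hypothetical field, for illustration only.
    tp: int = 2


@dataclass
class TrainingConfig:
    # Hypothetical field, for illustration only.
    steps: int = 100


@dataclass
class JobConfig:
    # Simplified stand-in for a job-level config with several sections.
    parallelism: ParallelismConfig = field(default_factory=ParallelismConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)


def build_dims_narrow(parallelism: ParallelismConfig, world_size: int) -> dict:
    # The signature shows exactly which configuration section is read.
    return {"tp": parallelism.tp, "dp": world_size // parallelism.tp}


def build_dims_wide(job_config: JobConfig, world_size: int) -> dict:
    # Same result, but the function could read job_config.training (or any
    # other section) and the signature would not reveal it.
    tp = job_config.parallelism.tp
    return {"tp": tp, "dp": world_size // tp}
```

Both functions compute the same thing here; the difference is only in how much of the configuration each one is allowed to see.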

@tianyu-l tianyu-l left a comment

stamp to unblock

fegin commented Nov 7, 2025

This is to unblock people who need to subclass the trainer to customize the distributed environment, which covers both TorchCOMM and TorchFT. Unfortunately, it seems that initializing the distributed environment requires settings from several different sub-configurations, so no single sub-configuration is enough. But I shouldn't pass job_config, since this is a method and can access job_config anyway.
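
A minimal sketch of the pattern described here, not the actual torchtitan code: the trainer's `init_distributed_env` takes no config argument and reads what it needs from the config stored on the instance, and a subclass overrides only that method. `JobConfig`, `ParallelDims`, and `Trainer` below are simplified stand-ins. The field names (`dp_shard`, `tp`, `backend`), the `world_size` constructor argument, and `CustomBackendParallelDims` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelismConfig:
    # Hypothetical fields, for illustration only.
    dp_shard: int = 2
    tp: int = 2


@dataclass
class CommConfig:
    # Hypothetical field naming the distributed backend.
    backend: str = "nccl"


@dataclass
class JobConfig:
    parallelism: ParallelismConfig = field(default_factory=ParallelismConfig)
    comm: CommConfig = field(default_factory=CommConfig)


@dataclass
class ParallelDims:
    dp_shard: int
    tp: int
    world_size: int
    backend: str = "nccl"

    def __post_init__(self) -> None:
        # The parallel degrees must multiply out to the world size.
        if self.dp_shard * self.tp != self.world_size:
            raise ValueError("dp_shard * tp must equal world_size")


class CustomBackendParallelDims(ParallelDims):
    """Hypothetical stand-in for a backend-specific ParallelDims subclass."""


class Trainer:
    def __init__(self, job_config: JobConfig, world_size: int) -> None:
        self.job_config = job_config
        self.world_size = world_size
        self.parallel_dims = self.init_distributed_env()

    def init_distributed_env(self) -> ParallelDims:
        # No job_config parameter: the method reads self.job_config, so it
        # can draw on as many sub-configurations as it needs.
        cfg = self.job_config
        return ParallelDims(
            dp_shard=cfg.parallelism.dp_shard,
            tp=cfg.parallelism.tp,
            world_size=self.world_size,
            backend=cfg.comm.backend,
        )


class CustomTrainer(Trainer):
    def init_distributed_env(self) -> ParallelDims:
        # Override point: swap in a different ParallelDims and backend.
        cfg = self.job_config
        return CustomBackendParallelDims(
            dp_shard=cfg.parallelism.dp_shard,
            tp=cfg.parallelism.tp,
            world_size=self.world_size,
            backend="custom",
        )
```

With this shape, `CustomTrainer(JobConfig(), world_size=4)` ends up holding a `CustomBackendParallelDims` without touching the rest of the trainer.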

@fegin fegin merged commit 61c25f8 into main Nov 8, 2025
10 checks passed
@fegin fegin deleted the chienchin/trainer_distributed branch November 8, 2025 05:14
jquesnelle pushed a commit to NousResearch/torchtitan that referenced this pull request Nov 10, 2025
…h#2003)