[Discussion] PyTorch Operator Improvement #1836
Comments
@kuizhiqing Thanks for creating this great issue! Can you create a proposal like https://github.com/kubeflow/katib/tree/master/docs/proposals?
I think this great proposal is worth keeping as documentation.
cc: @kubeflow/wg-training-leads
cc @zw0610
@tenzen-y Thank you for your kind feedback. I appreciate your suggestion to provide a documentation version, which could be a PR. However, I think it would be better to do so after further discussion, once we have reached some agreement on this issue. As you noted, I have proposed some questions to prompt discussion, but I am still refining these ideas to develop a more robust proposal, especially for the elastic training part.
I see. SGTM. I will leave my feedback on this issue later.
Thanks for taking the time to write this detailed proposal. I will review it carefully soon. I have a few early questions.
@johnugeorge Thank you for your comment. Let me try to answer your questions:
Again, arguments can be set equally using environment variables.
Thank you for driving this @kuizhiqing
Does it mean that users can't use PyTorchJob without
I think that this is a breaking change, so I would propose that we create a new v2beta1 PyTorchJob.
At first, we can implement a simple PyTorchJob v2 assuming @kuizhiqing WDYT?
@tenzen-y I am not sure that dropping support for older PyTorch versions in PyTorchJob makes sense for our users.
@andreyvelich @tenzen-y
Overall, I want to unify all training to use torchrun, including regular training and elastic training, though unifying the approaches has been more difficult than anticipated. For now, if we are talking about a short-term, practical solution: since approach 3 is compatible with approach 2, I prefer to make minor changes to support approaches 2 and 3, without a breaking change.
I agree with you.
Overall, it sounds good to me. But I have one more concern and question. @kuizhiqing @andreyvelich @johnugeorge Do you have any ideas?
@tenzen-y @kuizhiqing Can we just verify the container start command and assign the appropriate env variables?
If users define commands as a Dockerfile ENTRYPOINT, that approach doesn't work well. But I don't have a good idea of which env vars we should set when podSpec.containers[0].command is empty.
@tenzen-y I remember that we can extract the Docker Entrypoint using
Thanks for the great suggestion. Maybe we should evaluate the library: which standards does it support? (Docker v1, Docker v2, OCI v1, OCI v2, lazy pulling, etc.)
If I understand correctly, the only change proposed is to change the semantics of how the WORLD_SIZE, RANK, LOCAL_RANK, and LOCAL_SIZE variables are populated.
I think we should add fields for setting
@tenzen-y Yes, you are right, many cloud providers customize their resource declarations. Maybe adding an explicit field is better.
It would be better if such a new field for
@kuizhiqing Could you update the proposal to include new fields for the
I'm OK with adding new fields to PyTorchJob v1 to maintain backward compatibility.
Yes, a new field will help in maintaining backward compatibility while supporting all launchers. nprocs_per_node is set to 1 by default. If this field is set with a value greater than 1, we use the new way of setting WORLD_SIZE and the other parameters; otherwise, we use the current way. The master spec can be an explicit master or one of the pods (e.g., pod 0).
I'm working on a version that tries to make a compatible change as we discussed; this PR is not fully tested in all cases.
@kuizhiqing Curious which DGX box are you using? Is it a 32-DGX cluster with 8 GPUs each? You referred to having TP and DP on the same DGX? Regarding the rank assignment discussion, what are the ideal topology and the expected ranks of the workers given your DGX cluster setup?
@kuizhiqing Does DGX box mean DGX SuperPOD?
@johnugeorge @tenzen-y
@johnugeorge @tenzen-y It implements part of what I propose to do; I'm continuing to work on it.
@kuizhiqing After #1840, do you need more changes with respect to this issue? How do you handle rank assignment in your current deployment?
@johnugeorge With respect to this issue, I want to make a change that makes the master/worker declaration separation optional, as someone has already requested this feature. Otherwise, I will leave the other ideas as long-term improvements.
Do you plan to take this up in a week? The release feature freeze will happen by the end of next week.
@johnugeorge After evaluation, I think the feature making the master declaration optional may not be necessary in the current design, and it can be done in the current elastic mode. Overall, I have no more changes pending before the release. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen |
Motivation
The current PyTorch Operator focuses on a one-process-per-pod architecture, which may not fully utilize the generic design of PyTorch and can underperform. When users adopt `torchrun` as the entrypoint, the operator does not function properly; see #1790.

Background
PyTorch Architecture
For distributed training with GPUs, frameworks like PyTorch and PaddlePaddle use a one-process-per-GPU architecture. They introduce a launch process on each node to manage the GPU-bound processes.
Each process is identified by:

* `world_size`: the total number of GPUs.
* `rank`: the global rank of the process (from 0 to `world_size - 1`).
* `local_size`: the number of GPUs on the node.
* `local_rank`: the rank of the process within the node.

Thanks to argparse_util, these settings can also be passed through environment variables, which have higher priority. For this proposal, we do not distinguish between args and env vars.
Since version 1.9.0 (released in June 2021), PyTorch has provided `torchrun`, which is an alias for `python -m torch.distributed.run`. Compared to `torch.distributed.launch`, `torchrun` provides:

* `rank` and `size` are assigned automatically.

Note that when using `torchrun`, specifying the rank is optional; the rank can be provided with the `--node_rank` argument if desired, but `torchrun` will assign ranks automatically otherwise.

Megatron and Large Language Model Training
3D parallelism is commonly used to train large models in a distributed fashion. For example, a job could use:

- tensor parallelism (TP) across 8 GPUs
- pipeline parallelism (PP) across 4 stages
- data parallelism (DP) across 8 replicas

This requires 8 x 8 x 4 = 256 GPUs across 32 nodes of 8 GPUs each. The model has 8 (TP) x 4 (PP) = 32 partitions, with 8 replicas of the model taking different input data.

Communication overhead is typically TP > DP > PP, so we prefer to place the 8 TP partitions on the same node (DGX box), the 8 DP partitions under the same switch, and PP partitions as close as possible.
The parallel groups (TP, DP, PP) are formed based on the rank of each worker, so the rank of each worker (bound to a GPU) indicates its proximity.
The scheduler assigns resources to optimize proximity. The operator or something else should assign ranks accordingly.
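To make the rank-to-topology relationship concrete, here is an illustrative sketch under the assumption that TP is the innermost (fastest-varying) dimension and PP the outermost; Megatron-LM's actual group construction is more involved, so this is not its real code:

```python
# Illustrative only: one possible rank -> (tp, dp, pp) mapping with TP innermost.
# It shows why contiguous global ranks should map to GPUs on the same node.
TP, DP, PP = 8, 8, 4
WORLD_SIZE = TP * DP * PP  # 256

def coords(rank: int) -> tuple[int, int, int]:
    tp_rank = rank % TP            # varies fastest -> same 8-GPU node
    dp_rank = (rank // TP) % DP    # ideally under the same switch
    pp_rank = rank // (TP * DP)    # outermost dimension, most distant
    return tp_rank, dp_rank, pp_rank

# Ranks 0..7 share TP group 0 and should therefore sit on one DGX box.
for r in range(10):
    print(r, coords(r))
```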
One more thing to note: in performance-critical scenarios, users will typically run pods with the host network for maximum efficiency.
Current Design
The current operator design appears tailored for PyTorch versions before 1.9.0 and favors running without the `torchrun` module; specifically, it calculates `WORLD_SIZE` based on the pod replica count, which is inconsistent with `torchrun`'s methodology.

Proposed Design
The goal of the operator should be to natively support running Megatron-LM examples, such as the one shown here, using a YAML spec like the following:
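The YAML from the original issue was not preserved in this copy; the following is a minimal sketch of what such a spec could look like under the proposal's assumptions (32 nodes of 8 GPUs each, `torchrun` as the entrypoint). The image name, training script, and launcher arguments are placeholders, and whether a separate Master replica is needed is one of the questions discussed below:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: megatron-example            # placeholder name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 32                  # 32 nodes x 8 GPUs = 256 processes
      template:
        spec:
          hostNetwork: true         # often used in performance-critical training
          containers:
            - name: pytorch
              image: my-registry/megatron-lm:latest   # placeholder image
              command:
                - torchrun
                - --nnodes=32
                - --nproc_per_node=8
                # rendezvous flags (e.g. --rdzv_backend/--rdzv_endpoint) omitted for brevity
                - pretrain_gpt.py   # placeholder training script
              resources:
                limits:
                  nvidia.com/gpu: 8
```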
The operator will handle setting the distributed training environment variables for the example. It will:

- set `WORLD_SIZE=256` and `LOCAL_SIZE=8`
Rank assignment remains open for discussion:

- In dynamic mode, ranks are assigned by alphabetically sorted IP, partially optimizing locality (see the sketch after this list)
- In elastic etcd mode, ranks are assigned randomly by join order
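A minimal sketch of the sorted-IP idea mentioned above, assuming the component doing the assignment already knows the pod IPs of all participants; the function and variable names are illustrative and not the operator's actual code:

```python
def assign_node_ranks(pod_ips: list[str]) -> dict[str, int]:
    """Assign node ranks by alphabetically (lexicographically) sorted IP, as in
    the dynamic mode described above. Adjacent IPs often share a node or switch,
    so this partially optimizes locality."""
    ordered = sorted(pod_ips)  # plain string sort; note "10.0.0.12" < "10.0.0.3"
    return {ip: node_rank for node_rank, ip in enumerate(ordered)}

# Example usage with made-up pod IPs:
ranks = assign_node_ranks(["10.0.1.7", "10.0.0.3", "10.0.1.2"])
print(ranks)  # {'10.0.0.3': 0, '10.0.1.2': 1, '10.0.1.7': 2}
```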
For maximum performance, users often implement custom rank assignments before calling `torchrun` or by modifying PyTorch's internal rank assignment logic.

Discussion List
Should pod rank assignment be in the scope of the operator, or handled externally?
There are arguments for either approach. The operator assigning ranks enables optimization but reduces flexibility. External rank assignment is more flexible but may lack optimization.
Is only supporting `torchrun` and elastic mode acceptable, to simplify the operator?

I think yes: focusing the operator on elastic distributed training with `torchrun` encourages a simple, robust design aligned with PyTorch's capabilities. Warnings could notify users if `torchrun` is not used.

Is designating one pod as a master separately necessary for collective training?
The current design is somewhat confusing and technically unnecessary.
Should pod rank be omitted from pod names in elastic mode?
Omitting rank from pod names in elastic mode decouples pod identity from rank, allowing ranks to change dynamically as nodes join or leave the cluster. This flexibility is important for elasticity.
This draft is not yet mature and there are many aspects that require further consideration. Comments and discussion are welcome.