
GPU Middle Class? #2161

Open
EugenHotaj opened this issue Dec 16, 2024 · 3 comments
Labels: discussion, distributed, triaged

Comments

EugenHotaj commented Dec 16, 2024

Does torchtune have any plans to support "GPU middle class" users?

We're trying to evaluate torchtune for post-training, especially since it already implements many useful features (RLHF, LoRA, etc.). However, one big sticking point is that the system seems heavily geared towards single-node training. Are there plans to support multi-node training (e.g. 16-64 nodes) and things like model parallelism, 128k-context training, etc.?

If not, is torchtitan the recommended system to use?

Thanks!

@joecummings added the discussion, distributed, and triaged labels on Dec 16, 2024
@joecummings (Contributor)

Hey @EugenHotaj - glad you're checking out torchtune. Up until now, we've managed to provide a pretty extensive set of offerings -- long context, large models up to 405B, and RLHF -- all on a single node. This has allowed people with smaller GPU budgets to fine-tune some pretty incredible models, and it's let us develop new features faster because single node is much easier to debug.

Now, all that said, torchtune technically already supports multi-node for FSDP. And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!
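
For concreteness, a multi-node run today is basically the standard torchrun rendezvous setup wrapped around one of our distributed recipes. A minimal sketch for two nodes is below -- the recipe and config paths are placeholders, and you'd adjust ports/flags for your cluster:

```bash
# Minimal sketch (not an official example): run the same command on every node.
# torchrun's c10d rendezvous forms a single process group spanning both nodes,
# and the FSDP recipe shards parameters/optimizer state across all 16 ranks.
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "${HEAD_NODE_IP}:29500" \
  recipes/full_finetune_distributed.py \
  --config recipes/configs/llama3_1/70B_full.yaml   # placeholder recipe/config
```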

Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

@EugenHotaj (Author)

> Now, all that said, torchtune technically already supports multi-node for FSDP. And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Thanks @joecummings, that's awesome to hear!

> Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

Yes, we use SLURM -- I'm currently trying to hack together a multi-node run from your suggestions on #2018 and from torchtitan, so having some examples in torchtune would be super useful IMO. We'd also take all the parallelisms we can get 😃, e.g. model, pipeline, and attention parallelism for longer context.
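
For context, what I'm hacking together right now looks roughly like the sbatch wrapper below (the recipe, config, and flags are placeholders and will need tweaking per cluster) -- a blessed version of something like this shipping with torchtune would go a long way:

```bash
#!/bin/bash
#SBATCH --job-name=tune-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8

# Use the first node in the allocation as the rendezvous host.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# srun starts one torchrun per node; torchrun spawns 8 workers on each node,
# and all ranks join one process group via the c10d rendezvous endpoint.
srun torchrun \
  --nnodes "$SLURM_NNODES" \
  --nproc_per_node 8 \
  --rdzv_id "$SLURM_JOB_ID" \
  --rdzv_backend c10d \
  --rdzv_endpoint "${head_node_ip}:29500" \
  recipes/full_finetune_distributed.py \
  --config recipes/configs/llama3_1/70B_full.yaml   # placeholder recipe/config
```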

tginart commented Dec 17, 2024

I second SLURM! I have also been trying to hack this into torchtune since the single-node experience is quite good.
