Replies: 1 comment
-
Have you tried checking for rank == 0 to protect regions of code you need to run only once?
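For example (a rough sketch; `register_experiment_run` is a placeholder for whatever tracking call is currently failing, and the import path differs slightly between Lightning versions):

```python
import os

from pytorch_lightning.utilities import rank_zero_only  # lightning.pytorch.utilities on Lightning 2.x


# Option 1: decorate the function; it becomes a no-op on every process except global rank 0.
@rank_zero_only
def register_experiment_run() -> None:
    ...  # create the run, log parameters, upload artifacts, etc.


# Option 2: guard a block inline. Lightning/DeepSpeed set LOCAL_RANK on each
# spawned process, so on a single node only the main process enters this branch.
if os.environ.get("LOCAL_RANK", "0") == "0":
    print("only the main process runs this block")
```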
-
I'm training a transformer model using PyTorch Lightning with DeepSpeed. I'm currently working on GCP and using a custom training job, so all the code for my experiment lives in a single Python script.
The script is in charge not only of the training, but also of registering the experiment run, logging parameters, metrics, artifacts, etc.
The problem I'm running into is that DeepSpeed not only distributes the training across multiple GPUs, but also runs the entire script multiple times (once per process). When the code that creates an experiment run gets executed more than once, an exception is raised because there cannot be multiple experiment runs with the same name.
When I read about the DeepSpeed strategy, I thought it would only run the `trainer.fit()` part in parallel. Is there any way to have some code in my script that only runs once? The code is something like this:
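(Simplified, with names changed: `create_experiment_run`, `log_params`, and `log_metrics` stand in for the tracking client I'm actually using.)

```python
import pytorch_lightning as pl

from my_project.data import build_dataloaders   # illustrative
from my_project.model import MyTransformer      # illustrative
from my_project.tracking import create_experiment_run, log_params, log_metrics  # placeholders

# This part should run exactly once -- it raises if a run with the same name already exists.
run = create_experiment_run(name="transformer-exp-01")
log_params(run, {"lr": 1e-4, "batch_size": 32})

model = MyTransformer()
train_loader, val_loader = build_dataloaders()

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_2",
)
trainer.fit(model, train_loader, val_loader)

# Also intended to run once, after training finishes.
log_metrics(run, trainer.callback_metrics)
```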
A workaround I'm exploring is to create a wrapper script that registers the experiment, logs parameters, etc., after calling the training script with something like `subprocess.run(["python", "train.py"])`. The wrapper's code seems to get executed only once, but it also makes it harder to get the results of the training back. A rough sketch of the idea:
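(This assumes the training script dumps its final metrics to a JSON file so the wrapper can pick them up, which is exactly the hand-off I'm unsure about; the tracking calls are the same placeholders as above.)

```python
# wrapper.py -- this script itself only runs once; DeepSpeed re-executes train.py per GPU.
import json
import os
import subprocess

from my_project.tracking import create_experiment_run, log_metrics  # placeholders, as above

metrics_path = "metrics.json"
env = {**os.environ, "METRICS_PATH": metrics_path}

# Launch the actual Lightning/DeepSpeed training script.
subprocess.run(["python", "train.py"], env=env, check=True)

# train.py would have to write its final metrics to METRICS_PATH as JSON.
run = create_experiment_run(name="transformer-exp-01")
with open(metrics_path) as f:
    log_metrics(run, json.load(f))
```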
It would be great if someone could help me understand whether what I'm trying to achieve makes sense, or if there are better solutions for this.
Thanks!