Replies: 1 comment
-
Have you tried checking for rank == 0 to protect regions of code you need to run only once?
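For example (a rough sketch; `register_experiment_run` is a placeholder for whatever tracking call is currently failing, and the import path differs slightly between Lightning versions):

```python
import os

from pytorch_lightning.utilities import rank_zero_only  # lightning.pytorch.utilities on Lightning 2.x


# Option 1: decorate the function; it becomes a no-op on every process except global rank 0.
@rank_zero_only
def register_experiment_run() -> None:
    ...  # create the run, log parameters, upload artifacts, etc.


# Option 2: guard a block inline. Lightning/DeepSpeed set LOCAL_RANK on each
# spawned process, so on a single node only the main process enters this branch.
if os.environ.get("LOCAL_RANK", "0") == "0":
    print("only the main process runs this block")
```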
-
I'm training a transformer model using PyTorch Lightning with DeepSpeed. I'm currently working on GCP and using a custom training job, so all the code for my experiment lives in a single Python script.
The script is in charge not only of the training, but also of registering the experiment run, logging parameters, metrics, artifacts, etc.
The problem I'm running into is that DeepSpeed not only distributes the training across multiple GPUs, but also runs the entire script multiple times (once per process). When the code that creates an experiment run gets executed more than once, an exception is raised because there cannot be multiple experiment runs with the same name.
When I read about the DeepSpeed strategy, I thought it would only run the `trainer.fit()` part in parallel. Is there any way to have some code in my script that only runs once? The code is something like this:
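(Simplified, with names changed: `create_experiment_run`, `log_params`, and `log_metrics` stand in for the tracking client I'm actually using.)

```python
import pytorch_lightning as pl

from my_project.data import build_dataloaders   # illustrative
from my_project.model import MyTransformer      # illustrative
from my_project.tracking import create_experiment_run, log_params, log_metrics  # placeholders

# This part should run exactly once -- it raises if a run with the same name already exists.
run = create_experiment_run(name="transformer-exp-01")
log_params(run, {"lr": 1e-4, "batch_size": 32})

model = MyTransformer()
train_loader, val_loader = build_dataloaders()

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_2",
)
trainer.fit(model, train_loader, val_loader)

# Also intended to run once, after training finishes.
log_metrics(run, trainer.callback_metrics)
```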
A workaround I'm exploring is to create a wrapper script that registers the experiment, logs parameters, etc., after calling the training script with something like `subprocess.run(["python", "train.py"])`. The wrapper's code seems to get executed only once, but it also makes it harder to get the results of the training back. A rough sketch of the idea:
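(This assumes the training script dumps its final metrics to a JSON file so the wrapper can pick them up, which is exactly the hand-off I'm unsure about; the tracking calls are the same placeholders as above.)

```python
# wrapper.py -- this script itself only runs once; DeepSpeed re-executes train.py per GPU.
import json
import os
import subprocess

from my_project.tracking import create_experiment_run, log_metrics  # placeholders, as above

metrics_path = "metrics.json"
env = {**os.environ, "METRICS_PATH": metrics_path}

# Launch the actual Lightning/DeepSpeed training script.
subprocess.run(["python", "train.py"], env=env, check=True)

# train.py would have to write its final metrics to METRICS_PATH as JSON.
run = create_experiment_run(name="transformer-exp-01")
with open(metrics_path) as f:
    log_metrics(run, json.load(f))
```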
It would be great if someone could help me understand whether what I'm trying to achieve makes sense, or if there are better solutions for this.
Thanks!