fixing TPU tests #2632
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #2632    +/-   ##
=========================================
  Coverage      91%      91%
=========================================
  Files          82       82
  Lines        6770     6784     +14
=========================================
+ Hits         6127     6151     +24
+ Misses        643      633     -10
it seems that the first argument of
it is interesting as both are on the same device before...
similar to pytorch/xla#894 @dlibenzi do you have advice on what I am doing wrong?
Explanation of why the error is raised when XLA tensors are lazy: CPU and CUDA tensors launch operations immediately, or eagerly. XLA tensors, on the other hand, are lazy. They record operations in a graph until the results are needed. Deferring execution like this lets XLA optimize it, even with
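To illustrate the lazy-execution point above, here is a minimal, hypothetical sketch (not code from this PR) of how XLA tensors defer work until a result is actually needed; xm.mark_step() is the torch_xla call that cuts the recorded graph and executes it.

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()            # lazy XLA device (a TPU core)

a = torch.randn(2, 2, device=device)
b = torch.randn(2, 2, device=device)
c = a @ b + b                       # only recorded in the lazy graph, nothing runs yet

xm.mark_step()                      # cut the graph and execute it on the TPU
print(c.cpu())                      # moving to CPU also forces materialization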
As I explained countless times, you cannot do stuff like this: you should get rid of those TPU IDs, as if you could control which device ordinal you get within a process. You cannot. If you want some information about which ordinal the calling process has within the distributed system, you can use the
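For reference, a hedged sketch of the torch_xla calls that report ordinals, which is presumably what the elided suggestion above points at (the exact call did not survive in this thread):

import torch_xla.core.xla_model as xm

def report_position():
    # ordinal of the calling process within the whole distributed system
    global_rank = xm.get_ordinal()
    # ordinal of the calling process on the local host
    local_rank = xm.get_local_ordinal()
    # total number of participating devices
    world_size = xm.xrt_world_size()
    print(f"global={global_rank} local={local_rank} world={world_size}")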
@davidel I am going to change the TPU id in the next step, but now I am getting this error in
# call setup after the ddp process has connected
self.setup('fit')
if self.is_function_implemented('setup', model):
    model.setup('fit')
# put model on tpu
self._device = xm.xla_device(self.tpu_id) if self.tpu_id is not None else xm.xla_device()
xm.get_ordinal()
# TODO, wrong definition of TPU index
this will be changed in the next step of corrections... cc: @davidel
@davidel @williamFalcon When a TPU ID is provided, the training does not run in multi-processing mode, i.e.
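For context, a hedged sketch of the torch_xla multi-processing entry point that gets bypassed when a single TPU ID is pinned; the function name train_fn is illustrative, not taken from this PR.

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def train_fn(index):
    # each of the 8 spawned processes gets its own TPU core
    device = xm.xla_device()
    print(f"process {index} running on {device}")

if __name__ == '__main__':
    # multi-processing mode: one process per TPU core
    xmp.spawn(train_fn, args=(), nprocs=8, start_method='fork')

# by contrast, passing an explicit TPU ID pins a single core and the code
# runs in a single process, e.g. device = xm.xla_device(n=1)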
I will be flying today. Will get back to this once on the other side of the pond 😄
I had some follow-up questions:
@Borda I see this as the latest error in the GitHub Actions run for the 8-core test:
I talked with Davide about this last week and here is an explanation of what is happening:
Here are 2 examples of defining the dataloader inside the
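Since the referenced examples did not survive in this thread, here is a hedged sketch of the common pattern of building the dataloader inside the per-process function, sharding it with a DistributedSampler keyed on the XLA ordinal; MyDataset and the helper name are placeholders.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def make_loader(dataset, batch_size):
    # build the loader inside the spawned process so each core
    # gets its own shard of the data
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True,
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    # wrap it so batches are fed onto the TPU core
    device = xm.xla_device()
    return pl.ParallelLoader(loader, [device]).per_device_loader(device)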
yes, the point is to speed up training
I agree, the trainer is training just one model, so the case with selecting a core was about multiple users being able to train their own model/trainer and share one physical device
in such a case, it is very similar to DDP when we run a python script on each GPU separately, right?
I'm not familiar with DDP but I think generally this is similar to the multi-GPU case. Each TPU core is running its own copy of the same code, then after each backprop we synchronize the gradients using
See here for
This means that each TPU core should end up with the same weights, and it means that the TPU cores are working together to make training progress rather than having independent copies of your model training on each core. Note that we are sharing gradients, but each process instantiates its own weights. One pitfall is if each process ends up with different initial weights. It's important to set the same seed between processes so that the initial weights are the same, like we do here in our example. Note that we set the seed immediately. This is important since any code that runs before setting the seed can change the RNG state for that process and lead to different weights per process, which can lead to e.g. the model failing to converge.
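A hedged sketch of the two points above as they usually look in torch_xla code: the seed is set before the model is built, and xm.optimizer_step() performs the gradient all-reduce after backprop. The model and loader names are placeholders, not the example referred to in the comment.

import torch
import torch_xla.core.xla_model as xm

def train_loop(index, make_model, loader, lr=1e-3):
    # set the seed first, before any other code can touch the RNG state,
    # so every process builds identical initial weights
    torch.manual_seed(1234)

    device = xm.xla_device()
    model = make_model().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    for batch, target in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(batch), target)
        loss.backward()
        # all-reduces the gradients across TPU cores, then steps the optimizer
        xm.optimizer_step(optimizer)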
yes, this is exactly how ddp works and what we are set up to do. This is how we set up the training.
Yes, an independent instance of a model on each core with a different fold.
The use case is K-fold training. The user can train K models simultaneously, with a different fold on every core.
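A hedged sketch of how such per-core K-fold training could look if each spawned process simply picked its fold from its ordinal; this illustrates the use case and is not something implemented in this PR.

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def train_fold(index, folds):
    # each core trains an independent model on its own fold;
    # no gradient sharing between cores in this setup
    fold = folds[xm.get_ordinal() % len(folds)]
    device = xm.xla_device()
    print(f"core {xm.get_ordinal()} training fold {fold} on {device}")
    # ... build the model and loaders for `fold`, run the usual training loop ...

if __name__ == '__main__':
    folds = list(range(8))
    xmp.spawn(train_fold, args=(folds,), nprocs=8, start_method='fork')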
I see. I think we can make this work. I'm assuming you would also want to write out 8 different model weights. For your normal 8-core case, you'll want to use
If you want every core to write its weights, you'll want to call
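For reference, a hedged sketch of the torch_xla saving behaviour being discussed: by default xm.save() writes only from the master ordinal, while passing master_only=False together with per-ordinal paths makes every core write its own weights. The file paths below are placeholders.

import torch_xla.core.xla_model as xm

def save_weights(model, kfold_mode=False):
    if not kfold_mode:
        # normal 8-core data-parallel case: all cores hold the same weights,
        # so only the master ordinal writes the checkpoint
        xm.save(model.state_dict(), 'weights.pt')
    else:
        # K-fold case: every core has a different model, so every core writes
        # its own file; master_only=False makes all ordinals save
        path = f'weights_fold_{xm.get_ordinal()}.pt'
        xm.save(model.state_dict(), path, master_only=False)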
I'm not sure if
If it does not run as multi-core it might be OK ... but honestly, specifying the core ID makes no sense from an API standpoint.
checkpoint loading, maybe similar to #2700
EDIT: it seems that no checkpoint is saved during training
Co-authored-by: Ethan Harris <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
@williamFalcon let's fix the master... :]
What does this PR do?
Fixes #2124
Fixes #1956
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃