Refactor optimizer step logic #163

Merged: 1 commit into Denys88:master on May 23, 2022

Conversation

@vwxyzjn (Contributor) commented May 23, 2022

The current implementation is problematic when gradient truncation is not used. The following file diff shows the lines that would need to be added:

     if self.truncate_grads:
         if self.multi_gpu:
             self.optimizer.synchronize()
             self.scaler.unscale_(self.optimizer)
             nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)
             with self.optimizer.skip_synchronize():
                 self.scaler.step(self.optimizer)
                 self.scaler.update()
         else:
             self.scaler.unscale_(self.optimizer)
             nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)
             self.scaler.step(self.optimizer)
             self.scaler.update()
     else:
-        self.scaler.step(self.optimizer)
-        self.scaler.update()
+        if self.multi_gpu:
+            with self.optimizer.skip_synchronize():
+                self.scaler.step(self.optimizer)
+                self.scaler.update()
+        else:
+            self.scaler.step(self.optimizer)
+            self.scaler.update()

But this is pretty ugly... so I refactored it to:

        if self.multi_gpu:
            self.optimizer.synchronize()
        
        if self.truncate_grads:
            self.scaler.unscale_(self.optimizer)
            nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)

        if self.multi_gpu:
            with self.optimizer.skip_synchronize():
                self.scaler.step(self.optimizer)
                self.scaler.update()
        else:
            self.scaler.step(self.optimizer)
            self.scaler.update()
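For context, a minimal sketch of where this refactored block could sit inside a full training step, assuming a standard PyTorch AMP + Horovod setup. The `train_step` wrapper, the `compute_loss` helper, and the `mixed_precision` flag are illustrative assumptions, not rl_games code; the step logic itself matches the snippet above.

    import torch
    import torch.nn as nn

    # Sketch of a method on the agent class.
    def train_step(self, batch):
        self.optimizer.zero_grad()
        # Forward/backward under autocast; `compute_loss` is a hypothetical helper.
        with torch.cuda.amp.autocast(enabled=self.mixed_precision):
            loss = self.compute_loss(batch)
        self.scaler.scale(loss).backward()

        # Refactored optimizer step logic from this PR:
        if self.multi_gpu:
            self.optimizer.synchronize()  # Horovod gradient all-reduce

        if self.truncate_grads:
            self.scaler.unscale_(self.optimizer)
            nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)

        if self.multi_gpu:
            # Gradients were already synchronized above, so skip Horovod's
            # implicit synchronization inside optimizer.step().
            with self.optimizer.skip_synchronize():
                self.scaler.step(self.optimizer)
                self.scaler.update()
        else:
            self.scaler.step(self.optimizer)
            self.scaler.update()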

@vwxyzjn mentioned this pull request on May 23, 2022
@Denys88 (Owner) commented May 23, 2022

Yeah, I even had a comment about it in the code :)
Maybe let's remove mixed precision too? We can get rid of the scaler. Anyway, it doesn't help.
Do you want to apply this change before or after the Horovod replacement?
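For reference, a rough sketch (not code from this repo) of what the step logic might reduce to if the scaler and mixed precision were dropped, as suggested above; attribute names follow the snippets in the PR description.

    if self.multi_gpu:
        self.optimizer.synchronize()  # Horovod gradient all-reduce

    if self.truncate_grads:
        # Without a GradScaler the gradients are already unscaled; just clip them.
        nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)

    if self.multi_gpu:
        with self.optimizer.skip_synchronize():
            self.optimizer.step()
    else:
        self.optimizer.step()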

@vwxyzjn (Contributor, Author) commented May 23, 2022

Maybe it's better to merge this PR as is? We can then remove mixed precision in a separate PR, since that involves a docs change and could impact downstream projects such as isaacgymenvs.

@Denys88 (Owner) commented May 23, 2022

Okay! I'll review all the other PRs after work in the evening.

Denys88 merged commit 86f5e82 into Denys88:master on May 23, 2022
@vwxyzjn (Contributor, Author) commented May 23, 2022

Also pinging @markelsanz14, are you ok with me filing a PR removing mixed-precision support? I know you were thinking about using it in the future.

@markelsanz14 commented

@Denys88, do we know how much removing mixed precision will affect different model sizes? I agree that in most cases it won't help, mainly when we have 2-5 layer NNs, but it might speed up training with larger NNs.

@vwxyzjn (Contributor, Author) commented May 23, 2022

Hey @Denys88, @ViktorM, @markelsanz14 and I had a quick chat about this, and maybe it's worth keeping the mixed-precision option? I did a quick benchmark and found virtually no performance difference:

[benchmark screenshot]

However, as @markelsanz14 suggested, mixed precision could offer benefits for large NNs.
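For context, the generic torch.cuda.amp pattern (not rl_games code; `model`, `optimizer`, `loader`, `use_amp`, and `compute_loss` are placeholders). The speedup from mixed precision comes from running the forward/backward matmuls in float16, which tends to matter more for wide or deep networks than for the 2-5 layer NNs mentioned above.

    import torch

    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    for batch in loader:
        optimizer.zero_grad()
        # Run the forward pass in float16 where safe; autocast is a no-op when disabled.
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = compute_loss(model, batch)  # placeholder loss computation
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # behaves like a plain optimizer.step() when AMP is disabled
        scaler.update()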

@markelsanz14 commented

Yeah, I wouldn't put more work into it for now, but removing it altogether seems a bit drastic. I feel like it can be useful in the future with larger NN architectures.

@vwxyzjn (Contributor, Author) commented May 23, 2022

More data on this: removing the scaler seems to make it less than 5% faster (178 seconds vs. 186 seconds), so it could just be a bit of randomness.

[benchmark screenshot]

@Denys88 (Owner) commented May 23, 2022

Could you check it on some heavy networks, including transformers?
I have my IG fork https://github.com/Denys88/IsaacGymEnvs with some fat neural networks, like the OpenAI embedding from the paper or just a transformer:
https://github.com/Denys88/IsaacGymEnvs/blob/main/isaacgymenvs/cfg/train/AntPPO_tr.yaml
