Deprecates AdamW and adds --optim
#14744
Conversation
Thanks for implementing this. I left some nits and there is one line missing which shows we need a test of this new feature :-)
I'm also wondering if we shouldn't create an Enum for all optimizer values that are supported?
Co-authored-by: Sylvain Gugger <[email protected]>
@sgugger Thank you for the suggestions. I think an enum makes sense. I will add that and fix the missing test.
Thank you for working on it, @manuelciosici
I made a few proposals about consistency
Please let me know if it's not clear.
(I also edited/removed my earlier comments in which I thought the actual Adafactor optimizer was being deprecated; I had misread the code.)
src/transformers/trainer.py (outdated)

```diff
         ]
-        optimizer_cls = Adafactor if self.args.adafactor else AdamW
-        if self.args.adafactor:
+        if self.args.adafactor and self.args.optim not in {"adamw_hf", "adafactor"}:
```
This logic is unnecessarily complicated, IMHO:
- deprecate `--adafactor` in favor of `--optim adafactor`
- if `--adafactor` is passed, set `--optim adafactor`

Now you no longer need to ever consider `self.args.adafactor` other than to deprecate it.
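For illustration, a minimal sketch of the suggested handling; the function name and exact warning wording are assumptions, not the PR's code:

```python
# Sketch only: fold the deprecated --adafactor flag into --optim so the rest of
# the Trainer only ever has to look at args.optim.
import warnings


def resolve_optim(args):
    """Map the deprecated `adafactor` flag onto `optim` and return the final choice."""
    if args.adafactor:
        warnings.warn(
            "`--adafactor` is deprecated and will be removed in version 5 of Transformers. "
            "Use `--optim adafactor` instead.",
            FutureWarning,
        )
        args.optim = "adafactor"
    return args.optim
```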
@manuelciosici, have we lost you or are you just on an extended vacation? Let's finish this up, so we can test out various optimizers. I will sync the recent changes in doc formatting. I was just told that ...
Update: see updated benchmarks here. I'm working on a neat HF Trainer benchmarking tool, so here it is applied to the changes introduced by this PR:

Torch's AdamW appears to be even slower than ours, so clearly apex's AdamW is the way to go speed-wise. Note that the absolute and relative results will be different on a different GPU and a different finetuning setup, but most likely the current fastest optimizer will remain the fastest. Reproducibility and other info: ...
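For context, a rough way to compare raw step speed of the two AdamW implementations on dummy tensors; this is not the benchmarking tool referenced above, and CPU-only timings on random data will differ from a real GPU finetuning run:

```python
# Crude timing sketch: torch.optim.AdamW vs. transformers' AdamW on dummy parameters.
import time

import torch
from torch.optim import AdamW as TorchAdamW
from transformers.optimization import AdamW as HFAdamW


def time_optimizer(opt_cls, steps=50):
    params = [torch.nn.Parameter(torch.randn(1024, 1024)) for _ in range(8)]
    opt = opt_cls(params, lr=1e-3)
    start = time.perf_counter()
    for _ in range(steps):
        for p in params:
            p.grad = torch.randn_like(p)  # fake gradients; we only care about step() cost
        opt.step()
        opt.zero_grad()
    return time.perf_counter() - start


print(f"torch AdamW : {time_optimizer(TorchAdamW):.2f}s")
print(f"HF AdamW    : {time_optimizer(HFAdamW):.2f}s")
```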
Co-authored-by: Stas Bekman <[email protected]>
(force-pushed from 4d59fd5 to d599a38)
@stas00 You have not lost me. I've been on vacation, and before that I accidentally messed up my GitHub notification settings, so I didn't know you had also reviewed the code. I applied all the suggestions from you. I also changed to an Enum. Let me know if I should change anything else.
@stas00 Let me know if I should squash all the commits into a single commit so they don't pollute the history.
@stas00 Thank you for the code cleanup, figuring out the mocking issue, and for your patience. This PR has been an educational experience (I didn't know about ...).
Since you're clearly interested in easier testing, you may find some useful tidbits in this extensive doc: https://huggingface.co/docs/transformers/testing, e.g. for different parameterization situations: https://huggingface.co/docs/transformers/testing#parametrization

Optimizer-wise, I think the next interesting but challenging thing to add is the BNB 8-bit optimizer. The other thing to potentially experiment with is ...
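The BNB 8-bit optimizer mentioned above is not part of this PR; trying it standalone looks roughly like this sketch (assumes `pip install bitsandbytes` and a CUDA GPU):

```python
# Sketch: bitsandbytes' 8-bit Adam used directly, outside of the Trainer.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

for _ in range(10):
    out = model(torch.randn(16, 1024, device="cuda"))
    loss = out.float().pow(2).mean()  # dummy loss just to produce gradients
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```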
sgugger left a comment
Thanks for waiting for me! This looks great (nice new tests!) but I have a few last comments.
src/transformers/trainer.py (outdated)

```python
if args.optim == OptimizerNames.ADAFACTOR.value:
    optimizer_cls = Adafactor
    optimizer_kwargs.update({"scale_parameter": False, "relative_step": False})
elif args.optim == OptimizerNames.ADAM_HF.value:
    from .optimization import AdamW

    optimizer_cls = AdamW
    optimizer_kwargs.update(adam_kwargs)
elif args.optim == OptimizerNames.ADAM_TORCH.value:
    from torch.optim import AdamW

    optimizer_cls = AdamW
    optimizer_kwargs.update(adam_kwargs)
elif args.optim == OptimizerNames.ADAM_APEX_FUSED.value:
```
This is a very odd use of an enum. The `args.optim` attribute should be converted to the enum type (you just need to give it the type `OptimizerNames` instead of `str`; it will then accept both `str` and `OptimizerNames` values and do the conversion in the post init), and those tests should have all the `.value` removed.
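To illustrate the suggested pattern, here is a simplified sketch (a plain `str`-backed `Enum` stands in for the library's enum helper, and the member names follow the `adamw_*` naming settled on later in the thread):

```python
# Simplified sketch: type the field as the enum, convert in __post_init__,
# and compare against enum members without `.value`.
from dataclasses import dataclass
from enum import Enum


class OptimizerNames(str, Enum):
    ADAMW_HF = "adamw_hf"
    ADAMW_TORCH = "adamw_torch"
    ADAMW_APEX_FUSED = "adamw_apex_fused"
    ADAFACTOR = "adafactor"


@dataclass
class ArgsSketch:
    optim: OptimizerNames = "adamw_hf"  # a plain string default is accepted too

    def __post_init__(self):
        # accepts either a string value or an OptimizerNames member
        self.optim = OptimizerNames(self.optim)


args = ArgsSketch(optim="adafactor")
assert args.optim == OptimizerNames.ADAFACTOR  # no `.value` comparison needed
```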
Thank you. I've updated everything.
Co-authored-by: Sylvain Gugger <[email protected]>
I think the only other remaining item that I didn't hear you weigh in on, @sgugger, is whether we should call these `adam_*` or `adamw_*`.
Oh sorry, I didn't catch that. It should be adamw everywhere IMO.

Thank you for validating that, @sgugger. @manuelciosici, so one more tiny change please. Thank you!
@stas00 I removed ... Let me know if I should change anything else.
But the rest needs to be updated to match, e.g. currently many torch tests fail with: ...
@stas00 I just saw that. I'm trying to figure out what I misunderstood.
@stas00 I'm surprised that fixes it. On my end, I just fixed it by adding ...
It's the same as: ... I'm not sure if you can see the failing tests from CI, so I thought it'd be faster to push in the fix as I saw where it was failing. Are you still planning to push a change? You said you'd be removing a unit test. Let us know when you're done.
@stas00 Commit e73249c makes the unit tests pass, but it doesn't work when the arguments are parsed from the command line.

So, calling a script on the command line (for example, ...) fails. I added another test for specifying the optimizer name as a string. Also, I removed ...

I also changed the default value from `OptimizerNames.ADAMW_HF` to `"adamw_hf"`. While both work, the first one leaks the internal object name, while the second indicates the string we want users to pass.

Finally, I made ...
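A small illustration of the default-value point, using plain `argparse` rather than `HfArgumentParser` (names are assumed; the exact help output depends on the Python version):

```python
# Why a string default reads better in --help than an enum member default.
import argparse
from enum import Enum


class OptimizerNames(str, Enum):
    ADAMW_HF = "adamw_hf"
    ADAMW_TORCH = "adamw_torch"


parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--optim-enum-default", default=OptimizerNames.ADAMW_HF)
parser.add_argument("--optim-str-default", default="adamw_hf")
parser.print_help()
# The first option's help typically shows "default: OptimizerNames.ADAMW_HF"
# (the internal object name); the second shows "default: adamw_hf" (what users type).
```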
sgugger left a comment
All looking great on my side! @stas00 I'll let you merge if you're happy as well.
Thanks again for all your work on this @manuelciosici !
I second that - thank you, @manuelciosici!
What does this PR do?
Fixes #14539
Who can review?
@stas00
Misc
@stas00
- I added `FusedAdam` based on your comment in Two bugs in AdamW #14539 (see the sketch below).
- Since `--optim adafactor` and `--adafactor` serve the same purpose, I marked `--adafactor` as deprecated. I copy-pasted a deprecation warning that mentions `transformers` version 5. Let me know if the deprecation warning should say something else.
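For reference, the apex path mentioned in the first bullet boils down to something like this sketch (assumes NVIDIA apex is installed; the exact error message in the PR may differ):

```python
# Sketch of selecting apex's fused AdamW implementation for --optim adamw_apex_fused.
def get_fused_adam():
    try:
        from apex.optimizers import FusedAdam
    except ImportError as err:
        raise ValueError(
            "Trainer tried to instantiate apex FusedAdam but apex is not installed."
        ) from err
    return FusedAdam
```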