-
Notifications
You must be signed in to change notification settings - Fork 481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Z-loss calculation #634
Conversation
It should be average over tokens, not instances. 7146473 introduced this bug. See https://github.com/mlfoundations/open_lm/blob/main/open_lm/losses.py for a reference implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code itself looks fine.
I did not reconstruct the sequence of events here. It seems noteworthy that num_micro_batches
is no longer needed now, but it was needed before, and that parameter wasn't introduced in the change that caused the bug.
Can we add a test for this being correct? |
@dirkgr we've been YOLO-ing the trainer so far with 0 test coverage. This is bad, but setting up the boilerplate for proper trainer tests is a project in and of itself. My vision is that eventually we move and improve the trainer to OLMo-core, at which point we could add thorough tests. |
Hm, fair enough, maybe a bigger discussion. I'd rather merge OLMo-core and OLMo, and separately, have more things be functional rather than object oriented. Because of the objects everywhere, I had to copy and paste so many things in https://github.com/allenai/OLMo/blob/FindHighGnormInstances/scripts/find_high_gnorm_instances.py. But a lot of that stuff doesn't need to be in an object. And if it's functional, then it can be tested in isolation. But that's not in scope for this fix. |
Based on my experience with the code bugs and the lack of tests lately, I would rather we add tests now as and when we encounter things, instead of waiting to find time to write perfect boilerplate for tests. I think the goal should be to have a place where we at least record the things that require testing, that have failed accidentally without tests. We can always go back and refactor with better testing code. |
@2015aroras uncovered a recent bug with how we're calculating Z-loss. It should be average over tokens, not instances.
7146473 introduced this bug.
See https://github.com/mlfoundations/open_lm/blob/main/open_lm/losses.py for a reference implementation.