
Fix Z-loss calculation #634

Merged
merged 1 commit into main from epwalsh/fix-z-loss
Jun 26, 2024
Conversation

@epwalsh (Member) commented Jun 26, 2024

@2015aroras uncovered a recent bug in how we're calculating Z-loss: it should be averaged over tokens, not instances. 7146473 introduced this bug.

See https://github.com/mlfoundations/open_lm/blob/main/open_lm/losses.py for a reference implementation.

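To illustrate the distinction the PR description draws, here is a minimal sketch (hypothetical helper functions, not the PR's or open_lm's actual code) contrasting the correct token-averaged Z-loss with the buggy instance-averaged variant. Z-loss penalizes the squared log-partition, log(Z)², of the softmax at each token; the fix is to average that penalty over all tokens in the batch rather than averaging within each instance first.

```python
import math

def z_loss_per_token(batch_logits, multiplier=1e-4):
    """Correct: average the squared log-partition over every token in the batch."""
    sq_log_z = []
    for sequence in batch_logits:          # one instance (sequence) at a time
        for token_logits in sequence:      # one logit vector per token
            log_z = math.log(sum(math.exp(x) for x in token_logits))
            sq_log_z.append(log_z ** 2)
    return multiplier * sum(sq_log_z) / len(sq_log_z)

def z_loss_per_instance(batch_logits, multiplier=1e-4):
    """Buggy variant: average within each instance, then across instances."""
    per_instance = []
    for sequence in batch_logits:
        sq = [math.log(sum(math.exp(x) for x in t)) ** 2 for t in sequence]
        per_instance.append(sum(sq) / len(sq))
    return multiplier * sum(per_instance) / len(per_instance)
```

The two only agree when every instance contributes the same number of loss tokens; with variable-length sequences (e.g. label masking or padding), instance averaging over-weights tokens in short sequences.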
@dirkgr dirkgr left a comment


Code itself looks fine.

I did not reconstruct the sequence of events here, but it seems noteworthy that num_micro_batches is no longer needed now, even though it was needed before, and that parameter wasn't introduced in the change that caused the bug.

@dirkgr (Member) commented Jun 26, 2024

Can we add a test for this being correct?

@epwalsh (Member, Author) commented Jun 26, 2024

> Can we add a test for this being correct?

@dirkgr we've been YOLO-ing the trainer so far with 0 test coverage. This is bad, but setting up the boilerplate for proper trainer tests is a project in and of itself. My vision is that eventually we move and improve the trainer to OLMo-core, at which point we could add thorough tests.

@dirkgr (Member) commented Jun 26, 2024

Hm, fair enough, maybe a bigger discussion. I'd rather merge OLMo-core and OLMo and, separately, have more things be functional rather than object-oriented.

Because of the objects everywhere, I had to copy and paste so many things in https://github.com/allenai/OLMo/blob/FindHighGnormInstances/scripts/find_high_gnorm_instances.py. But a lot of that stuff doesn't need to be in an object.

And if it's functional, then it can be tested in isolation.

But that's not in scope for this fix.

@epwalsh epwalsh merged commit d7994c8 into main Jun 26, 2024
11 of 12 checks passed
@epwalsh epwalsh deleted the epwalsh/fix-z-loss branch June 26, 2024 20:45
@AkshitaB (Contributor) commented

> > Can we add a test for this being correct?
>
> @dirkgr we've been YOLO-ing the trainer so far with 0 test coverage. This is bad, but setting up the boilerplate for proper trainer tests is a project in and of itself. My vision is that eventually we move and improve the trainer to OLMo-core, at which point we could add thorough tests.

Based on my recent experience with the code bugs and the lack of tests, I would rather we add tests now, as and when we encounter things, instead of waiting until we find time to write perfect test boilerplate. At minimum, the goal should be a place where we record the things that require testing and that have failed accidentally for lack of tests. We can always go back and refactor with better testing code.

4 participants