
Fix Z-loss calculation #634

Merged
merged 1 commit into main from epwalsh/fix-z-loss
Jun 26, 2024
Conversation

@epwalsh (Member) commented Jun 26, 2024

@2015aroras uncovered a recent bug in how we're calculating Z-loss: it should be averaged over tokens, not instances. 7146473 introduced this bug.

See https://github.com/mlfoundations/open_lm/blob/main/open_lm/losses.py for a reference implementation.

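To illustrate the distinction the PR description draws, here is a minimal sketch (hypothetical helper functions, not the PR's or open_lm's actual code) contrasting the correct token-averaged Z-loss with the buggy instance-averaged variant. Z-loss penalizes the squared log-partition, log(Z)², of the softmax at each token; the fix is to average that penalty over all tokens in the batch rather than averaging within each instance first.

```python
import math

def z_loss_per_token(batch_logits, multiplier=1e-4):
    """Correct: average the squared log-partition over every token in the batch."""
    sq_log_z = []
    for sequence in batch_logits:          # one instance (sequence) at a time
        for token_logits in sequence:      # one logit vector per token
            log_z = math.log(sum(math.exp(x) for x in token_logits))
            sq_log_z.append(log_z ** 2)
    return multiplier * sum(sq_log_z) / len(sq_log_z)

def z_loss_per_instance(batch_logits, multiplier=1e-4):
    """Buggy variant: average within each instance, then across instances."""
    per_instance = []
    for sequence in batch_logits:
        sq = [math.log(sum(math.exp(x) for x in t)) ** 2 for t in sequence]
        per_instance.append(sum(sq) / len(sq))
    return multiplier * sum(per_instance) / len(per_instance)
```

The two only agree when every instance contributes the same number of loss tokens; with variable-length sequences (e.g. label masking or padding), instance averaging over-weights tokens in short sequences.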
@dirkgr dirkgr left a comment


Code itself looks fine.

I did not reconstruct the sequence of events here, but it seems noteworthy that num_micro_batches is no longer needed now, even though it was needed before, and that parameter wasn't introduced in the change that caused the bug.

@dirkgr (Member) commented Jun 26, 2024

Can we add a test for this being correct?

@epwalsh (Member, Author) commented Jun 26, 2024

> Can we add a test for this being correct?

@dirkgr we've been YOLO-ing the trainer so far with 0 test coverage. This is bad, but setting up the boilerplate for proper trainer tests is a project in and of itself. My vision is that eventually we move and improve the trainer to OLMo-core, at which point we could add thorough tests.

@dirkgr (Member) commented Jun 26, 2024

Hm, fair enough, maybe a bigger discussion. I'd rather merge OLMo-core and OLMo and, separately, have more things be functional rather than object-oriented.

Because of the objects everywhere, I had to copy and paste so many things in https://github.com/allenai/OLMo/blob/FindHighGnormInstances/scripts/find_high_gnorm_instances.py. But a lot of that stuff doesn't need to be in an object.

And if it's functional, then it can be tested in isolation.

But that's not in scope for this fix.

@epwalsh epwalsh merged commit d7994c8 into main Jun 26, 2024
11 of 12 checks passed
@epwalsh epwalsh deleted the epwalsh/fix-z-loss branch June 26, 2024 20:45
@AkshitaB (Contributor) commented

> > Can we add a test for this being correct?
>
> @dirkgr we've been YOLO-ing the trainer so far with 0 test coverage. This is bad, but setting up the boilerplate for proper trainer tests is a project in and of itself. My vision is that eventually we move and improve the trainer to OLMo-core, at which point we could add thorough tests.

Based on my recent experience with the code bugs and the lack of tests, I would rather we add tests now, as and when we encounter things, instead of waiting until we find time to write perfect test boilerplate. At minimum, the goal should be a place where we record the things that require testing and that have failed accidentally for lack of tests. We can always go back and refactor with better testing code.

4 participants