Support in-loop evals from oe-eval-internal request dump #685
I am pretty sure this is the case, but just to confirm: This uses the tokenizer we define in the model training config. It does not rely on pre-tokenized data. Correct?
Can we replace all existing in-loop evals with this?
def doc_to_text(self, doc) -> str:
    raise NotImplementedError

def doc_to_continuations(self, doc) -> List[str]:
    raise NotImplementedError

def doc_to_label(self, doc) -> int:
    raise NotImplementedError
These are already like that in the base class?
Yeah, but they have the @abc.abstractmethod decorator there, so mypy gets angry if I don't include them (and I was trying to mess minimally with the existing code).
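For context, here is a minimal sketch of the pattern being discussed; the class names are placeholders rather than the actual ones in olmo/eval/downstream.py:

import abc
from typing import List

class BaseTaskDataset(abc.ABC):
    @abc.abstractmethod
    def doc_to_text(self, doc) -> str:
        raise NotImplementedError

    @abc.abstractmethod
    def doc_to_continuations(self, doc) -> List[str]:
        raise NotImplementedError

class RequestDumpTask(BaseTaskDataset):
    # The requests come pre-built from the dump, so these hooks are never called,
    # but concrete overrides are still needed to keep mypy (and ABC instantiation) happy.
    def doc_to_text(self, doc) -> str:
        raise NotImplementedError

    def doc_to_continuations(self, doc) -> List[str]:
        raise NotImplementedError

RequestDumpTask()  # instantiating the subclass now works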
LGTM
Yes in principle, especially if we're not insisting on exact parity (the oe-eval defaults for prompt formatting, like "Q:" vs "Question:", might be slightly different). That might be a good point to decide exactly which suite of tasks we want by default in-loop (including few-shot and MC versions of tasks).
We don't insist on exact parity. Can you add configs and files for all the in-loop evals we have now to the peteish configs? I'm thinking we run them side by side for a little bit, and then switch over completely. Is there an obvious place where we could put the instructions for how to export an evaluation from oe-eval to here? We should be able to do this without Oyvind's help, especially people outside of AI2.
add 0-shot request dumps for olmes
Hi, we have now added these tasks to this same PR:
Some updates were made in the downstream.py file to capture the byte length of continuations, which can differ from the character length (see the small illustration below). A script was also added. I'll add a separate message below with a list of all the entries that can now be added to training configs. That is too many for any reasonable run, so we will have to decide what's wanted.
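As a quick illustration of the byte-length versus character-length distinction (plain Python, not code from the PR):

# A continuation containing a multi-byte UTF-8 character.
continuation = " café"
print(len(continuation))                   # 5 characters
print(len(continuation.encode("utf-8")))   # 6 bytes: "é" encodes to two bytes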
Here's the full list of new tasks that could be added to training configs:
No reason to put in the 0-shot ones other than for comparability with past runs?
I mean, it's good to have them. But I won't stick them in every config.
The 0-shot ones will be faster to run than the 5-shot ones (roughly 5x faster). If the 5-shot ones have better signal, they would be worth including. Possibly that's a decision to be made per task (and model size), especially if the resources used for the evaluation step become an issue.
olmo/eval/downstream.py
@@ -1292,6 +1314,7 @@ def __init__(
            dataset_name=dataset_names,
            split=split,
            prompts=prompts,
            metric_type=None,
Suggested change:
-            metric_type=None,
+            metric_type=metric_type,
Good catch, thanks!
Why are we missing the 0-shot versions of csqa and socialiqa?
I also got an error when running the sciq eval.
I'll add the 0-shot versions of csqa and socialiqa; they were missing because they weren't in the OLMo-paper list of tasks (they were added later), but for completeness we should have them as well. As for the sciq bug, that's interesting, I'll investigate ASAP!
Thanks a lot!
Fixed in #712. It was lucky we ran across this, as it was a really bad bug affecting all tasks (it only scored instances where the gold answer is the first one, and sciq happens to be set up such that the gold answer is always the last one).
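To make the failure mode concrete, here is an illustrative sketch of that kind of scoring bug, not the actual code from the PR or #712; predictions are assumed to be (predicted_index, gold_index) pairs:

def buggy_accuracy(predictions):
    # Only counts an instance as correct when its gold answer is option 0,
    # so a task like sciq, where the gold answer is always last, scores ~0.
    return sum(1 for pred, gold in predictions if gold == 0 and pred == gold) / len(predictions)

def fixed_accuracy(predictions):
    # Score every instance against its own gold index.
    return sum(1 for pred, gold in predictions if pred == gold) / len(predictions)

predictions = [(3, 3), (3, 3), (1, 3)]  # sciq-like: gold is always the last option
print(buggy_accuracy(predictions))  # 0.0
print(fixed_accuracy(predictions))  # ~0.67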
This introduces a new downstream task class which directly reads a request dump from running an oe-eval evaluation. This allows any task configuration (for now only "ranked classification" tasks) in oe-eval to be replicated as an in-loop eval. See some tentative instructions for how to set up tasks here.

The basic idea is to grab the request file from running oe-eval the normal way (alternatively, one can run without a model to just save the requests) and save it under olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Also put a pointer to it in label_to_task_map in olmo/eval/downstream.py and then reference it in the training .yaml file.

Possibly future features could be:
- Skip the label_to_task_map entry, and specify the path and metric directly in the training .yaml file
- Read request files from S3 rather than from olmo_data
This would allow adding new tasks without changing any OLMo code, just saving requests in S3 and updating the .yaml files.

An example task added in this PR is added to the training .yaml files using a config entry that references the task's label (sketched below).
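As a rough sketch of the registration step described above (the class name, constructor arguments, and exact map schema are assumptions for illustration, not copied from the PR):

from typing import Any, Dict, Tuple, Type

class OEEvalTask:  # hypothetical stand-in for the new request-dump task class
    def __init__(self, dataset_path: str, metric_type: str) -> None:
        self.dataset_path = dataset_path
        self.metric_type = metric_type

# Hypothetical shape of an entry in label_to_task_map (olmo/eval/downstream.py);
# the key is the label that a training .yaml file would then reference.
label_to_task_map: Dict[str, Tuple[Type[OEEvalTask], Dict[str, Any]]] = {
    "copycolors_xl_10way": (
        OEEvalTask,
        {
            "dataset_path": "oe_eval_tasks/copycolors_xl_10way/requests.jsonl",
            "metric_type": "acc",
        },
    ),
}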
This is a task from @sarahwie with one hundred 10-way MC questions of the type "Question: A frog is green. What color is a frog?\n A. green\n B. black", to test basic MC capabilities. There is also a version with 1000 questions, called copycolors_xl_10way, which cycles the answer choices (A->B, B->C, ...; a sketch of the idea follows below). This should have somewhat less noise than the 100-question one.

On the copycolors_xl_10way task, OLMo-7B-0424 scores 96%, while the 350B/400B/450B/500B/600B/700B/1T checkpoints score 10/10/55/22/28/45/78 respectively, showing the (somewhat bumpy) arrival of the capability.
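A minimal sketch of the answer-cycling idea (assumed logic, not the actual script used to build copycolors_xl_10way):

def cycle_choices(choices, gold_index, shift):
    # Rotate the options so the gold answer lands in a different position,
    # averaging out any position bias across the cycled copies.
    k = len(choices)
    rotated = [choices[(i - shift) % k] for i in range(k)]
    return rotated, (gold_index + shift) % k

choices = ["green", "black", "blue", "red"]
for shift in range(len(choices)):
    print(cycle_choices(choices, gold_index=0, shift=shift))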
I also added an arc_challenge_rc_0shot task to match the current arc_challenge task, to verify that they give identical numbers (wandb link).