
Support in-loop evals from oe-eval-internal request dump #685

Merged: 25 commits merged into main from ot-oe-eval-requests on Aug 20, 2024

Conversation

@OyvindTafjord (Contributor) commented on Aug 1, 2024

This introduces a new downstream task class which reads a request dump directly from running an oe-eval evaluation. This allows any task configuration in oe-eval (for now, only "ranked classification" tasks) to be replicated as an in-loop eval. See some tentative instructions for how to set up tasks here.

The basic idea is to grab the request file from running oe-eval the normal way (alternatively, one can run without a model to just save the requests) and save it under olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Then add a pointer to it in label_to_task_map in olmo/eval/downstream.py and reference it in the training .yaml file.
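For concreteness, here is a minimal sketch of loading such a saved request dump; the JSONL field contents are assumptions for illustration, and this is not the actual task class added in the PR:

# Minimal sketch (not the PR's task class): load a saved oe-eval request dump
# from olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Each line is assumed
# to be one JSON object describing a ranked-classification request.
import json
from pathlib import Path
from typing import Any, Dict, List

def load_oe_eval_requests(task_name: str, data_root: str = "olmo_data/oe_eval_tasks") -> List[Dict[str, Any]]:
    path = Path(data_root) / task_name / "requests.jsonl"
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage: inspect the first request of the task added in this PR.
# requests = load_oe_eval_requests("copycolors_10way")
# print(requests[0])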

Possible future features could be:

  • Optionally skip the reference in label_to_task_map, and specify the path and metric directly in the training .yaml file
  • Allow the path to be in S3 as well as in olmo_data

This would allow adding new tasks without changing any OLMo code, just by saving requests in S3 and updating the .yaml files.

An example task added in this PR can be included in training .yaml files using:

  - label: copycolors_10way
    type: downstream

This is a task from @sarahwie with one hundred 10-way MC questions of the type Question: A frog is green. What color is a frog?\n A. green\n B. black, to test basic MC capabilities.

There is also a version with 1000 questions, called copycolors_xl_10way, which cycles the answer choices (A->B, B->C, ...); a sketch of this cycling is shown below. This should have somewhat less noise than the 100-question one.
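As a hedged illustration of the cycling (a sketch only, not the actual generation code; the choice list below is made up):

# Cycle the answer choices (A->B, B->C, ...) so each base question yields one
# instance per rotation, with the gold answer landing on a different letter
# every time. This averages out any position bias and reduces noise.
def cycle_choices(choices, gold_index, shift):
    n = len(choices)
    rotated = [choices[(i - shift) % n] for i in range(n)]
    return rotated, (gold_index + shift) % n

choices = ["green", "black", "blue", "red", "yellow",
           "purple", "orange", "white", "brown", "pink"]
for shift in range(len(choices)):
    rotated, gold = cycle_choices(choices, gold_index=0, shift=shift)
    print(f"gold answer at {chr(ord('A') + gold)}: {rotated[gold]}")

With 100 base questions and 10 rotations each, this yields the 1000 instances of copycolors_xl_10way.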

On the copycolors_xl_10way task, OLMo-7B-0424 scores 96% while the 350B/400B/450B/500B/600B/700B/1T checkpoints score 10/10/55/22/28/45/78 respectively, showing the (somewhat bumpy) arrival of capability.

I also added an arc_challenge_rc_0shot task to match the current arc_challenge task and verify they give identical numbers (wandb link):
[image]

@OyvindTafjord requested review from epwalsh and dirkgr on August 1, 2024
@dirkgr (Member) left a comment

I am pretty sure this is the case, but just to confirm: This uses the tokenizer we define in the model training config. It does not rely on pre-tokenized data. Correct?

Can we replace all existing in-loop evals with this?

Comment on lines +1554 to +1561
def doc_to_text(self, doc) -> str:
    raise NotImplementedError

def doc_to_continuations(self, doc) -> List[str]:
    raise NotImplementedError

def doc_to_label(self, doc) -> int:
    raise NotImplementedError
@dirkgr (Member)

These are already like that in the base class?

@OyvindTafjord (Contributor, Author)

Yeah, but they have the @abc.abstractmethod there so mypy gets angry if I don't include them (and I was trying to mess minimally with existing code)

@epwalsh (Member) left a comment

LGTM

olmo/eval/downstream.py (outdated review comment, resolved)
@OyvindTafjord (Contributor, Author)

> Can we replace all existing in-loop evals with this?

Yes in principle, especially if we're not insisting on exact parity (the defaults in oe-eval in prompt formatting, like "Q:" vs "Question:", might be slightly different). That might be a good point to decide exactly what suite of tasks we want by default in-loop (including few-shot and MC versions of tasks).

@dirkgr (Member) commented on Aug 2, 2024

We don't insist on exact parity. Can you add configs and files for all the in-loop evals we have now, to the peteish configs? I'm thinking we run them side-by-side for a little bit, and then switch over completely.

Is there an obvious place where we could put the instructions for how to export an evaluation from oe-eval to here? We should be able to do this without Oyvind's help. Especially people outside of AI2.

@OyvindTafjord (Contributor, Author)

Hi, we have now added these tasks to this same PR:

  • 0-shot tasks similar to existing ones (arc_challenge, arc_easy, boolq, copa, hellaswag, openbookqa, piqa, sciq, winogrande)
  • 5-shot tasks from OLMES (the 9 non-MMLU ones), using validation sets. Both MC and RC versions
  • Bits-per-byte (bpb) versions of all these tasks as well
  • bpb versions of the original MMLU validation set RC tasks (regular and _var versions)

There are also some updates in the downstream.py file to capture the byte length of continuations (which can differ from the character length); see the bpb sketch below.
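For reference, a sketch of a bits-per-byte computation under the usual definition (summed negative log-likelihood of the gold continuations, in nats, divided by ln 2 times the total UTF-8 byte length); the exact implementation in downstream.py may differ, and the NLL value below is made up:

# bits-per-byte = NLL_in_nats / (ln(2) * number_of_UTF8_bytes)
import math

def bits_per_byte(total_nll_nats, total_bytes):
    return total_nll_nats / (math.log(2) * total_bytes)

continuation = " green"
num_bytes = len(continuation.encode("utf-8"))  # byte length, not character length
print(bits_per_byte(total_nll_nats=2.3, total_bytes=num_bytes))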

I also added the script scripts/list_evals_from_oe_eval.py, which lists all the tasks currently in olmo_data/oe_eval_tasks (with example requests if needed) along with their entries for downstream.py and for training configs.
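A rough sketch of what such a listing could look like (this is not the actual script; it only assumes the olmo_data/oe_eval_tasks/<task_name>/requests.jsonl layout described above):

# Enumerate task directories that contain a requests.jsonl file and print the
# corresponding training-config entries.
from pathlib import Path

data_root = Path("olmo_data/oe_eval_tasks")
for task_dir in sorted(data_root.iterdir()):
    if (task_dir / "requests.jsonl").exists():
        print(f"  - label: {task_dir.name}")
        print("    type: downstream")
        print()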

I'll add a separate message below with a list of all the entries that can now be added to training configs. This is too many for any reasonable run, so we'll have to decide what's wanted.

Screenshot from a sample run:
[image]

@OyvindTafjord (Contributor, Author)

Here's the full list of new tasks that could be added to training configs:

  - label: mmlu_stem_bpb
    type: downstream

  - label: mmlu_humanities_bpb
    type: downstream

  - label: mmlu_social_sciences_bpb
    type: downstream

  - label: mmlu_other_bpb
    type: downstream

  - label: mmlu_stem_var_bpb
    type: downstream

  - label: mmlu_humanities_var_bpb
    type: downstream

  - label: mmlu_social_sciences_var_bpb
    type: downstream

  - label: mmlu_other_var_bpb
    type: downstream

  - label: arc_challenge_mc_5shot
    type: downstream

  - label: arc_challenge_mc_5shot_bpb
    type: downstream

  - label: arc_challenge_rc_0shot
    type: downstream

  - label: arc_challenge_rc_0shot_bpb
    type: downstream

  - label: arc_challenge_rc_5shot
    type: downstream

  - label: arc_challenge_rc_5shot_bpb
    type: downstream

  - label: arc_easy_mc_5shot
    type: downstream

  - label: arc_easy_mc_5shot_bpb
    type: downstream

  - label: arc_easy_rc_0shot
    type: downstream

  - label: arc_easy_rc_0shot_bpb
    type: downstream

  - label: arc_easy_rc_5shot
    type: downstream

  - label: arc_easy_rc_5shot_bpb
    type: downstream

  - label: boolq_mc_5shot
    type: downstream

  - label: boolq_mc_5shot_bpb
    type: downstream

  - label: boolq_rc_0shot
    type: downstream

  - label: boolq_rc_0shot_bpb
    type: downstream

  - label: boolq_rc_5shot
    type: downstream

  - label: boolq_rc_5shot_bpb
    type: downstream

  - label: copa_rc_0shot
    type: downstream

  - label: copa_rc_0shot_bpb
    type: downstream

  - label: copycolors_10way
    type: downstream

  - label: copycolors_10way_bpb
    type: downstream

  - label: copycolors_xl_10way
    type: downstream

  - label: copycolors_xl_10way_bpb
    type: downstream

  - label: csqa_mc_5shot
    type: downstream

  - label: csqa_mc_5shot_bpb
    type: downstream

  - label: csqa_rc_5shot
    type: downstream

  - label: csqa_rc_5shot_bpb
    type: downstream

  - label: hellaswag_mc_5shot
    type: downstream

  - label: hellaswag_mc_5shot_bpb
    type: downstream

  - label: hellaswag_rc_0shot
    type: downstream

  - label: hellaswag_rc_0shot_bpb
    type: downstream

  - label: hellaswag_rc_5shot
    type: downstream

  - label: hellaswag_rc_5shot_bpb
    type: downstream

  - label: openbookqa_mc_5shot
    type: downstream

  - label: openbookqa_mc_5shot_bpb
    type: downstream

  - label: openbookqa_rc_0shot
    type: downstream

  - label: openbookqa_rc_0shot_bpb
    type: downstream

  - label: openbookqa_rc_5shot
    type: downstream

  - label: openbookqa_rc_5shot_bpb
    type: downstream

  - label: piqa_mc_5shot
    type: downstream

  - label: piqa_mc_5shot_bpb
    type: downstream

  - label: piqa_rc_0shot
    type: downstream

  - label: piqa_rc_0shot_bpb
    type: downstream

  - label: piqa_rc_5shot
    type: downstream

  - label: piqa_rc_5shot_bpb
    type: downstream

  - label: sciq_rc_0shot
    type: downstream

  - label: sciq_rc_0shot_bpb
    type: downstream

  - label: socialiqa_mc_5shot
    type: downstream

  - label: socialiqa_mc_5shot_bpb
    type: downstream

  - label: socialiqa_rc_5shot
    type: downstream

  - label: socialiqa_rc_5shot_bpb
    type: downstream

  - label: winogrande_mc_5shot
    type: downstream

  - label: winogrande_mc_5shot_bpb
    type: downstream

  - label: winogrande_rc_0shot
    type: downstream

  - label: winogrande_rc_0shot_bpb
    type: downstream

  - label: winogrande_rc_5shot
    type: downstream

  - label: winogrande_rc_5shot_bpb
    type: downstream

@dirkgr (Member) commented on Aug 19, 2024

No reason to put in the 0-shot ones other than for comparability with past runs?

@dirkgr (Member) commented on Aug 19, 2024

I mean, it's good to have them. But I won't stick them in every config.

@OyvindTafjord (Contributor, Author)

The 0-shot ones will be faster to run than the 5-shot ones (roughly 5x faster). If the 5-shot ones have better signal, they would be worth including. Possibly that's a decision to be made per task (and model size), especially if the resources used on the evaluation step become an issue.

@@ -1292,6 +1314,7 @@ def __init__(
    dataset_name=dataset_names,
    split=split,
    prompts=prompts,
    metric_type=None,
A Collaborator left a review comment:
Suggested change:
- metric_type=None,
+ metric_type=metric_type,

@OyvindTafjord (Contributor, Author)

Good catch, thanks!

@OyvindTafjord merged commit 9147889 into main on Aug 20, 2024
11 of 12 checks passed
@OyvindTafjord deleted the ot-oe-eval-requests branch on August 20, 2024 at 06:25
@liujch1998 (Contributor)

Why are we missing csqa_rc_0shot_bpb and socialiqa_rc_0shot_bpb in the list?

@liujch1998 (Contributor)

I got the following error when running sciq_rc_0shot_bpb:

2024-08-23T17:55:29.295911089Z 2024-08-23 10:55:29.295	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1005	INFO	Running evaluation for 'sciq_rc_0shot_bpb'...
2024-08-23T17:55:29.368329339Z 2024-08-23 10:55:29.367	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1029	INFO	[eval_step=1/-1]
2024-08-23T17:55:29.394690515Z 2024-08-23 10:55:29.394	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1029	INFO	[eval_step=2/-1]
2024-08-23T17:55:31.758315366Z 2024-08-23 10:55:31.756	jupiter-cs-aus-110.reviz.ai2.in:2	olmo.util:165	CRITICAL	Uncaught ZeroDivisionError: division by zero
2024-08-23T17:55:31.758335471Z Traceback (most recent call last):
2024-08-23T17:55:31.758339730Z   File "/gantry-runtime/scripts/ladder.py", line 511, in <module>
2024-08-23T17:55:31.758341663Z     args.func(args)
2024-08-23T17:55:31.758343421Z   File "/gantry-runtime/scripts/ladder.py", line 441, in train_cmd
2024-08-23T17:55:31.758344823Z     main(cfg)
2024-08-23T17:55:31.758346276Z   File "/gantry-runtime/scripts/train.py", line 321, in main
2024-08-23T17:55:31.758347640Z     trainer.fit()
2024-08-23T17:55:31.758349128Z   File "/gantry-runtime/olmo/train.py", line 1298, in fit
2024-08-23T17:55:31.758350471Z     eval_metrics = self.eval()
2024-08-23T17:55:31.758351948Z                    ^^^^^^^^^^^
2024-08-23T17:55:31.758353205Z   File "/gantry-runtime/olmo/train.py", line 1032, in eval
2024-08-23T17:55:31.758354750Z     metrics = evaluator.compute_metrics()
2024-08-23T17:55:31.758356093Z               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758357589Z   File "/gantry-runtime/olmo/eval/evaluator.py", line 32, in compute_metrics
2024-08-23T17:55:31.758358973Z     value = self.eval_metric.compute().item()
2024-08-23T17:55:31.758360445Z             ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758361726Z   File "/opt/conda/lib/python3.11/site-packages/torchmetrics/metric.py", line 632, in wrapped_func
2024-08-23T17:55:31.758363276Z     value = _squeeze_if_scalar(compute(*args, **kwargs))
2024-08-23T17:55:31.758364558Z                                ^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758365862Z   File "/gantry-runtime/olmo/eval/downstream.py", line 158, in compute
2024-08-23T17:55:31.758367199Z     score = sum(correct) / len(correct)
2024-08-23T17:55:31.758368407Z             ~~~~~~~~~~~~~^~~~~~~~~~~~~~
2024-08-23T17:55:31.758369648Z ZeroDivisionError: division by zero

@OyvindTafjord (Contributor, Author)

I'll add the 0-shot versions of csqa and socialiqa. They were missing because they weren't in the OLMo-paper list of tasks and were added later, but for completeness we should have them as well.

As for the sciq bug, that's interesting, I'll investigate asap!

@liujch1998 (Contributor) commented on Aug 23, 2024

Thanks a lot!
Other tasks ending with _rc_0shot_bpb ran successfully for me; the only buggy one is sciq_rc_0shot_bpb.

@OyvindTafjord (Contributor, Author)

Fixed in #712. It was lucky we ran across this, as it was a really bad bug affecting all tasks: it only scored instances where the gold answer is the first one, and sciq happens to be set up such that the gold answer is always the last one.
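A hypothetical illustration of that failure mode (not the actual downstream.py code): if scoring effectively keeps only instances whose gold answer index is 0, then for sciq, where the gold answer is always the last option, nothing is ever scored and the final division fails exactly as in the traceback above.

# Buggy filter keeps only instances with gold index 0, so `correct` stays empty
# for sciq-style data (gold always last) and the division raises ZeroDivisionError.
predictions = [3, 3, 2, 3]   # predicted answer index per instance
gold_labels = [3, 3, 3, 3]   # sciq-style: gold is always the last of 4 options

correct = [int(p == g) for p, g in zip(predictions, gold_labels) if g == 0]
try:
    score = sum(correct) / len(correct)
except ZeroDivisionError:
    print("No instances were scored: the gold answer is never at index 0 for sciq.")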
