
Support in-loop evals from oe-eval-internal request dump #685

Merged: 25 commits merged into main from ot-oe-eval-requests on Aug 20, 2024

Conversation

@OyvindTafjord (Contributor) commented on Aug 1, 2024

This introduces a new downstream task class which reads a request dump directly from running an oe-eval evaluation. This allows any task configuration in oe-eval (for now, only "ranked classification" tasks) to be replicated as an in-loop eval. See some tentative instructions for how to set up tasks here.

The basic idea is to grab the request file from running oe-eval the normal way (alternatively, one can run without a model to just save the requests) and save it under olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Then add a pointer to it in label_to_task_map in olmo/eval/downstream.py and reference it in the training .yaml file.
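For concreteness, here is a minimal sketch of loading such a saved request dump; the JSONL field contents are assumptions for illustration, and this is not the actual task class added in the PR:

# Minimal sketch (not the PR's task class): load a saved oe-eval request dump
# from olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Each line is assumed
# to be one JSON object describing a ranked-classification request.
import json
from pathlib import Path
from typing import Any, Dict, List

def load_oe_eval_requests(task_name: str, data_root: str = "olmo_data/oe_eval_tasks") -> List[Dict[str, Any]]:
    path = Path(data_root) / task_name / "requests.jsonl"
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage: inspect the first request of the task added in this PR.
# requests = load_oe_eval_requests("copycolors_10way")
# print(requests[0])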

Possible future features could be:

  • Optionally skip the reference in label_to_task_map, and specify the path and metric directly in the training .yaml file
  • Allow the path to be in S3 as well as in olmo_data

This would allow adding new tasks without changing any OLMo code, just by saving requests in S3 and updating the .yaml files.

An example task added in this PR can be included in training .yaml files using:

  - label: copycolors_10way
    type: downstream

This is a task from @sarahwie with one hundred 10-way MC questions of the type Question: A frog is green. What color is a frog?\n A. green\n B. black, to test basic MC capabilities.

There is also a version with 1000 questions, called copycolors_xl_10way, which cycles the answer choices (A->B, B->C, ...); a sketch of this cycling is shown below. This should have somewhat less noise than the 100-question one.
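As a hedged illustration of the cycling (a sketch only, not the actual generation code; the choice list below is made up):

# Cycle the answer choices (A->B, B->C, ...) so each base question yields one
# instance per rotation, with the gold answer landing on a different letter
# every time. This averages out any position bias and reduces noise.
def cycle_choices(choices, gold_index, shift):
    n = len(choices)
    rotated = [choices[(i - shift) % n] for i in range(n)]
    return rotated, (gold_index + shift) % n

choices = ["green", "black", "blue", "red", "yellow",
           "purple", "orange", "white", "brown", "pink"]
for shift in range(len(choices)):
    rotated, gold = cycle_choices(choices, gold_index=0, shift=shift)
    print(f"gold answer at {chr(ord('A') + gold)}: {rotated[gold]}")

With 100 base questions and 10 rotations each, this yields the 1000 instances of copycolors_xl_10way.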

On the copycolors_xl_10way task, OLMo-7B-0424 scores 96% while the 350B/400B/450B/500B/600B/700B/1T checkpoints score 10/10/55/22/28/45/78 respectively, showing the (somewhat bumpy) arrival of capability.

I also added an arc_challenge_rc_0shot task to match the current arc_challenge task and verify they give identical numbers (wandb link):
[image]

@OyvindTafjord requested review from epwalsh and dirkgr on August 1, 2024
@dirkgr (Member) left a comment

I am pretty sure this is the case, but just to confirm: This uses the tokenizer we define in the model training config. It does not rely on pre-tokenized data. Correct?

Can we replace all existing in-loop evals with this?

Comment on lines +1554 to +1561
def doc_to_text(self, doc) -> str:
    raise NotImplementedError

def doc_to_continuations(self, doc) -> List[str]:
    raise NotImplementedError

def doc_to_label(self, doc) -> int:
    raise NotImplementedError
@dirkgr (Member)

These are already like that in the base class?

@OyvindTafjord (Contributor, Author)

Yeah, but they have the @abc.abstractmethod there so mypy gets angry if I don't include them (and I was trying to mess minimally with existing code)

@epwalsh (Member) left a comment

LGTM

olmo/eval/downstream.py (outdated review comment, resolved)
@OyvindTafjord (Contributor, Author)

> Can we replace all existing in-loop evals with this?

Yes in principle, especially if we're not insisting on exact parity (the defaults in oe-eval in prompt formatting, like "Q:" vs "Question:", might be slightly different). That might be a good point to decide exactly what suite of tasks we want by default in-loop (including few-shot and MC versions of tasks).

@dirkgr (Member) commented on Aug 2, 2024

We don't insist on exact parity. Can you add configs and files for all the in-loop evals we have now, to the peteish configs? I'm thinking we run them side-by-side for a little bit, and then switch over completely.

Is there an obvious place where we could put the instructions for how to export an evaluation from oe-eval to here? We should be able to do this without Oyvind's help. Especially people outside of AI2.

@OyvindTafjord (Contributor, Author)

Hi, we have now added these tasks to this same PR:

  • 0-shot tasks similar to existing ones (arc_challenge, arc_easy, boolq, copa, hellaswag, openbookqa, piqa, sciq, winogrande)
  • 5-shot tasks from OLMES (the 9 non-MMLU ones), using validation sets. Both MC and RC versions
  • Bits-per-byte (bpb) versions of all these tasks as well
  • bpb versions of the original MMLU validation set RC tasks (regular and _var versions)

There are also some updates in the downstream.py file to capture the byte length of continuations (which can differ from the character length); see the bpb sketch below.
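For reference, a sketch of a bits-per-byte computation under the usual definition (summed negative log-likelihood of the gold continuations, in nats, divided by ln 2 times the total UTF-8 byte length); the exact implementation in downstream.py may differ, and the NLL value below is made up:

# bits-per-byte = NLL_in_nats / (ln(2) * number_of_UTF8_bytes)
import math

def bits_per_byte(total_nll_nats, total_bytes):
    return total_nll_nats / (math.log(2) * total_bytes)

continuation = " green"
num_bytes = len(continuation.encode("utf-8"))  # byte length, not character length
print(bits_per_byte(total_nll_nats=2.3, total_bytes=num_bytes))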

I also added the script scripts/list_evals_from_oe_eval.py, which lists all the tasks currently in olmo_data/oe_eval_tasks (with example requests if needed) along with their entries for downstream.py and for training configs.
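A rough sketch of what such a listing could look like (this is not the actual script; it only assumes the olmo_data/oe_eval_tasks/<task_name>/requests.jsonl layout described above):

# Enumerate task directories that contain a requests.jsonl file and print the
# corresponding training-config entries.
from pathlib import Path

data_root = Path("olmo_data/oe_eval_tasks")
for task_dir in sorted(data_root.iterdir()):
    if (task_dir / "requests.jsonl").exists():
        print(f"  - label: {task_dir.name}")
        print("    type: downstream")
        print()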

I'll add a separate message below with a list of all the entries that can now be added to training configs. This is too many for any reasonable run, so we'll have to decide what's wanted.

Screenshot from a sample run:
[image]

@OyvindTafjord (Contributor, Author)

Here's the full list of new tasks that could be added to training configs:

  - label: mmlu_stem_bpb
    type: downstream

  - label: mmlu_humanities_bpb
    type: downstream

  - label: mmlu_social_sciences_bpb
    type: downstream

  - label: mmlu_other_bpb
    type: downstream

  - label: mmlu_stem_var_bpb
    type: downstream

  - label: mmlu_humanities_var_bpb
    type: downstream

  - label: mmlu_social_sciences_var_bpb
    type: downstream

  - label: mmlu_other_var_bpb
    type: downstream

  - label: arc_challenge_mc_5shot
    type: downstream

  - label: arc_challenge_mc_5shot_bpb
    type: downstream

  - label: arc_challenge_rc_0shot
    type: downstream

  - label: arc_challenge_rc_0shot_bpb
    type: downstream

  - label: arc_challenge_rc_5shot
    type: downstream

  - label: arc_challenge_rc_5shot_bpb
    type: downstream

  - label: arc_easy_mc_5shot
    type: downstream

  - label: arc_easy_mc_5shot_bpb
    type: downstream

  - label: arc_easy_rc_0shot
    type: downstream

  - label: arc_easy_rc_0shot_bpb
    type: downstream

  - label: arc_easy_rc_5shot
    type: downstream

  - label: arc_easy_rc_5shot_bpb
    type: downstream

  - label: boolq_mc_5shot
    type: downstream

  - label: boolq_mc_5shot_bpb
    type: downstream

  - label: boolq_rc_0shot
    type: downstream

  - label: boolq_rc_0shot_bpb
    type: downstream

  - label: boolq_rc_5shot
    type: downstream

  - label: boolq_rc_5shot_bpb
    type: downstream

  - label: copa_rc_0shot
    type: downstream

  - label: copa_rc_0shot_bpb
    type: downstream

  - label: copycolors_10way
    type: downstream

  - label: copycolors_10way_bpb
    type: downstream

  - label: copycolors_xl_10way
    type: downstream

  - label: copycolors_xl_10way_bpb
    type: downstream

  - label: csqa_mc_5shot
    type: downstream

  - label: csqa_mc_5shot_bpb
    type: downstream

  - label: csqa_rc_5shot
    type: downstream

  - label: csqa_rc_5shot_bpb
    type: downstream

  - label: hellaswag_mc_5shot
    type: downstream

  - label: hellaswag_mc_5shot_bpb
    type: downstream

  - label: hellaswag_rc_0shot
    type: downstream

  - label: hellaswag_rc_0shot_bpb
    type: downstream

  - label: hellaswag_rc_5shot
    type: downstream

  - label: hellaswag_rc_5shot_bpb
    type: downstream

  - label: openbookqa_mc_5shot
    type: downstream

  - label: openbookqa_mc_5shot_bpb
    type: downstream

  - label: openbookqa_rc_0shot
    type: downstream

  - label: openbookqa_rc_0shot_bpb
    type: downstream

  - label: openbookqa_rc_5shot
    type: downstream

  - label: openbookqa_rc_5shot_bpb
    type: downstream

  - label: piqa_mc_5shot
    type: downstream

  - label: piqa_mc_5shot_bpb
    type: downstream

  - label: piqa_rc_0shot
    type: downstream

  - label: piqa_rc_0shot_bpb
    type: downstream

  - label: piqa_rc_5shot
    type: downstream

  - label: piqa_rc_5shot_bpb
    type: downstream

  - label: sciq_rc_0shot
    type: downstream

  - label: sciq_rc_0shot_bpb
    type: downstream

  - label: socialiqa_mc_5shot
    type: downstream

  - label: socialiqa_mc_5shot_bpb
    type: downstream

  - label: socialiqa_rc_5shot
    type: downstream

  - label: socialiqa_rc_5shot_bpb
    type: downstream

  - label: winogrande_mc_5shot
    type: downstream

  - label: winogrande_mc_5shot_bpb
    type: downstream

  - label: winogrande_rc_0shot
    type: downstream

  - label: winogrande_rc_0shot_bpb
    type: downstream

  - label: winogrande_rc_5shot
    type: downstream

  - label: winogrande_rc_5shot_bpb
    type: downstream

@dirkgr (Member) commented on Aug 19, 2024

No reason to put in the 0-shot ones other than for comparability with past runs?

@dirkgr (Member) commented on Aug 19, 2024

I mean, it's good to have them. But I won't stick them in every config.

@OyvindTafjord (Contributor, Author)

The 0-shot ones will be faster to run than the 5-shot ones (roughly 5x faster). If the 5-shot ones have better signal, they would be worth including. Possibly that's a decision to be made per task (and model size), especially if the resources used on the evaluation step become an issue.

@@ -1292,6 +1314,7 @@ def __init__(
    dataset_name=dataset_names,
    split=split,
    prompts=prompts,
    metric_type=None,
A Collaborator left a review comment:
Suggested change:
- metric_type=None,
+ metric_type=metric_type,

@OyvindTafjord (Contributor, Author)

Good catch, thanks!

@OyvindTafjord merged commit 9147889 into main on Aug 20, 2024
11 of 12 checks passed
@OyvindTafjord deleted the ot-oe-eval-requests branch on August 20, 2024 at 06:25
@liujch1998 (Contributor)

Why are we missing csqa_rc_0shot_bpb and socialiqa_rc_0shot_bpb in the list?

@liujch1998 (Contributor)

I got the following error when running sciq_rc_0shot_bpb:

2024-08-23T17:55:29.295911089Z 2024-08-23 10:55:29.295	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1005	INFO	Running evaluation for 'sciq_rc_0shot_bpb'...
2024-08-23T17:55:29.368329339Z 2024-08-23 10:55:29.367	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1029	INFO	[eval_step=1/-1]
2024-08-23T17:55:29.394690515Z 2024-08-23 10:55:29.394	jupiter-cs-aus-110.reviz.ai2.in:0	olmo.train:1029	INFO	[eval_step=2/-1]
2024-08-23T17:55:31.758315366Z 2024-08-23 10:55:31.756	jupiter-cs-aus-110.reviz.ai2.in:2	olmo.util:165	CRITICAL	Uncaught ZeroDivisionError: division by zero
2024-08-23T17:55:31.758335471Z Traceback (most recent call last):
2024-08-23T17:55:31.758339730Z   File "/gantry-runtime/scripts/ladder.py", line 511, in <module>
2024-08-23T17:55:31.758341663Z     args.func(args)
2024-08-23T17:55:31.758343421Z   File "/gantry-runtime/scripts/ladder.py", line 441, in train_cmd
2024-08-23T17:55:31.758344823Z     main(cfg)
2024-08-23T17:55:31.758346276Z   File "/gantry-runtime/scripts/train.py", line 321, in main
2024-08-23T17:55:31.758347640Z     trainer.fit()
2024-08-23T17:55:31.758349128Z   File "/gantry-runtime/olmo/train.py", line 1298, in fit
2024-08-23T17:55:31.758350471Z     eval_metrics = self.eval()
2024-08-23T17:55:31.758351948Z                    ^^^^^^^^^^^
2024-08-23T17:55:31.758353205Z   File "/gantry-runtime/olmo/train.py", line 1032, in eval
2024-08-23T17:55:31.758354750Z     metrics = evaluator.compute_metrics()
2024-08-23T17:55:31.758356093Z               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758357589Z   File "/gantry-runtime/olmo/eval/evaluator.py", line 32, in compute_metrics
2024-08-23T17:55:31.758358973Z     value = self.eval_metric.compute().item()
2024-08-23T17:55:31.758360445Z             ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758361726Z   File "/opt/conda/lib/python3.11/site-packages/torchmetrics/metric.py", line 632, in wrapped_func
2024-08-23T17:55:31.758363276Z     value = _squeeze_if_scalar(compute(*args, **kwargs))
2024-08-23T17:55:31.758364558Z                                ^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-23T17:55:31.758365862Z   File "/gantry-runtime/olmo/eval/downstream.py", line 158, in compute
2024-08-23T17:55:31.758367199Z     score = sum(correct) / len(correct)
2024-08-23T17:55:31.758368407Z             ~~~~~~~~~~~~~^~~~~~~~~~~~~~
2024-08-23T17:55:31.758369648Z ZeroDivisionError: division by zero

@OyvindTafjord (Contributor, Author)

I'll add the 0-shot versions of csqa and socialiqa. They were missing because they weren't in the OLMo-paper list of tasks and were added later, but for completeness we should have them as well.

As for the sciq bug, that's interesting, I'll investigate asap!

@liujch1998 (Contributor) commented on Aug 23, 2024

Thanks a lot!
Other tasks ending with _rc_0shot_bpb ran successfully for me; the only buggy one is sciq_rc_0shot_bpb.

@OyvindTafjord (Contributor, Author)

Fixed in #712. It was lucky we ran across this, as it was a really bad bug affecting all tasks: it only scored instances where the gold answer is the first one, and sciq happens to be set up such that the gold answer is always the last one.
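A hypothetical illustration of that failure mode (not the actual downstream.py code): if scoring effectively keeps only instances whose gold answer index is 0, then for sciq, where the gold answer is always the last option, nothing is ever scored and the final division fails exactly as in the traceback above.

# Buggy filter keeps only instances with gold index 0, so `correct` stays empty
# for sciq-style data (gold always last) and the division raises ZeroDivisionError.
predictions = [3, 3, 2, 3]   # predicted answer index per instance
gold_labels = [3, 3, 3, 3]   # sciq-style: gold is always the last of 4 options

correct = [int(p == g) for p, g in zip(predictions, gold_labels) if g == 0]
try:
    score = sum(correct) / len(correct)
except ZeroDivisionError:
    print("No instances were scored: the gold answer is never at index 0 for sciq.")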
