allenai/OLMo-Eval

OLMo-Eval

OLMo-Eval is a repository for evaluating open language models.

Note of Deprecation

NOTE: This repository has been superseded by the OLMES (Open Language Model Evaluation System) repository, available at https://github.com/allenai/olmes.

Overview

The olmo_eval framework runs evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of tango steps that compute the model outputs and metrics.

Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Using task_sets allows you to compute aggregate metrics for multiple tasks. The optional google-sheet integration can be used for reporting.
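As a rough illustration of the m models by t task_sets structure, the sketch below enumerates the (model, task_set) pairs a run would cover. The names are taken from elsewhere in this README; `evaluation_grid` is a hypothetical helper written for this example and is not part of olmo_eval.

```python
from itertools import product


def evaluation_grid(models, task_sets):
    """Return every (model, task_set) pair: m models x t task_sets."""
    return list(product(models, task_sets))


# Names borrowed from this README; any real run defines them in its config.
models = ["EleutherAI/pythia-1b", "falcon-7b"]
task_sets = ["gen_tasks", "standard_benchmarks"]

grid = evaluation_grid(models, task_sets)
for model, task_set in grid:
    print(f"{model} -> {task_set}")
```

With 2 models and 2 task_sets, the grid holds 4 pairs; each task_set may in turn expand to several individual tasks.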

The pipeline is built using ai2-tango and ai2-catwalk.

Installation

After cloning the repository, run:

conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .

Quickstart

The current task_sets can be found at configs/task_sets. In this example, we run gen_tasks on EleutherAI/pythia-1b. The example config is at configs/example_config.jsonnet.

The configuration can be run as follows:

tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace

This executes all the steps defined in the config, and saves them in a local tango workspace called my-eval-workspace. If you add a new task_set or model to your config and run the same command again, it will reuse the previous outputs, and only compute the new outputs.
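The reuse behaviour described above can be pictured with a toy cache keyed on a step's name and inputs. This is only a conceptual sketch: `StepCache` and `fake_eval` are hypothetical names, and tango's actual caching mechanism is more involved and may differ in its details.

```python
import hashlib
import json


class StepCache:
    """Toy cache keyed on a step's name and inputs (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.computed = []  # names of steps that were actually computed

    def _key(self, name, inputs):
        # Identical name + inputs always produce the same key.
        payload = json.dumps({"name": name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, name, inputs, fn):
        key = self._key(name, inputs)
        if key not in self._store:
            # Cache miss: compute and remember the output.
            self._store[key] = fn(**inputs)
            self.computed.append(name)
        return self._store[key]


def fake_eval(model, task_set):
    # Stand-in for an expensive evaluation step.
    return f"outputs for {model} on {task_set}"


cache = StepCache()
# First run: one model, one task_set.
cache.run("eval", {"model": "EleutherAI/pythia-1b", "task_set": "gen_tasks"}, fake_eval)
# Second run: the same pair again, plus a newly added task_set.
cache.run("eval", {"model": "EleutherAI/pythia-1b", "task_set": "gen_tasks"}, fake_eval)
cache.run("eval", {"model": "EleutherAI/pythia-1b", "task_set": "standard_benchmarks"}, fake_eval)

print(len(cache.computed))  # number of steps actually computed
```

Only the first call and the call with the new task_set trigger computation; the repeated call is served from the cache.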

The output should look like this:

[Screenshot of the pipeline output]

New models and datasets can be added by modifying the example configuration.

Load pipeline output

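from tango import Workspace

# Open the local workspace created by the run above.
workspace = Workspace.from_url("local://my-eval-workspace")
# Fetch the result of the step named "combine-all-outputs".
result = workspace.step_result("combine-all-outputs")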

Load individual task results with per instance outputs

result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
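When several step results are needed at once, a small wrapper can collect them and report any that are missing. The sketch below is a hypothetical helper, not part of olmo_eval or tango; it assumes that a lookup for an unknown step name surfaces as a `KeyError`, and it uses a plain dictionary as a stand-in for the workspace so that it runs without a completed evaluation.

```python
def load_step_results(get_result, step_names):
    """Load several step results; return (results, missing_names)."""
    results, missing = {}, []
    for name in step_names:
        try:
            results[name] = get_result(name)
        except KeyError:
            # Assumption: an unknown step name raises KeyError.
            missing.append(name)
    return results, missing


# Stand-in for a workspace: maps step names to placeholder results.
fake_store = {
    "combine-all-outputs": "combined placeholder",
    "outputs_pythia-1bstep140000_gen_tasks_drop": "per-instance placeholder",
}

results, missing = load_step_results(
    fake_store.__getitem__,
    ["combine-all-outputs", "not-a-real-step"],
)
print(sorted(results), missing)
```

With a real workspace, `workspace.step_result` would be passed in place of `fake_store.__getitem__`.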

Evaluating common models on standard benchmarks

The eval_table config evaluates falcon-7b, mpt-7b, llama2-7b, and llama2-13b on standard_benchmarks and MMLU. Run it as follows:

tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace

PALOMA

This repository was also used to run evaluations for the PALOMA paper.

Details on running the evaluation on PALOMA can be found here.

Advanced
