See Quick Start for setting up your own evaluation.
Large language models (LLMs) aligned with humans are reshaping AI research and applications, yet comprehensive and reliable evaluation of them remains an open problem. As an initial attempt, we present STEPS, a Systematic TEsting PropoSal tailored to the progressive cognitive abilities of chat-based LLMs. Inspired by taxonomies from cognitive science, we organize the identified abilities of LLMs into 5 progressive levels: Task Knowledge, Test-Taking, Grounding, Resourcefulness, and Decisiveness. On top of this design, we compile and create a series of novel tasks, settings, datasets, and environments, together with a scalable and easy-to-use toolkit for unified LLM evaluation. Our extensive testing of API-based and open-sourced chat-based LLMs reveals that, while the gaps between proprietary models and open-sourced competitors are tolerable at the preliminary levels (e.g., I & II), their performance on advanced challenges (e.g., IV & V) is poles apart. In particular, STEPS exposes a significant gap between GPT-4 and all other models. We invite the community to join the effort to review and benchmark our current progress and limitations holistically.
Model | I | II | III | IV | V | AVG |
---|---|---|---|---|---|---|
gpt-4 | 60.2 | 57.2 | 52.5 | 61.2 | 48.6 | 55.9 |
gpt-3.5-turbo | 48.9 | 48.3 | 48.8 | 56.8 | 27.2 | 46.0 |
claude-v1.3 | 49.1 | 47.8 | 46.8 | 50.9 | 31.2 | 45.2 |
claude-instant-v1.1 | 46.0 | 46.8 | 45.1 | 42.4 | 27.8 | 41.6 |
text-davinci-003 | 46.5 | 42.5 | 46.8 | 38.6 | 23.4 | 39.5 |
text-davinci-002 | 41.4 | 41.6 | 45.6 | 32.0 | 15.2 | 35.2 |
text-bison-001 | 44.0 | / | 46.9 | 25.3 | 15.0 | 32.8 |
chatglm-130b | 41.6 | 42.4 | 44.6 | 20.9 | 5.0 | 30.9 |
chatglm-6b | 36.8 | 36.0 | 43.0 | 13.3 | 3.0 | 26.4 |
vicuna-13b | 34.4 | 28.2 | 42.6 | 20.9 | 3.6 | 25.9 |
bloomz-7b1-mt | 38.1 | 36.1 | 43.0 | 2.6 | 2.4 | 24.4 |
vicuna-7b | 32.2 | 27.1 | 42.2 | 13.7 | 1.7 | 23.4 |
dolly-v2-12b | 29.1 | 29.6 | 37.7 | 11.4 | 3.2 | 22.2 |
bloomz-7b1 | 35.9 | 27.2 | 42.3 | 2.9 | 2.1 | 22.1 |
koala-13b | 28.1 | 27.1 | 41.8 | 11.1 | 0.7 | 21.8 |
moss-moon-003-sft | 26.5 | 26.2 | 40.9 | 9.8 | 0.1 | 20.7 |
oasst-sft-4-pythia-12b | 26.2 | 28.3 | 40.5 | 7.9 | 0.5 | 20.7 |
First, install the necessary dependencies listed in `requirements.txt` using pip:
pip install -r requirements.txt
In this step, you will need to configure two YAML files:
- `configs/tasks/<filename>.yaml`: used to set up your evaluation task.
- `configs/agents/<filename>.yaml`: used to specify your model's configuration.
In each YAML file, you will need to specify the following:
module: "module.path.to.class" # the class that will be used to instantiate your model or task, for example, "src.agents.DoNothingAgent"
parameters: # the parameters that will be passed to your model or task's constructor
key_1: "value_1"
key_2: "value_2"
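For reference, a config of this shape is typically resolved by importing the `module` path and passing the `parameters` block as keyword arguments to the constructor. The sketch below only illustrates that convention; it is not the toolkit's actual loading code, and the helper name `load_from_config` is hypothetical.

```python
import importlib

import yaml


def load_from_config(path):
    """Hypothetical helper: instantiate the class named in a YAML config."""
    with open(path) as f:
        config = yaml.safe_load(f)

    # "module" holds a dotted path ending in the class name.
    module_path, class_name = config["module"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)

    # The "parameters" block becomes keyword arguments of the constructor.
    return cls(**config.get("parameters", {}))


# agent = load_from_config("configs/agents/do_nothing.yaml")
```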
Place your data files according to the data paths specified in your `configs/tasks/<task_name>.yaml` file, and make sure they are in place so the program can access them.
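For example, assuming the `data/<task_name>/*.jsonl` convention suggested at the end of this guide, a layout could look like this (the data file name is illustrative):

```
configs/
  tasks/
    example.yaml
  agents/
    do_nothing.yaml
data/
  example/
    test.jsonl
```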
Now, you can run the `evaluate.py` script with the following command:
python evaluate.py --task configs/tasks/<your task yaml file> --agent configs/agents/<your model yaml file>
Replace `<your task yaml file>` and `<your model yaml file>` with your specific YAML files. This command will evaluate your model on the specified task and save the results to the `output/` directory.
For example, just try:
python evaluate.py \
    --task configs/tasks/example.yaml configs/tasks/singleround.yaml \
    --agent configs/agents/do_nothing.yaml
The evaluation and prediction results will be stored in the `output/` directory. Check this directory to view your model's performance.
First, you need to create a new task class that inherits from the `Task` class in `src/task.py`. You can create a file `src/tasks/own_task.py` and override the following methods:
class YourOwnTask(Task):
    def __init__(self, **config):  # Change the constructor parameters if necessary
        super().__init__()

    @property
    def metrics(self):  # Change the metrics if necessary
        return {"EM": lambda outputs, targets: len([1 for o, t in zip(outputs, targets) if o == t]) / min(len(outputs), len(targets))}

    def get_data(self):  # Change the data loading process
        raise NotImplementedError

    def predict_single(self, session, data_item):  # Change the prediction process
        raise NotImplementedError
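To make the overrides concrete, here is a minimal sketch of a single-round task that reads JSONL data and scores exact matches. It is an illustration under stated assumptions, not the toolkit's reference implementation: the import path, the record fields `input`/`target`, the `data_file` parameter, and the `session.action(...)` call are all hypothetical and should be adapted to the actual `Task` and session interfaces in `src/`.

```python
import json

from src.task import Task  # adjust the import to match the actual package layout


class ExampleJsonlTask(Task):
    """Illustrative sketch only; field names and the session call are assumptions."""

    def __init__(self, data_file="data/example/test.jsonl", **config):
        super().__init__()
        self.data_file = data_file

    @property
    def metrics(self):
        # Exact-match accuracy over paired outputs and targets.
        return {
            "EM": lambda outputs, targets: sum(o == t for o, t in zip(outputs, targets))
            / min(len(outputs), len(targets))
        }

    def get_data(self):
        # One evaluation item per JSONL line.
        with open(self.data_file) as f:
            return [json.loads(line) for line in f if line.strip()]

    def predict_single(self, session, data_item):
        # Hypothetical single-round interaction: send the prompt, return the reply.
        return session.action(data_item["input"])
```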
And then, add the following code to the `src/tasks/__init__.py` file:
from .own_task import YourOwnTask
Next, you need to create a new YAML file to configure your task. You can create a file `configs/tasks/own_task.yaml` to specify your task's configuration:
module: "src.tasks.YourOwnTask"
parameters: # the parameters in YourOwnTask's constructor
key: value
key2: value2
Now, you can run the `evaluate.py` script with the following command:
python evaluate.py --task configs/tasks/own_task.yaml --agent configs/agents/<your model yaml file>
- Suggested data path, if needed: `data/<task_name>/*.jsonl` (a sample record is sketched below).
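If you follow that convention, each line of such a file holds one JSON object per evaluation item. The field names below are purely illustrative and only need to match what your `get_data` and `predict_single` implementations expect (see the sketch above):

```python
import json

# One hypothetical JSONL line, e.g. from data/example/test.jsonl.
line = '{"input": "What is the capital of France?", "target": "Paris"}'
item = json.loads(line)
print(item["input"], "->", item["target"])
```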