See Quick Start for setting up your own evaluation.

STEPS: A Systematic Testing Proposal for Progressive Cognitive Abilities in Language Models

Large language models (LLMs) aligned with humans are reshaping AI research and applications, but a comprehensive and reliable evaluation of them remains a conundrum in academia. As an initial attempt, we present STEPS, a Systematic TEsting PropoSal tailored to chat-based LLMs' progressive cognitive abilities. Inspired by taxonomies in cognitive science, we categorize the identified abilities of LLMs into 5 progressive levels: Task Knowledge, Test-Taking, Grounding, Resourcefulness, and Decisiveness. Building on this design, we compile and create a series of novel tasks, settings, datasets, and environments, together with a scalable and easy-to-use toolkit for unified LLM evaluation. Our extensive testing of API-based and open-sourced chat LLMs reveals that, while the gaps between leading companies' models and their open-sourced competitors are tolerable at the preliminary levels (e.g., I & II), their performances are poles apart on the advanced challenges (e.g., IV & V). STEPS also demonstrates a significant gap between GPT-4 and all other models. We appeal to the community to join the effort to review and benchmark our current progress and limitations holistically.

STEPS Overall Result

| Model | Level I | Level II | Level III | Level IV | Level V | Avg. |
|---|---|---|---|---|---|---|
| gpt-4 | 60.2 | 57.2 | 52.5 | 61.2 | 48.6 | 55.9 |
| gpt-3.5-turbo | 48.9 | 48.3 | 48.8 | 56.8 | 27.2 | 46.0 |
| claude-v1.3 | 49.1 | 47.8 | 46.8 | 50.9 | 31.2 | 45.2 |
| claude-instant-v1.1 | 46.0 | 46.8 | 45.1 | 42.4 | 27.8 | 41.6 |
| text-davinci-003 | 46.5 | 42.5 | 46.8 | 38.6 | 23.4 | 39.5 |
| text-davinci-002 | 41.4 | 41.6 | 45.6 | 32.0 | 15.2 | 35.2 |
| text-bison-001 | 44.0 | / | 46.9 | 25.3 | 15.0 | 32.8 |
| chatglm-130b | 41.6 | 42.4 | 44.6 | 20.9 | 5.0 | 30.9 |
| chatglm-6b | 36.8 | 36.0 | 43.0 | 13.3 | 3.0 | 26.4 |
| vicuna-13b | 34.4 | 28.2 | 42.6 | 20.9 | 3.6 | 25.9 |
| bloomz-7b1-mt | 38.1 | 36.1 | 43.0 | 2.6 | 2.4 | 24.4 |
| vicuna-7b | 32.2 | 27.1 | 42.2 | 13.7 | 1.7 | 23.4 |
| dolly-v2-12b | 29.1 | 29.6 | 37.7 | 11.4 | 3.2 | 22.2 |
| bloomz-7b1 | 35.9 | 27.2 | 42.3 | 2.9 | 2.1 | 22.1 |
| koala-13b | 28.1 | 27.1 | 41.8 | 11.1 | 0.7 | 21.8 |
| moss-moon-003-sft | 26.5 | 26.2 | 40.9 | 9.8 | 0.1 | 20.7 |
| oasst-sft-4-pythia-12b | 26.2 | 28.3 | 40.5 | 7.9 | 0.5 | 20.7 |

Quick Start

1. Install dependencies

First, you need to install the necessary dependencies. These are listed in requirements.txt. To install them, use pip:

pip install -r requirements.txt

2. Configure YAML files

In this step, you will need to configure two YAML files:

  • configs/tasks/<filename>.yaml: this is used to set up your evaluation task.
  • configs/agents/<filename>.yaml: this is used to specify your model's configuration.

In each YAML file, you will need to specify the following:

module: "module.path.to.class" # the class that will be used to instantiate your model or task, for example, "src.agents.DoNothingAgent"
parameters: # the parameters that will be passed to your model or task's constructor
    key_1: "value_1"
    key_2: "value_2"
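
For example, an agent configuration for the DoNothingAgent mentioned above (and used in the quick-start command below) might look like the following; treating the constructor as parameter-free is an assumption, so check src/agents for the actual arguments:

module: "src.agents.DoNothingAgent"
parameters: {} # assumed: no constructor arguments beyond the defaults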

3. Place data files

You should place your data files according to the data paths specified in your configs/tasks/<task_name>.yaml file. Make sure the data is correctly placed so it can be accessed by the program.
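
As a purely illustrative example (the directory and file names are hypothetical and must match the paths declared in your task YAML), the layout might be:

data/
    example/
        test.jsonl # hypothetical file name; must match the path in your task YAML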

4. Run the evaluation script

Now, you can run the evaluate.py script with the following command:

python evaluate.py --task configs/tasks/<your task yaml file> --agent configs/agents/<your model yaml file>

Replace <your task yaml file> and <your model yaml file> with your specific YAML files.

This command will evaluate your model on your specified task, and the results will be saved in the output directory.

For example, just try:

python evaluate.py \
    --task \
        configs/tasks/example.yaml \
        configs/tasks/singleround.yaml \
    --agent configs/agents/do_nothing.yaml

5. Check the results

The evaluation and prediction results will be stored in the output/ directory. Check this directory to view your model's performance.

How to add your own tasks?

1. Create a new task class

First, you need to create a new task class. This class should inherit from the Task class in src/task.py.

Create a file src/tasks/own_task.py and override the following methods:

class YourOwnTask(Task):
    def __init__(self, **config): # Change the constructor parameters if necessary
        super().__init__()

    @property
    def metrics(self): # Change the metrics if necessary
        return {"EM": lambda outputs, targets: len([1 for o, t in zip(outputs, targets) if o == t]) / min(len(outputs), len(targets))}

    def get_data(self): # Change the data loading process
        raise NotImplementedError

    def predict_single(self, session, data_item): # Change the prediction process
        raise NotImplementedError

And then, add the following code to the src/tasks/__init__.py file:

from .own_task import YourOwnTask
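
Putting the pieces together, here is a minimal sketch of a complete task. The data_dir parameter, the JSONL field names, and the session.action(...) call are assumptions made for illustration; consult src/task.py and the bundled tasks in src/tasks/ for the actual Session interface and conventions.

# src/tasks/own_task.py -- illustrative sketch, not the reference implementation
import json
import os
from glob import glob

from src.task import Task


class YourOwnTask(Task):
    def __init__(self, data_dir="data/own_task", **config): # data_dir is a hypothetical parameter
        super().__init__()
        self.data_dir = data_dir

    @property
    def metrics(self):
        # Exact-match ratio over paired model outputs and gold targets (same metric as the skeleton above)
        return {"EM": lambda outputs, targets: len([1 for o, t in zip(outputs, targets) if o == t]) / min(len(outputs), len(targets))}

    def get_data(self):
        # Load every *.jsonl file under data_dir into a list of dicts
        data = []
        for path in sorted(glob(os.path.join(self.data_dir, "*.jsonl"))):
            with open(path, encoding="utf-8") as f:
                data.extend(json.loads(line) for line in f if line.strip())
        return data

    def predict_single(self, session, data_item):
        # Ask the agent for a reply through the session object; the method name
        # here is an assumption -- use whatever interface src/task.py exposes.
        return session.action(data_item["prompt"])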

2. Create a new task YAML file

Next, you need to create a new YAML file to configure your task. For example, create configs/tasks/own_task.yaml and specify your task's configuration:

module: "src.tasks.YourOwnTask"
parameters: # the parameters in YourOwnTask's constructor
    key: value
    key2: value2
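
Concretely, for the hypothetical sketch in step 1, this file might contain (data_dir is the illustrative constructor parameter assumed there):

module: "src.tasks.YourOwnTask"
parameters:
    data_dir: "data/own_task" # hypothetical parameter from the sketch above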

3. Run the evaluation script

Now, you can run the evaluate.py script with the following command:

python evaluate.py --task configs/tasks/own_task.yaml --agent configs/agents/<your model yaml file>

4. Suggestions

  • Suggested data path, if your task needs data files: data/<task_name>/*.jsonl (see the example line below).
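
For instance, following the sketch in step 1, each line of data/<task_name>/*.jsonl would hold one JSON object. The field names below are hypothetical and only need to match what your get_data and predict_single implementations read:

{"prompt": "What is the capital of France?", "target": "Paris"}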
