
Mistral-Pro evaluation harness

A fork of the EleutherAI LM Evaluation Harness, used to evaluate Mistral-Pro-8B-v0.1.

Environment Setup

Run the following command to set up the environment:

pip install -r requirements.txt

Running the Evaluation

See scripts/leaderboard.sh for an entry point for running evaluation on a model from the Hugging Face Hub.

The script can be invoked with the following command:

bash scripts/leaderboard.sh YOUR_MODEL_PATH YOUR_MODEL_NAME
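
For example, with a model from the Hub (both arguments below are placeholders; the first is any Hub model id, the second presumably labels the run):

bash scripts/leaderboard.sh mistralai/Mistral-7B-v0.1 mistral-7b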

Refer to the lm_eval/tasks directory for the implementations of the individual tasks.

Tasks Supported

Below, we detail all evaluation benchmarks used in the Open-LLM-Leaderboard; an example of running a single task directly follows the list.

  • ARC: 25-shot, arc-challenge (acc_norm)
  • HellaSwag: 10-shot, hellaswag (acc_norm)
  • TruthfulQA: 0-shot, truthfulqa-mc (mc2)
  • MMLU: 5-shot (average acc across all subtasks)
  • Winogrande: 5-shot, winogrande (acc)
  • GSM8k: 5-shot, gsm8k (acc)
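
A single task can also be run on its own. Assuming this fork keeps the pre-0.4 harness entrypoint (a main.py script; flag names may differ in your checkout), an ARC run looks roughly like:

python main.py --model hf-causal --model_args pretrained=YOUR_MODEL_PATH --tasks arc_challenge --num_fewshot 25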

Main Difference from the Open-LLM-Leaderboard

The main difference from the version used by the Open-LLM-Leaderboard is the truncation word in GSM8K. The leaderboard version truncates the model output at the token ":" (the red part in the figure below), while this repo truncates at the words shown in the green part, which are taken from the latest version of lm-evaluation-harness.

[Figure: truncation word comparison]

Below is an example where truncating at ":" cuts the output off too early; this is why we adopt the truncation words from the latest version of lm-evaluation-harness.

[Figure: example of premature truncation]
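
To make the behavior concrete, here is a minimal sketch of stop-word truncation (not the harness's actual code; the stop sequence "Question:" below merely stands in for the newer harness's truncation words, which are not reproduced here):

def truncate_at_stop(generation, stop_sequences):
    # Cut the generation at the earliest occurrence of any stop sequence.
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

generation = "He buys 2 cans at $3 each: 2 * 3 = 6. The answer is 6\n\nQuestion: ..."
print(truncate_at_stop(generation, [":"]))
# -> "He buys 2 cans at $3 each" -- truncated mid-reasoning, the answer is lost
print(truncate_at_stop(generation, ["Question:"]))
# -> the full reasoning chain and the answer are kept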
