Commit 96e74e2

Add config files for models (#131)
* added the different configs through yaml files
* edited task examples path

Co-authored-by: Nathan Habib <[email protected]>
1 parent 0f5e257 commit 96e74e2

20 files changed (+169, -210 lines)

.gitignore

Lines changed: 0 additions & 10 deletions
@@ -164,29 +164,19 @@ tests/.data
 tests/data
 
 # outputs folder
-examples/*/outputs
-examples/*/NeMo_experiments
-examples/*/nemo_experiments
-examples/*/.hydra
-examples/*/wandb
-examples/*/data
 wandb
 dump.py
 
 docs/sources/source/test_build/
 
 # Checkpoints, config files and temporary files created in tutorials.
-examples/neural_graphs/*.chkpt
-examples/neural_graphs/*.yml
-
 .hydra/
 nemo_experiments/
 
 .ruff_cache
 
 tmp.py
 
-examples
 benchmark_output
 prod_env

README.md

Lines changed: 21 additions & 13 deletions
@@ -10,9 +10,6 @@ We're releasing it with the community in the spirit of building in the open.
 Note that it is still very much early so don't expect 100% stability ^^'
 In case of problems or question, feel free to open an issue!
 
-## News
-- **Feb 08, 2024**: Release of `lighteval`
-
 ## Installation
 
 Clone the repo:
@@ -98,7 +95,7 @@ Here, `--tasks` refers to either a _comma-separated_ list of supported tasks fro
 suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
 ```
 
-or a file path like [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt) which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark run:
+or a file path like [`examples/tasks/recommended_set.txt`](./examples/tasks/recommended_set.txt) which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark run:
 
 ```shell
 accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
@@ -118,7 +115,20 @@ accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
     --output_dir="./evals/"
 ```
 
-See the [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt) file for a list of recommended task configurations.
+See the [`examples/tasks/recommended_set.txt`](./examples/tasks/recommended_set.txt) file for a list of recommended task configurations.
+
+### Evaluating a model with a complex configuration
+
+If you want to evaluate a model by spinning up inference endpoints, use adapter/delta weights, or rely on more complex configuration options, you can load models using a configuration file. This is done as follows:
+
+```shell
+accelerate launch --multi_gpu --num_processes=<num_gpus> run_evals_accelerate.py \
+    --model_config_path="<path to your model configuration>" \
+    --tasks <task parameters> \
+    --output_dir output_dir
+```
+
+Examples of possible configuration files are provided in `examples/model_configs`.
 
 ### Evaluating a large model with pipeline parallelism
 
@@ -127,15 +137,13 @@ To evaluate models larger that ~40B parameters in 16-bit precision, you will nee
 ```shell
 # PP=2, DP=4 - good for models < 70B params
 accelerate launch --multi_gpu --num_processes=4 run_evals_accelerate.py \
-    --model_args="pretrained=<path to model on the hub>" \
-    --model_parallel \
+    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
     --tasks <task parameters> \
     --output_dir output_dir
 
 # PP=4, DP=2 - good for huge models >= 70B params
 accelerate launch --multi_gpu --num_processes=2 run_evals_accelerate.py \
-    --model_args="pretrained=<path to model on the hub>" \
-    --model_parallel \
+    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
     --tasks <task parameters> \
     --output_dir output_dir
 ```
@@ -147,7 +155,7 @@ To evaluate a model on all the benchmarks of the [Open LLM Leaderboard](https://
 ```shell
 accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
     --model_args "pretrained=<model name>" \
-    --tasks tasks_examples/open_llm_leaderboard_tasks.txt \
+    --tasks examples/tasks/open_llm_leaderboard_tasks.txt \
     --override_batch_size 1 \
     --output_dir="./evals/"
 ```
@@ -220,7 +228,7 @@ However, we are very grateful to the Harness and HELM teams for their continued
 - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are described in metrics, and divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find available normalisation functions.
 - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
 - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`. Popular tasks requiring custom logic are exceptionally added in the [extended tasks](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended).
-- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
+- [examples/tasks](https://github.com/huggingface/lighteval/tree/main/examples/tasks) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
 - [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, that we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.
 
 ## Customisation
@@ -291,7 +299,7 @@ if __name__ == "__main__":
 
 You can then give your custom metric to lighteval by using `--custom-tasks path_to_your_file` when launching it.
 
-To see an example of a custom metric added along with a custom task, look at `tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.
+To see an example of a custom metric added along with a custom task, look at `examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.
 
 ## Available metrics
 ### Metrics for multiple choice tasks
@@ -414,7 +422,7 @@ source <path_to_your_venv>/activate #or conda activate yourenv
 cd <path_to_your_lighteval>/lighteval
 
 export CUDA_LAUNCH_BLOCKING=1
-srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir
+srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks examples/tasks/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir
 ```
 
 ## Releases
New file under examples/model_configs/ (base model configuration)

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+model:
+  type: "base" # can be base, tgi, or endpoint
+  base_params:
+    model_args: "pretrained=HuggingFaceH4/zephyr-7b-beta,revision=main" # pretrained=model_name,trust_remote_code=boolean,revision=revision_to_use,model_parallel=True ...
+    dtype: "bfloat16"
+  merged_weights: # Ignore this section if you are not using PEFT models
+    delta_weights: false # set to True if your model should be merged with a base model, also need to provide the base model name
+    adapter_weights: false # set to True if your model has been trained with peft, also need to provide the base model name
+    base_model: null # path to the base_model
+  generation:
+    multichoice_continuations_start_space: false # Whether to force multiple choice continuations to start with a space
+    no_multichoice_continuations_start_space: false # Whether to force multiple choice continuations to not start with a space
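The base-model config above can be passed to `run_evals_accelerate.py` through the new `--model_config_path` flag. A minimal sketch of such a launch, assuming the file is saved as `examples/model_configs/base_model.yaml` (the exact file name is not shown on this page) and using a Truthful QA task string as an illustrative example:

```shell
# Sketch: evaluate the zephyr-7b-beta config above on Truthful QA, zero-shot.
# The config path and task string are illustrative assumptions.
accelerate launch --num_processes=1 run_evals_accelerate.py \
    --model_config_path="examples/model_configs/base_model.yaml" \
    --tasks "lighteval|truthfulqa:mc|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"
```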
New file under examples/model_configs/ (Inference Endpoint configuration)

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+model:
+  type: "endpoint" # can be base, tgi, or endpoint
+  base_params:
+    endpoint_name: "llama-2-7B-lighteval" # needs to be lower case without special characters
+    model: "meta-llama/Llama-2-7b-hf"
+    revision: "main"
+    dtype: "float16" # can be any of "awq", "eetq", "gptq", "4bit" or "8bit" (will use bitsandbytes), "bfloat16" or "float16"
+    reuse_existing: false # if true, ignore all params in instance
+  instance:
+    accelerator: "gpu"
+    region: "eu-west-1"
+    vendor: "aws"
+    instance_size: "medium"
+    instance_type: "g5.2xlarge"
+    framework: "pytorch"
+    endpoint_type: "protected"
+  generation:
+    add_special_tokens: true
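With the endpoint config above, lighteval provisions (or, if `reuse_existing` is true, reuses) an Inference Endpoint described by the `instance` block before running the evaluation. A rough sketch of a run, assuming the file lives under `examples/model_configs/` (the exact file name below is an assumption):

```shell
# Creating a protected Inference Endpoint requires being logged in to the
# Hugging Face Hub with a token that has the appropriate permissions.
huggingface-cli login

# Sketch only: the config path is an assumption for illustration.
accelerate launch --num_processes=1 run_evals_accelerate.py \
    --model_config_path="examples/model_configs/endpoint_model.yaml" \
    --tasks <task parameters> \
    --output_dir output_dir
```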
New file under examples/model_configs/ (TGI configuration)

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+model:
+  type: "tgi" # can be base, tgi, or endpoint
+  instance:
+    inference_server_address: ""
+    inference_server_auth: null
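The TGI config above expects a text-generation-inference server that is already running and reachable at `inference_server_address`. A hedged sketch of the full flow; the Docker invocation follows the TGI quickstart, and the image tag, model, port, and config path are illustrative assumptions:

```shell
# 1. Start a local TGI server (adjust the image tag, model and port to your setup).
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/tgi-data:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceH4/zephyr-7b-beta

# 2. Set inference_server_address: "http://localhost:8080" in the config,
#    then launch the evaluation (the config path is an assumption).
accelerate launch --num_processes=1 run_evals_accelerate.py \
    --model_config_path="examples/model_configs/tgi_model.yaml" \
    --tasks <task parameters> \
    --output_dir output_dir
```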
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
