
Commit dd7c8e1

Supports extended tasks (#101)
* init - now gives the path with an arg, maybe will remove
* allows several custom task modules to be loaded
* fix quality

Co-authored-by: Nathan Habib <[email protected]>
Co-authored-by: Nathan Habib <[email protected]>
1 parent d4ea256 commit dd7c8e1

File tree

12 files changed: +88 -30 lines changed


README.md

Lines changed: 10 additions & 3 deletions
```diff
@@ -210,9 +210,13 @@ However, we are very grateful to the Harness and HELM teams for their continued
 If your new task or metric has requirements, add a specific `requirements.txt` file with your evaluation.
 
 ### Adding a new task
-To add a new task, first either open an issue, to determine whether it will be integrated in the core evaluations of lighteval, or in the community tasks, and **add its dataset** on the hub.
-Note: Core evaluations are evals we will add to our test suite to ensure non regression through time, and which already see a high usage in the community.
-A popular community evaluation can move to become a core evaluation through time.
+To add a new task, first open an issue to determine whether it should be integrated into the core evaluations of lighteval, the extended tasks, or the community tasks, and **add its dataset** on the hub.
+
+- Core evaluations only require standard logic in their metrics and processing; we add them to our test suite to ensure non-regression over time. They already see high usage in the community.
+- Extended evaluations require custom logic in their metrics (complex normalisation, an LLM as a judge, ...); we added them to make users' lives easier. They also see high usage in the community.
+- Community evaluations are new tasks submitted by the community.
+
+A popular community evaluation can become an extended or core evaluation over time.
 
 #### Core evaluations
 Prompt function: **find a suitable prompt function** in `src.lighteval.tasks.task_prompt_formatting.py`, or code your own. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few shot setup, add it to an `instruction` field.
```
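The `Doc` contract described above is easiest to see in code. Below is a minimal sketch, not taken from the commit: the dataset columns (`question`, `choices`, `answer`), the function name, its signature, and the exact `Doc` import path are assumptions for illustration.

```python
# A hypothetical prompt function for a multiple-choice dataset; every field
# name on `line` is a placeholder, and the import path may vary by version.
from lighteval.tasks.requests import Doc


def yournewtask_prompt(line, task_name: str = None):
    return Doc(
        query=f"Question: {line['question']}\nAnswer:",  # the prompt shown to the model
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=line["answer"],  # index (or indices) of the correct choice(s)
        instruction="",  # instruction that should not be repeated in a few-shot setup
    )
```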
```diff
@@ -241,6 +245,9 @@ Summary: create a **line summary** of your evaluation, in `src/lighteval/tasks/t
 
 Make sure you can launch your model with your new task using `--tasks lighteval|yournewtask|2|0`.
 
+#### Extended evaluations
+Proceed as for community evaluations, but in the `extended_tasks` folder.
+
 #### Community evaluations
 Copy the `community_tasks/_template.yml` to `community_tasks/yourevalname.py` and edit it to add your custom tasks (the parameters you can use are explained above). It contains an interesting mechanism if the dataset you are adding contains a lot of subsets.
```
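For context, a community (or extended) task module covering many subsets might look like the sketch below. Everything here is a placeholder, not the template's literal content: the `TASKS_TABLE` name, the repo, subset names, and metric are assumptions, and depending on the lighteval version `LightevalTaskConfig` may require further fields such as splits or generation size.

```python
# Hypothetical sketch of community_tasks/yourevalname.py, using only the
# LightevalTaskConfig fields visible in the ifeval diff below.
from lighteval.tasks.lighteval_task import LightevalTaskConfig

SUBSETS = ["subset_a", "subset_b"]  # placeholder subset names

TASKS_TABLE = [
    LightevalTaskConfig(
        name=f"yourevalname:{subset}",
        prompt_function="yournewtask_prompt",  # defined in the same module
        suite=["community"],
        hf_repo="your-org/your-dataset",  # your dataset on the hub
        hf_subset=subset,
        metric=["exact_match"],  # pick metrics suited to your task
    )
    for subset in SUBSETS
]
```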

Lines changed: 1 addition & 1 deletion
```diff
@@ -23,7 +23,7 @@
 
 import langdetect
 
-import tasks_examples.custom_tasks_with_custom_metrics.ifeval.instructions_utils as instructions_util
+import extended_tasks.ifeval.instructions_utils as instructions_util
 
 
 logger = logging.getLogger(__name__)
```
Lines changed: 1 addition & 1 deletion
```diff
@@ -13,7 +13,7 @@
 # limitations under the License.
 
 """Registry of all instructions."""
-import tasks_examples.custom_tasks_with_custom_metrics.ifeval.instructions as instructions
+import extended_tasks.ifeval.instructions as instructions
 
 
 _KEYWORD = "keywords:"
```
File renamed without changes.
Lines changed: 2 additions & 2 deletions
```diff
@@ -23,7 +23,7 @@
 import numpy as np
 from aenum import extend_enum
 
-import tasks_examples.custom_tasks_with_custom_metrics.ifeval.instructions_registry as instructions_registry
+import extended_tasks.ifeval.instructions_registry as instructions_registry
 from lighteval.metrics import Metrics
 from lighteval.metrics.utils import (
     MetricCategory,
@@ -38,7 +38,7 @@
 ifeval = LightevalTaskConfig(
     name="ifeval",
     prompt_function="ifeval_prompt",
-    suite=["custom"],
+    suite=["extended"],
     hf_repo="wis-k/instruction-following-eval",
     hf_subset="default",
     metric=["ifeval_metric"],
```
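The `extend_enum` import above hints at how the task registers its custom `ifeval_metric` on lighteval's `Metrics` enum. Here is a self-contained demonstration of that mechanism, using a stand-in enum rather than lighteval's real one; the registered value is a placeholder for the task's actual metric definition.

```python
# Stand-in demonstration of aenum.extend_enum; `Metrics` here is a dummy
# enum, not lighteval's, and "ifeval_metric" is just the registered name.
from enum import Enum

from aenum import extend_enum


class Metrics(Enum):
    accuracy = "accuracy"


# Register a new member at runtime, as an extended task would for its metric.
extend_enum(Metrics, "ifeval_metric", "ifeval_metric")
print(Metrics.ifeval_metric)  # -> Metrics.ifeval_metric
```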

pyproject.toml

Lines changed: 3 additions & 2 deletions
```diff
@@ -78,7 +78,6 @@ dependencies = [
 accelerate = ["accelerate"]
 tgi = ["text-generation==0.6.0"]
 optimum = ["optimum==1.12.0"]
-# Quantization and adapter weights
 quantization = ["bitsandbytes>=0.41.0", "auto-gptq>=0.4.2"]
 adapters = ["peft==0.3.0"]
 nanotron = [
@@ -88,7 +87,9 @@ nanotron = [
 quality = ["ruff==v0.2.2","pre-commit"]
 tests = ["pytest==7.4.0"]
 dev = ["lighteval[accelerate,quality,tests]"]
-
+extended_tasks = [
+    "langdetect", # ifeval
+]
 
 [project.urls]
 Homepage = "https://github.com/huggingface/lighteval"
```

run_evals_accelerate.py

Lines changed: 6 additions & 0 deletions
```diff
@@ -103,6 +103,12 @@ def get_parser():
         default=None,
         help="Path to a file with custom tasks (a TASK list of dict and potentially prompt formating functions)",
     )
+    parser.add_argument(
+        "--extended_tasks",
+        type=str,
+        default=None,
+        help="Path to the folder which contains all extended tasks",
+    )
     group.add_argument(
         "--tasks",
         type=str,
```
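A quick standalone check of the new flag; this only mirrors the added argument, with the rest of `get_parser()` elided.

```python
# Reproduction of just the new argument from the diff above; the real
# parser defines many more options around it.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--extended_tasks",
    type=str,
    default=None,
    help="Path to the folder which contains all extended tasks",
)

args = parser.parse_args(["--extended_tasks", "./extended_tasks"])
assert args.extended_tasks == "./extended_tasks"
```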

src/lighteval/main_accelerate.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -81,7 +81,7 @@ def main(args):
     with accelerator.main_process_first() if accelerator is not None else nullcontext():
         task_names_list, few_shots_dict = taskinfo_selector(args.tasks)
         task_dict = Registry(cache_dir=env_config.cache_dir).get_task_dict(
-            task_names_list, custom_tasks=args.custom_tasks
+            task_names_list, custom_tasks=args.custom_tasks, extended_tasks=args.extended_tasks
         )
         LightevalTask.load_datasets(task_dict.values(), args.dataset_loading_processes)
 
```

src/lighteval/main_nanotron.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -135,7 +135,8 @@ def main(
 
     task_names_list, few_shots_dict = taskinfo_selector(tasks_selection)
     task_dict = Registry(cache_dir=cache_dir).get_task_dict(
-        task_names_list, custom_tasks=lighteval_config.tasks.custom_tasks
+        task_names_list,
+        custom_tasks=lighteval_config.tasks.custom_tasks,
     )
     # Loading all the dataset in a distributed manner
     LightevalTask.load_datasets(task_dict.values(), lighteval_config.tasks.dataset_loading_processes)
```

src/lighteval/tasks/lighteval_task.py

Lines changed: 22 additions & 10 deletions
```diff
@@ -145,7 +145,9 @@ def __post_init__(self):
 
 
 class LightevalTask:
-    def __init__(self, name: str, cfg: LightevalTaskConfig, cache_dir: Optional[str] = None, custom_tasks_module=None):
+    def __init__(  # noqa: C901
+        self, name: str, cfg: LightevalTaskConfig, cache_dir: Optional[str] = None, custom_tasks_module: list = None
+    ):
         """
         Initialize a LightEval task.
 
@@ -202,16 +204,26 @@ def __init__(self, name: str, cfg: LightevalTaskConfig, cache_dir: Optional[str]
         # to use once prompt formatting is managed as a module
         if custom_tasks_module is None:
             self.formatter = getattr(tasks_prompt_formatting, cfg.prompt_function)
-        elif hasattr(custom_tasks_module, cfg.prompt_function):
-            # If we have a prompt in both the custom_tasks_module and our tasks_prompt_formatting
-            # We take the prompt from the custom_tasks_module
-            if hasattr(tasks_prompt_formatting, cfg.prompt_function):
-                hlog_warn(
-                    f"Be careful you are using custom prompt function {cfg.prompt_function} and not the default one."
-                )
-            self.formatter = getattr(custom_tasks_module, cfg.prompt_function)
         else:
-            self.formatter = getattr(tasks_prompt_formatting, cfg.prompt_function)
+            formatter = []
+            for module in custom_tasks_module:
+                if hasattr(module, cfg.prompt_function):
+                    formatter.append(getattr(module, cfg.prompt_function))
+
+            if len(formatter) == 0:  # Default version
+                self.formatter = getattr(tasks_prompt_formatting, cfg.prompt_function)
+            elif len(formatter) == 1:
+                # If we have a prompt in both a custom module and our tasks_prompt_formatting
+                # We take the prompt from the custom module
+                if hasattr(tasks_prompt_formatting, cfg.prompt_function):
+                    hlog_warn(
+                        f"Be careful you are using custom prompt function {cfg.prompt_function} and not the default one."
+                    )
+                self.formatter = formatter[0]
+            else:
+                raise Exception(
+                    f"You defined the prompt function {cfg.prompt_function} several times in the different custom modules you are loading."
+                )
         self.generation_size = cfg.generation_size
         self.stop_sequence = cfg.stop_sequence
         self.output_regex = cfg.output_regex
```

Note: the line assigning the single matching formatter uses `formatter[0]` here rather than `getattr(module, ...)`, since `module` is the loop variable and would point at the last loaded module, not necessarily the one that defined the function.
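The resolution rule introduced here (custom modules override the default formatting; defining the function in several custom modules is an error) can be isolated in a short, runnable sketch. The helper name and the stand-in modules are illustrative, not lighteval API.

```python
# Runnable sketch of the multi-module prompt-function resolution above,
# using SimpleNamespace objects as stand-ins for imported task modules.
from types import SimpleNamespace


def resolve_formatter(prompt_function: str, default_module, custom_modules: list):
    """Custom modules win over the default; duplicates across modules raise."""
    matches = [getattr(m, prompt_function) for m in custom_modules if hasattr(m, prompt_function)]
    if not matches:  # fall back to the built-in prompt formatting
        return getattr(default_module, prompt_function)
    if len(matches) > 1:
        raise Exception(f"You defined the prompt function {prompt_function} several times.")
    return matches[0]


default = SimpleNamespace(my_prompt=lambda line: f"default: {line}")
custom = SimpleNamespace(my_prompt=lambda line: f"custom: {line}")
other = SimpleNamespace()  # defines nothing

assert resolve_formatter("my_prompt", default, [other])("x") == "default: x"
assert resolve_formatter("my_prompt", default, [custom, other])("x") == "custom: x"
```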
