# Evaluate Your Own Tasks

## YAML file for tasks

We use YAML files to define tasks. This allows us to evaluate multiple tasks in a single run and configure them independently. Specifically, you can pass multiple task files or folders at a time for evaluation, and the script will automatically collect all YAML files under those folders recursively.

```bash
# Single node
bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...
# Multi node
bash scripts/evaluate_multiple_node.sh task1.yaml task2.yaml dir1 dir2 ...
```
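
The folder expansion behaves roughly like the following minimal sketch (`collect_task_configs` is a hypothetical helper for illustration, not the actual function in the scripts):

```python
import glob
import os


def collect_task_configs(paths):
    """Hypothetical sketch: expand task files and folders into a flat list of YAML configs."""
    configs = []
    for path in paths:
        if os.path.isdir(path):
            # Recursively collect every YAML file under the folder
            configs.extend(glob.glob(os.path.join(path, "**", "*.yaml"), recursive=True))
        else:
            configs.append(path)
    return sorted(configs)


# e.g. collect_task_configs(["task1.yaml", "dir1"]) -> ["dir1/sub/task2.yaml", "task1.yaml", ...]
```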

We support two types of evaluation tasks: multi-choice and generation. The YAML config options for both task types are defined in `evaluation/configs.py`. All task types share a set of common configs that define the task information:

```yaml
name: 'glue_cola'  # Task name
type: 'mul'  # Task type, 'gen' (generate) or 'mul' (multiple choice)
path: 'bloom/glue_cola'  # Task data path relative to DATA_PATH in 'evaluate.sh'
use_task_mask: False  # Whether to use [gMASK] for evaluation
unidirectional: False  # Whether to use unidirectional attention
max_seq_length: 2048  # Max sequence length
file-pattern:  # Organize jsonl files into groups
  validation: "**/validation.jsonl"  # Searches for all files named 'validation.jsonl' under `DATA_PATH/bloom/glue_cola` using glob.glob()
micro-batch-size: 30  # 'gen' tasks only support mbs = 1 for now
```

See `evaluation/configs.py` for the configuration details specific to multi-choice and generation tasks.
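
For illustration only, the common fields above map onto something like the following Python structure. This is a simplified sketch, not the actual definitions in `evaluation/configs.py` (note that hyphenated YAML keys such as `file-pattern` become underscored attribute names):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskConfigSketch:
    """Simplified illustration of the shared task options shown in the YAML example above."""
    name: str                        # Task name, e.g. 'glue_cola'
    type: str                        # 'gen' (generate) or 'mul' (multiple choice)
    path: str                        # Task data path relative to DATA_PATH
    use_task_mask: bool = False      # Whether to use [gMASK] for evaluation
    unidirectional: bool = False     # Whether to use unidirectional attention
    max_seq_length: int = 2048       # Max sequence length
    file_pattern: Dict[str, str] = field(default_factory=dict)  # Group name -> glob pattern
    micro_batch_size: int = 1        # 'gen' tasks only support 1 for now
    metric: Optional[List[str]] = None  # Optional metric override, see "Implement Your Metrics"
```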

## Data format for tasks

We recommend organizing the task data in the following structure and setting up two groups named "validation" and "test" in the `file-pattern` config, so that it becomes easy to evaluate different prompts on the validation and test sets independently; a sketch of how these patterns resolve follows the tree below.

```
DATA_PATH
└── task_name
    ├── prompt_1
    │   ├── test.jsonl
    │   └── val.jsonl
    ├── prompt_2
    │   ├── test.jsonl
    │   └── val.jsonl
    └── prompt_3
        ├── test.jsonl
        └── val.jsonl
```
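
With this layout, each group in `file-pattern` is expanded with `glob.glob()` and `**` matches every prompt folder. A rough illustration, assuming the path placeholder and file names from the tree above:

```python
import glob
import os

DATA_PATH = "/path/to/data"  # Hypothetical; use the DATA_PATH set in 'evaluate.sh'
task_dir = os.path.join(DATA_PATH, "task_name")

# One group per split, mirroring the recommended "validation" and "test" groups
groups = {
    "validation": "**/val.jsonl",
    "test": "**/test.jsonl",
}
for group, pattern in groups.items():
    files = glob.glob(os.path.join(task_dir, pattern), recursive=True)
    # e.g. validation -> [.../prompt_1/val.jsonl, .../prompt_2/val.jsonl, .../prompt_3/val.jsonl]
    print(group, sorted(files))
```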

The evaluation data for each prompt are organized in jsonl (JSON Lines) format. For multi-choice tasks, each line of JSON should be

```
{
    "inputs_pretokenized": "Context and question here",
    "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"],
    "label": int
}
```

The default metric for the multi-choice task is Accuracy.

For generation tasks, each line of JSON should be

```
{
    "inputs_pretokenized": "Context and question here",
    "targets_pretokenized": ["Target 1", "Target 2", "Target 3"],
    "label": int
}
```

The default metrics for generation tasks are EM (Exact Match) and F1. Given the inputs, the sequence generated by the model is scored against each of the targets separately, and the highest score is taken.
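
In other words, every reference target is scored independently and only the best score is kept. A minimal sketch of that aggregation, with a placeholder scoring function standing in for EM or F1:

```python
def best_score(prediction: str, targets: list, score_fn) -> float:
    """Score the model output against every reference target and keep the maximum."""
    return max(score_fn(prediction, target) for target in targets)


# Exact Match is the simplest score_fn; F1 would measure token overlap instead.
def exact_match(prediction: str, target: str) -> float:
    return float(prediction.strip() == target.strip())


# best_score("Paris", ["Paris", "the city of Paris"], exact_match) -> 1.0
```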

## Implement Your Metrics

You can implement a custom evaluation metric function and add it to `DEFAULT_METRICS` in `evaluation/metrics.py`, and then specify `metric: ['Your metric name']` in the task YAML file.
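
A hedged sketch of what that might look like; the exact signature expected by the framework should be checked against the built-in metrics in `evaluation/metrics.py`:

```python
# Hypothetical example; mirror the signature of the existing metrics in evaluation/metrics.py.
def length_match(predictions, examples):
    """Toy metric: fraction of predictions whose length matches the labeled target's length."""
    correct = sum(
        len(str(pred)) == len(str(example["targets_pretokenized"][example["label"]]))
        for pred, example in zip(predictions, examples)
    )
    return correct / max(len(predictions), 1)


# In evaluation/metrics.py:
#     DEFAULT_METRICS["Length-Match"] = length_match
# In the task YAML file:
#     metric: ['Length-Match']
```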

## Fully customize the evaluation process

By default, we implement classes named `MultiChoiceTask` and `GenerationTask` in `evaluation/tasks.py` for multi-choice tasks and generation tasks, respectively.

You can implement a new task class that inherits from one of these two classes and override the `process_single_batch` function to define how a batch of inputs is processed and how the predictions are obtained. Following Big-Bench, we implement two methods you can use in your evaluation:

- `model.cond_log_prob()`: Compute the probabilities of provided model outputs for given inputs.
- `model.generate_text()`: Generate text for given inputs.

Once you have created the new task class, specify the relative import path of the class in the `module` field of the task YAML file. See `tasks/lambada/tasks.py` and `tasks/lambada/lambada.yaml` for an example of how we customize the beam-search generation strategy for the LAMBADA task and configure its YAML file.
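
Putting this together, a customized generation task might look roughly like the sketch below. Class, file, and import names are illustrative assumptions; check `tasks/lambada/tasks.py` for the real pattern, including how the model and the batch structure are actually exposed.

```python
# my_tasks.py -- hypothetical module referenced via the `module` field in the task YAML.
from evaluation.tasks import GenerationTask  # assumed import path based on evaluation/tasks.py


class MyGenerationTask(GenerationTask):
    """Illustrative subclass that overrides how a batch is turned into predictions."""

    def process_single_batch(self, batch):
        # Generate text for the given inputs (one of the two Big-Bench-style methods above).
        # The exact arguments and return format depend on the framework; treat this as a sketch.
        return self.model.generate_text(batch)
```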