Description
Evaluating Llama-3-8B on DROP with the standard 3-shot configuration (the setup reported for Llama 3) throws a warning indicating that the input size is greater than the maximum context size allowed by the model:
The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
Here is the command I use:
accelerate launch --num_processes=1 run_evals_accelerate.py \
--model_args "pretrained=meta-llama/Meta-Llama-3-8B" \
--tasks "lighteval|drop|3|0" \
--override_batch_size 16 \
--output_dir "./log/"
I am able to reproduce this even after progressively reducing the batch size to 1.
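To verify that the prompts themselves exceed the context window (rather than this being a batching artifact), one can tokenize a built 3-shot prompt directly. A minimal sketch, assuming access to the gated Llama-3 tokenizer; the prompt string is a placeholder to be replaced with an actual lighteval|drop|3 prompt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Placeholder: paste one of the 3-shot DROP prompts built by lighteval here.
prompt = "..."

n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens vs. a model maximum of 8192")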
Log:
WARNING:lighteval.logging.hierarchical_logger: Model info: ModelInfo(model_name='meta-llama/Meta-Llama-3-8B', model_sha='561487d18c41c76bcb5fc6cfb73a324982f04f47', model_dtype='torch.bfloat16', model_size='15.08 GB')
WARNING:lighteval.logging.hierarchical_logger: } [0:00:15.762582]
WARNING:lighteval.logging.hierarchical_logger: Tasks loading {
WARNING:lighteval.logging.hierarchical_logger: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger: lighteval/drop_harness default
WARNING:lighteval.logging.hierarchical_logger: Loading documents, and requests
WARNING:lighteval.logging.hierarchical_logger: } [0:00:34.926653]
WARNING:lighteval.logging.hierarchical_logger: Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger: setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.000371]
WARNING:lighteval.logging.hierarchical_logger: Evaluation {
WARNING:lighteval.logging.hierarchical_logger: Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger: Running RequestType.GREEDY_UNTIL requests
Splits: 0%| | 0/4 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
The process then either hangs indefinitely until it is killed manually, or crashes as follows:
Note: the following traceback occurred even after reducing the batch size to 1 with --override_batch_size.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9262) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9192) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9538) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
Splits: 0%| | 0/4 [00:40<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:01:02.769634]
WARNING:lighteval.logging.hierarchical_logger:} [0:01:49.892104]
Traceback (most recent call last):
File "/home/vipul.raheja/lighteval/run_evals_accelerate.py", line 82, in <module>
main(args)
File "/home/vipul.raheja/lighteval/src/lighteval/logging/hierarchical_logger.py", line 166, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/main_accelerate.py", line 111, in main
evaluation_tracker = evaluate(
^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/evaluator.py", line 86, in evaluate
full_resps = lm.greedy_until(requests, override_bs=override_bs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/models/base_model.py", line 570, in greedy_until
max_new_tokens = min(self.max_length - biggest_context, max_new_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
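The last line of the traceback suggests what goes wrong: when the context is longer than self.max_length, the subtraction produces a negative generation budget. Re-running the arithmetic with the values from the log (256 is an assumed default for max_new_tokens, not a value taken from lighteval):

max_length = 8192        # maximum context size allowed by the model
biggest_context = 10010  # context size reported in the first warning
max_new_tokens = min(max_length - biggest_context, 256)  # assumed 256-token budget
print(max_new_tokens)    # -1818: a negative number of new tokens to generate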
Running the same evaluation directly in lm-evaluation-harness does not throw any such warning and proceeds at a reasonable speed:
~/lm-evaluation-harness$ lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks drop --device cuda:0 --batch_size 16
2024-04-21:20:19:29,714 INFO [__main__.py:251] Verbosity set to INFO
2024-04-21:20:19:33,062 INFO [__main__.py:335] Selected Tasks: ['drop']
2024-04-21:20:19:33,063 INFO [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-04-21:20:19:33,064 INFO [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3-8B'}
2024-04-21:20:19:33,164 INFO [huggingface.py:164] Using device 'cuda:0'
Loading checkpoint shards: 100%|█████████████████████████████| 4/4 [00:06<00:00, 1.62s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading builder script: 100%|█████████████████████████████| 7.46k/7.46k [00:00<00:00, 35.8MB/s]
Downloading readme: 100%|█████████████████████████████| 26.0/26.0 [00:00<00:00, 384kB/s]
Downloading data: 100%|█████████████████████████████| 8.31M/8.31M [00:00<00:00, 8.66MB/s]
Generating train split: 77409 examples [00:05, 13452.43 examples/s]
Generating validation split: 9536 examples [00:00, 11649.32 examples/s]
Map: 100%|█████████████████████████████| 77409/77409 [00:10<00:00, 7060.41 examples/s]
Map: 100%|█████████████████████████████| 9536/9536 [00:01<00:00, 4788.74 examples/s]
2024-04-21:20:20:11,675 INFO [task.py:395] Building contexts for drop on rank 0...
100%|█████████████████████████████| 9536/9536 [00:03<00:00, 2793.13it/s]
2024-04-21:20:20:16,260 INFO [evaluator.py:379] Running generate_until requests
Running generate_until requests: 9%|█████▊ | 833/9536 [07:44<1:06:05, 2.19it/s]
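A plausible explanation for the difference is that lm-evaluation-harness left-truncates over-long contexts before generation, so the question at the end of each DROP prompt survives while the prompt is forced to fit the context window. A minimal sketch of that strategy (the function and defaults below are illustrative, not lighteval or lm-eval API):

def left_truncate(input_ids, max_length=8192, max_new_tokens=256):
    # Keep only the most recent tokens so that prompt + generation fits in max_length.
    # DROP prompts end with the question, so truncating from the left preserves it.
    budget = max_length - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids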
Env:
transformers version: 4.39.3
Platform: Ubuntu 20.04.6 LTS
Python version: 3.11.9
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.2
Accelerate version: 0.29.2
Lighteval version: 0.4.0.dev0