Description
Evaluating Llama-3-8B on DROP with the standard 3-shot configuration (the setup reported for Llama 3) throws a warning indicating that the input size is greater than the maximum context size allowed by the model:
The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
Here is the command I use:
accelerate launch --num_processes=1 run_evals_accelerate.py \
--model_args "pretrained=meta-llama/Meta-Llama-3-8B" \
--tasks "lighteval|drop|3|0" \
--override_batch_size 16 \
--output_dir "./log/"
I am able to reproduce this even after progressively reducing the batch size to 1.
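To verify that the prompts themselves exceed the context window (rather than this being a batching artifact), one can tokenize a built 3-shot prompt directly. A minimal sketch, assuming access to the gated Llama-3 tokenizer; the prompt string is a placeholder to be replaced with an actual lighteval|drop|3 prompt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Placeholder: paste one of the 3-shot DROP prompts built by lighteval here.
prompt = "..."

n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens vs. a model maximum of 8192")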
Log:
WARNING:lighteval.logging.hierarchical_logger: Model info: ModelInfo(model_name='meta-llama/Meta-Llama-3-8B', model_sha='561487d18c41c76bcb5fc6cfb73a324982f04f47', model_dtype='torch.bfloat16', model_size='15.08 GB')
WARNING:lighteval.logging.hierarchical_logger: } [0:00:15.762582]
WARNING:lighteval.logging.hierarchical_logger: Tasks loading {
WARNING:lighteval.logging.hierarchical_logger: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger: lighteval/drop_harness default
WARNING:lighteval.logging.hierarchical_logger: Loading documents, and requests
WARNING:lighteval.logging.hierarchical_logger: } [0:00:34.926653]
WARNING:lighteval.logging.hierarchical_logger: Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger: setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.000371]
WARNING:lighteval.logging.hierarchical_logger: Evaluation {
WARNING:lighteval.logging.hierarchical_logger: Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger: Running RequestType.GREEDY_UNTIL requests
Splits: 0%| | 0/4 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
The process then either hangs indefinitely until it is killed manually, or crashes as follows:
Note: the following traceback occurred even after reducing the batch size to 1 with --override_batch_size.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9262) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9192) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger: The smallest context of your batch (9538) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
Splits: 0%| | 0/4 [00:40<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:01:02.769634]
WARNING:lighteval.logging.hierarchical_logger:} [0:01:49.892104]
Traceback (most recent call last):
File "/home/vipul.raheja/lighteval/run_evals_accelerate.py", line 82, in <module>
main(args)
File "/home/vipul.raheja/lighteval/src/lighteval/logging/hierarchical_logger.py", line 166, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/main_accelerate.py", line 111, in main
evaluation_tracker = evaluate(
^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/evaluator.py", line 86, in evaluate
full_resps = lm.greedy_until(requests, override_bs=override_bs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vipul.raheja/lighteval/src/lighteval/models/base_model.py", line 570, in greedy_until
max_new_tokens = min(self.max_length - biggest_context, max_new_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
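The last line of the traceback suggests what goes wrong: when the context is longer than self.max_length, the subtraction produces a negative generation budget. Re-running the arithmetic with the values from the log (256 is an assumed default for max_new_tokens, not a value taken from lighteval):

max_length = 8192        # maximum context size allowed by the model
biggest_context = 10010  # context size reported in the first warning
max_new_tokens = min(max_length - biggest_context, 256)  # assumed 256-token budget
print(max_new_tokens)    # -1818: a negative number of new tokens to generate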
Running the same evaluation directly in lm-evaluation-harness does not throw any such warning and proceeds at a reasonable speed:
~/lm-evaluation-harness$ lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks drop --device cuda:0 --batch_size 16
2024-04-21:20:19:29,714 INFO [__main__.py:251] Verbosity set to INFO
2024-04-21:20:19:33,062 INFO [__main__.py:335] Selected Tasks: ['drop']
2024-04-21:20:19:33,063 INFO [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-04-21:20:19:33,064 INFO [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3-8B'}
2024-04-21:20:19:33,164 INFO [huggingface.py:164] Using device 'cuda:0'
Loading checkpoint shards: 100%|█████████████████████████████| 4/4 [00:06<00:00, 1.62s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading builder script: 100%|█████████████████████████████| 7.46k/7.46k [00:00<00:00, 35.8MB/s]
Downloading readme: 100%|█████████████████████████████| 26.0/26.0 [00:00<00:00, 384kB/s]
Downloading data: 100%|█████████████████████████████| 8.31M/8.31M [00:00<00:00, 8.66MB/s]
Generating train split: 77409 examples [00:05, 13452.43 examples/s]
Generating validation split: 9536 examples [00:00, 11649.32 examples/s]
Map: 100%|█████████████████████████████| 77409/77409 [00:10<00:00, 7060.41 examples/s]
Map: 100%|█████████████████████████████| 9536/9536 [00:01<00:00, 4788.74 examples/s]
2024-04-21:20:20:11,675 INFO [task.py:395] Building contexts for drop on rank 0...
100%|█████████████████████████████| 9536/9536 [00:03<00:00, 2793.13it/s]
2024-04-21:20:20:16,260 INFO [evaluator.py:379] Running generate_until requests
Running generate_until requests: 9%|█████▊ | 833/9536 [07:44<1:06:05, 2.19it/s]
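A plausible explanation for the difference is that lm-evaluation-harness left-truncates over-long contexts before generation, so the question at the end of each DROP prompt survives while the prompt is forced to fit the context window. A minimal sketch of that strategy (the function and defaults below are illustrative, not lighteval or lm-eval API):

def left_truncate(input_ids, max_length=8192, max_new_tokens=256):
    # Keep only the most recent tokens so that prompt + generation fits in max_length.
    # DROP prompts end with the question, so truncating from the left preserves it.
    budget = max_length - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids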
Env:
transformers version: 4.39.3
Platform: Ubuntu 20.04.6 LTS
Python version: 3.11.9
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.2
Accelerate version: 0.29.2
Lighteval version: 0.4.0.dev0