
Conversation

alvin319 (Contributor)

Description
In this PR, we fix issue #573 by propagating vLLM's batch-size control parameters to VLLMModelConfig. For a more detailed explanation of these parameters, see vllm-project/vllm#2492.
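As a rough illustration (a sketch, not the exact lighteval code; `max_num_seqs` is the parameter exercised in the tests below, and `max_num_batched_tokens` is included as vLLM's other common batch-size knob), the parameters become optional fields on the model config and are cast from the CLI's string values to integers:

```python
# Illustrative sketch only -- the real VLLMModelConfig in lighteval carries many more fields.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMModelConfig:
    pretrained: str
    revision: str = "main"
    dtype: str = "auto"
    # Batch-size control parameters forwarded to the vLLM engine. They arrive as
    # strings from the comma-separated model-args CLI string, so they are cast to
    # int; None means "keep vLLM's own default".
    max_num_seqs: Optional[int] = None
    max_num_batched_tokens: Optional[int] = None

    def __post_init__(self):
        if self.max_num_seqs is not None:
            self.max_num_seqs = int(self.max_num_seqs)
        if self.max_num_batched_tokens is not None:
            self.max_num_batched_tokens = int(self.max_num_batched_tokens)
```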

Testing

I ran the following CLI commands to invoke a simple evaluation job. The results show that max_num_seqs controls the batch size at the prefill stage and therefore the model's throughput: the run takes 1m4s with max_num_seqs=256 (the default) vs. 3m15s with max_num_seqs=1. I tested this on an AWS g6e.xlarge instance.
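For reference, the same knob is available directly through vLLM's Python API. The snippet below is a minimal sketch (the prompt and sampling settings are made up; only the model name is taken from the runs below) showing the setting that corresponds to the max_num_seqs=1 run:

```python
from vllm import LLM, SamplingParams

# Cap the scheduler at one sequence per batch (vLLM's default is 256), which
# serializes prefill and sharply reduces throughput, as in the 3m15s run below.
llm = LLM(
    model="HuggingFaceTB/SmolLM-1.7B-Instruct",
    revision="main",
    dtype="bfloat16",
    max_num_seqs=1,
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```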

Default

> lighteval vllm "pretrained=HuggingFaceTB/SmolLM-1.7B-Instruct,revision=main,dtype=bfloat16" "leaderboard|truthfulqa:mc|0|0"

[2025-02-25 19:17:49,834] [    INFO]: PyTorch version 2.5.1 available. (config.py:54)
[2025-02-25 19:17:54,082] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-02-25 19:17:54,211] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-02-25 19:18:00,443] [    INFO]: This model supports multiple tasks: {'embed', 'reward', 'score', 'classify', 'generate'}. Defaulting to 'generate'. (config.py:549)
[2025-02-25 19:18:00,444] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='HuggingFaceTB/SmolLM-1.7B-Instruct', speculative_config=None, tokenizer='HuggingFaceTB/SmolLM-1.7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=HuggingFaceTB/SmolLM-1.7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-02-25 19:18:01,286] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-02-25 19:18:01,661] [    INFO]: Starting to load model HuggingFaceTB/SmolLM-1.7B-Instruct... (model_runner.py:1110)
[2025-02-25 19:18:01,814] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.42G/3.42G [04:28<00:00, 10.4MB/s]
[2025-02-25 19:22:30,510] [    INFO]: Time spent downloading weights for HuggingFaceTB/SmolLM-1.7B-Instruct: 268.695215 seconds (weight_utils.py:270)
[2025-02-25 19:22:30,545] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]

[2025-02-25 19:22:31,308] [    INFO]: Loading model weights took 3.1880 GB (model_runner.py:1115)
[2025-02-25 19:22:32,627] [    INFO]: Memory profiling takes 1.07 seconds
the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
model weights take 3.19GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 36.16GiB. (worker.py:267)
[2025-02-25 19:22:32,867] [    INFO]: # cuda blocks: 12342, # CPU blocks: 1365 (executor_base.py:111)
[2025-02-25 19:22:32,867] [    INFO]: Maximum concurrency for 2048 tokens per request: 96.42x (executor_base.py:116)
[2025-02-25 19:22:37,286] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:19<00:00,  1.80it/s]
[2025-02-25 19:22:56,712] [    INFO]: Graph capturing finished in 19 secs, took 0.67 GiB (model_runner.py:1562)
[2025-02-25 19:22:56,713] [    INFO]: init engine (profile, create kv cache, warmup model) took 25.40 seconds (llm_engine.py:436)
[2025-02-25 19:22:56,773] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-02-25 19:22:56,774] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-02-25 19:22:56,776] [    INFO]: truthful_qa multiple_choice (lighteval_task.py:187)
[2025-02-25 19:22:56,776] [ WARNING]: Careful, the task leaderboard|truthfulqa:mc is using evaluation data to build the few shot examples. (lighteval_task.py:260)
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 32.0MB/s]
validation-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 70.8MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 171200.36 examples/s]
[2025-02-25 19:22:57,871] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-02-25 19:22:57,871] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-02-25 19:22:57,871] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:468)
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 9996/9996 [01:04<00:00, 154.40it/s, est. speed input: 30670.51 toks/s, output: 154.40 toks/s]
1it [01:06, 66.12s/it]
[2025-02-25 19:24:10,490] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:500)
[2025-02-25 19:24:10,622] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:542)
|           Task            |Version|    Metric    |Value |   |Stderr|
|---------------------------|------:|--------------|-----:|---|-----:|
|all                        |       |truthfulqa_mc1|0.2485|±  |0.0151|
|                           |       |truthfulqa_mc2|0.3969|±  |0.0144|
|leaderboard:truthfulqa:mc:0|      0|truthfulqa_mc1|0.2485|±  |0.0151|
|                           |       |truthfulqa_mc2|0.3969|±  |0.0144|

[2025-02-25 19:24:10,639] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:532)
[2025-02-25 19:24:10,639] [    INFO]: Saving experiment tracker (evaluation_tracker.py:180)
[2025-02-25 19:24:12,315] [    INFO]: Saving results to /home/ubuntu/lighteval/results/results/HuggingFaceTB/SmolLM-1.7B-Instruct/results_2025-02-25T19-24-10.639358.json (evaluation_tracker.py:234)

max_num_seqs=1

> lighteval vllm "pretrained=HuggingFaceTB/SmolLM-1.7B-Instruct,revision=main,dtype=bfloat16,max_num_seqs=1" "leaderboard|truthfulqa:mc|0|0"

[2025-02-25 19:45:55,275] [    INFO]: PyTorch version 2.5.1 available. (config.py:54)
[2025-02-25 19:45:59,530] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-02-25 19:45:59,657] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-02-25 19:46:05,727] [    INFO]: This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'. (config.py:549)
[2025-02-25 19:46:05,728] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='HuggingFaceTB/SmolLM-1.7B-Instruct', speculative_config=None, tokenizer='HuggingFaceTB/SmolLM-1.7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=HuggingFaceTB/SmolLM-1.7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[1],"max_capture_size":1}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-02-25 19:46:06,576] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-02-25 19:46:06,950] [    INFO]: Starting to load model HuggingFaceTB/SmolLM-1.7B-Instruct... (model_runner.py:1110)
[2025-02-25 19:46:07,096] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
[2025-02-25 19:46:07,134] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]

[2025-02-25 19:46:07,908] [    INFO]: Loading model weights took 3.1880 GB (model_runner.py:1115)
[2025-02-25 19:46:08,889] [    INFO]: Memory profiling takes 0.75 seconds
the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
model weights take 3.19GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.13GiB; the rest of the memory reserved for KV Cache is 36.49GiB. (worker.py:267)
[2025-02-25 19:46:09,120] [    INFO]: # cuda blocks: 12456, # CPU blocks: 1365 (executor_base.py:111)
[2025-02-25 19:46:09,121] [    INFO]: Maximum concurrency for 2048 tokens per request: 97.31x (executor_base.py:116)
[2025-02-25 19:46:12,896] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.84it/s]
[2025-02-25 19:46:13,445] [    INFO]: Graph capturing finished in 1 secs, took 0.06 GiB (model_runner.py:1562)
[2025-02-25 19:46:13,446] [    INFO]: init engine (profile, create kv cache, warmup model) took 5.54 seconds (llm_engine.py:436)
[2025-02-25 19:46:13,522] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-02-25 19:46:13,523] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-02-25 19:46:13,526] [    INFO]: truthful_qa multiple_choice (lighteval_task.py:187)
[2025-02-25 19:46:13,527] [ WARNING]: Careful, the task leaderboard|truthfulqa:mc is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-02-25 19:46:14,749] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-02-25 19:46:14,749] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-02-25 19:46:14,749] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:468)
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 9996/9996 [03:15<00:00, 51.24it/s, est. speed input: 10179.70 toks/s, output: 51.24 toks/s]
1it [03:16, 196.46s/it]
[2025-02-25 19:49:39,146] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:500)
[2025-02-25 19:49:39,277] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:542)
|           Task            |Version|    Metric    |Value |   |Stderr|
|---------------------------|------:|--------------|-----:|---|-----:|
|all                        |       |truthfulqa_mc1|0.2497|±  |0.0152|
|                           |       |truthfulqa_mc2|0.3966|±  |0.0144|
|leaderboard:truthfulqa:mc:0|      0|truthfulqa_mc1|0.2497|±  |0.0152|
|                           |       |truthfulqa_mc2|0.3966|±  |0.0144|

[2025-02-25 19:49:39,293] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:532)
[2025-02-25 19:49:39,293] [    INFO]: Saving experiment tracker (evaluation_tracker.py:180)
[2025-02-25 19:49:40,897] [    INFO]: Saving results to /home/ubuntu/lighteval/results/results/HuggingFaceTB/SmolLM-1.7B-Instruct/results_2025-02-25T19-49-39.294027.json (evaluation_tracker.py:234)

NathanHB (Member) left a comment

Thanks! We only need to make sure that when the user does not specify these parameters, we fall back to the defaults used by vLLM.
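One way to do that (a sketch under the assumption of a hypothetical build_llm helper, not the code that actually landed) is to keep the new fields optional and forward them to the engine only when the user set them, so vLLM's built-in defaults apply otherwise:

```python
from typing import Optional

from vllm import LLM


def build_llm(
    pretrained: str,
    max_num_seqs: Optional[int] = None,
    max_num_batched_tokens: Optional[int] = None,
    **engine_kwargs,
) -> LLM:
    # Only forward batch-size knobs the user actually specified; omitting them
    # lets vLLM fall back to its own defaults (e.g. max_num_seqs=256).
    if max_num_seqs is not None:
        engine_kwargs["max_num_seqs"] = int(max_num_seqs)
    if max_num_batched_tokens is not None:
        engine_kwargs["max_num_batched_tokens"] = int(max_num_batched_tokens)
    return LLM(model=pretrained, **engine_kwargs)
```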

alvin319 requested a review from NathanHB on March 4, 2025, 17:26
alvin319 (Contributor, Author) commented on Mar 4, 2025

@NathanHB should be ready for another round of review!

HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

alvin319 (Contributor, Author)

@NathanHB thanks for the review! Should I merge this now?

NathanHB merged commit bbbdd22 into huggingface:main on Mar 25, 2025
3 checks passed
alvin319 deleted the vllm-batch-size-control branch on March 25, 2025, 19:52
hynky1999 pushed a commit that referenced this pull request May 22, 2025
* expose vLLM batch size control config

* comments

* type casting

* bump

* fix defaults

NathanHB pushed a commit that referenced this pull request Sep 19, 2025
* expose vLLM batch size control config

* comments

* type casting

* bump

* fix defaults