refactoring of the benchmark dataset capabilities and added a random txt slices dataset option #30150

hypdeb wants to merge 1 commit into vllm-project:main
Conversation
Commit message: …txt slices dataset option

Signed-off-by: Julien Debache <jdebache@cpu-0005.cm.cluster>
Code Review
This pull request introduces a new vllm.benchmarks.datasets module, centralizing benchmark dataset handling. It defines a BenchmarkDataset abstract base class and a SampleRequest dataclass for standardized request representation.

The module includes implementations for various datasets: CustomDataset (for JSONL prompts), SpecBench (for Spec-Bench JSONL), BurstGPTDataset (from CSV), ShareGPTDataset (from JSON), SonnetDataset (from text files), PrefixRepetitionRandomDataset (synthetic with repeated prefixes), TxtSlicesDataset (from text files/URLs), and several HuggingFaceDataset subclasses (ConversationDataset, MultiModalConversationDataset, VisionArenaDataset, MMVUDataset, InstructCoderDataset, MTBenchDataset, BlazeditDataset, AIMODataset, NextEditPredictionDataset, ASRDataset, MLPerfDataset, MMStarDataset). A RandomDataset and its multimodal (RandomMultiModalDataset) and reranking (RandomDatasetForReranking) variants are also added for synthetic data generation. The arg_utils.py file provides a comprehensive argument parser for configuring these datasets.

Review comments highlighted the need to refine the maybe_oversample_requests logic in abstractions.py to correctly handle None request_ids, fix an incorrect super().sample(**kwargs) call in SpecBench, and correct the __all__ list in __init__.py by removing duplicates and adding missing exports. Additionally, a comment suggested ensuring request_ids are always assigned in PrefixRepetitionRandomDataset for consistency.
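As a rough illustration of the pattern the module follows (an abstract base class producing a standardized request dataclass), here is a minimal, hypothetical sketch; the field names and signatures are illustrative, not the PR's exact API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRequest:
    # Hypothetical field set; the real dataclass in the PR may differ.
    prompt: str
    prompt_len: int
    expected_output_len: int
    request_id: Optional[str] = None


class BenchmarkDataset(ABC):
    @abstractmethod
    def sample(self, num_requests: int, **kwargs) -> list[SampleRequest]:
        """Return num_requests standardized requests."""


class FixedPromptDataset(BenchmarkDataset):
    """Toy concrete dataset: every prompt is the same string."""

    def sample(self, num_requests: int, **kwargs) -> list[SampleRequest]:
        return [
            SampleRequest(
                prompt="hello world",
                prompt_len=2,
                expected_output_len=16,
                request_id=f"fixed-{i}",
            )
            for i in range(num_requests)
        ]


reqs = FixedPromptDataset().sample(3)
print([r.request_id for r in reqs])
```

The benefit of this design is that every dataset, synthetic or file-backed, yields the same SampleRequest shape, so the benchmark driver does not need per-dataset code paths.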
```python
ids = [req.request_id for req in requests]
if len(ids) != len(set(ids)):
    raise ValueError(
        "Duplicate request_id found in the sampled "
        "requests. Please ensure that each request_id "
        "is unique."
    )
```
The check for duplicate request_ids is too strict. If multiple requests have request_id=None (which is allowed by the SampleRequest type hint), this check will incorrectly raise a ValueError for duplicate IDs. The check should only enforce uniqueness among non-None request_ids.
Suggested change:

```diff
-ids = [req.request_id for req in requests]
+ids = [req.request_id for req in requests if req.request_id is not None]
 if len(ids) != len(set(ids)):
     raise ValueError(
         "Duplicate request_id found in the sampled "
         "requests. Please ensure that each request_id "
         "is unique."
     )
```
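The behavior the suggestion is after can be shown with a small self-contained sketch (the Req class here is a minimal stand-in for the PR's SampleRequest, not its actual definition):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Req:
    # Minimal stand-in, just enough to demonstrate the check.
    prompt: str
    request_id: Optional[str] = None


def check_unique_ids(requests: list[Req]) -> None:
    # Only enforce uniqueness among non-None ids: several requests
    # with request_id=None must not count as duplicates.
    ids = [req.request_id for req in requests if req.request_id is not None]
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate request_id found in the sampled requests.")


# Multiple None ids pass without error.
check_unique_ids([Req("a"), Req("b")])

# Genuine duplicates are still rejected.
try:
    check_unique_ids([Req("a", "r0"), Req("b", "r0")])
except ValueError as e:
    print("rejected:", e)
```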
```python
    **kwargs,
) -> list[SampleRequest]:
    # leverage CustomDataset sample
    return super().sample(**kwargs)
```
The call super().sample(**kwargs) is incorrect. It only passes the kwargs dictionary to the parent method, omitting all other named arguments like tokenizer, num_requests, etc. This will lead to a TypeError at runtime.
You should pass all arguments to the parent's sample method. A better long-term solution would be to remove this overridden sample method entirely and let SpecBench inherit it directly from CustomDataset.
Suggested change:

```diff
-    return super().sample(**kwargs)
+    return super().sample(
+        tokenizer=tokenizer,
+        num_requests=num_requests,
+        request_id_prefix=request_id_prefix,
+        no_oversample=no_oversample,
+        lora_path=lora_path,
+        max_loras=max_loras,
+        output_len=output_len,
+        enable_multimodal_chat=enable_multimodal_chat,
+        skip_chat_template=skip_chat_template,
+        **kwargs,
+    )
```
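The underlying Python pitfall can be demonstrated in isolation; the class and method names below are illustrative, not the PR's actual classes:

```python
class Base:
    def sample(self, tokenizer, num_requests, **kwargs):
        return f"tokenizer={tokenizer}, num_requests={num_requests}, extra={kwargs}"


class BrokenChild(Base):
    def sample(self, tokenizer, num_requests, **kwargs):
        # Bug: forwards only **kwargs. The named parameters tokenizer and
        # num_requests are captured by this method's signature, so they are
        # NOT inside kwargs, and Base.sample raises TypeError for the
        # missing positional arguments.
        return super().sample(**kwargs)


class FixedChild(Base):
    def sample(self, tokenizer, num_requests, **kwargs):
        # Fix: forward every named parameter explicitly alongside **kwargs.
        return super().sample(tokenizer=tokenizer, num_requests=num_requests, **kwargs)


try:
    BrokenChild().sample(tokenizer="tok", num_requests=3)
except TypeError as e:
    print("broken:", e)

print("fixed:", FixedChild().sample(tokenizer="tok", num_requests=3))
```

This is why dropping the override entirely, as the review suggests, is the safer long-term fix: the inherited method then receives all arguments directly.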
```python
__all__ = [
    "add_dataset_parser",
    "get_samples",
    "BenchmarkDataset",
    "SampleRequest",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "ShareGPTDataset",
    "SonnetDataset",
    "BurstGPTDataset",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "TxtSlicesDataset",
]
```
The __all__ list has several issues:

- It contains duplicate entries for many dataset classes (e.g., CustomDataset, SpecBench, HuggingFaceDataset and its subclasses).
- It is missing some imported classes like AIMODataset, ASRDataset, BlazeditDataset, and ConversationDataset.

To improve correctness and maintainability, the list should be cleaned up to remove duplicates and include all public objects from this module. Sorting the list alphabetically is also a good practice.
Suggested change:

```python
__all__ = [
    "add_dataset_parser",
    "AIMODataset",
    "ASRDataset",
    "BenchmarkDataset",
    "BlazeditDataset",
    "BurstGPTDataset",
    "ConversationDataset",
    "CustomDataset",
    "get_samples",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "SampleRequest",
    "ShareGPTDataset",
    "SonnetDataset",
    "SpecBench",
    "TxtSlicesDataset",
    "VisionArenaDataset",
]
```

```python
requests.append(
    SampleRequest(
        prompt=prompt,
        prompt_len=prompt_len,
        expected_output_len=output_len,
    )
)
```
The SampleRequest objects are created without a request_id. While this may not cause an immediate bug if this dataset isn't used with features that require unique IDs (like oversampling), it's inconsistent with other dataset implementations and good practice. The request_id_prefix argument is available but unused. Assigning a unique request_id to each request is important for tracking and debugging.
Suggested change:

```diff
 requests.append(
     SampleRequest(
         prompt=prompt,
         prompt_len=prompt_len,
         expected_output_len=output_len,
+        request_id=request_id_prefix + str(len(requests)),
     )
 )
```
This is too large.
Purpose
The goal of this dataset is to provide a way to generate a benchmarking dataset with sequences that are more realistic than the standard random option. This is useful when benchmarking expert parallel and speculative decoding based deployments.
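A very rough illustration of the "txt slices" idea, as I understand it from the PR title: instead of sampling uniformly random token ids, take random contiguous slices of a real text so prompts have natural-language statistics. This sketch is hypothetical and character-based for simplicity; the actual TxtSlicesDataset operates on tokenized input and has a different interface:

```python
import random


def sample_text_slices(text: str, num_requests: int, slice_len_chars: int,
                       seed: int = 0) -> list[str]:
    # Draw random contiguous character slices from the source text.
    # Each slice is a "realistic" prompt compared to random token ids.
    rng = random.Random(seed)
    max_start = max(len(text) - slice_len_chars, 0)
    return [
        text[start : start + slice_len_chars]
        for start in (rng.randint(0, max_start) for _ in range(num_requests))
    ]


corpus = "The quick brown fox jumps over the lazy dog. " * 50
prompts = sample_text_slices(corpus, num_requests=4, slice_len_chars=60)
print(prompts[0])
```

Realistic token distributions matter for the deployments mentioned above: expert routing and draft-model acceptance rates both depend on input statistics that uniformly random tokens do not reproduce.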
I took the opportunity to refactor the datasets, as this single file was growing huge and becoming hard to maintain. I also fixed some type-hinting issues.
TODO: add some more unit tests.
Note that I would like to merge this after #30009.