
Conversation


@FrankD412 FrankD412 commented Nov 18, 2025

Summary by CodeRabbit

  • New Features

    • Added dataset preparation CLI command with configurable tokenizer, output path, and random seed options.
    • New real dataset tool for extracting and preparing HuggingFace datasets with text and multimodal support.
    • New synthetic dataset generators: token lengths drawn from normal and uniform distributions.
    • Added support for LoRA configuration and randomized task IDs across datasets.
  • Chores

    • Updated .gitignore to exclude macOS metadata files.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

Signed-off-by: Frank Di Natale <[email protected]>
@FrankD412 FrankD412 self-assigned this Nov 18, 2025
@FrankD412 FrankD412 requested a review from a team as a code owner November 18, 2025 06:35
@FrankD412 FrankD412 requested a review from nv-yilinf November 18, 2025 06:35

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

This pull request adds dataset preparation infrastructure to the TensorRT LLM benchmarking suite, introducing CLI commands to generate synthetic datasets and prepare real HuggingFace datasets, along with supporting data models and utilities. It also updates the main bench CLI to register the new prepare-dataset command and adds .DS_Store to .gitignore.

Changes

  • Configuration (.gitignore): Adds .DS_Store to ignore macOS Finder metadata files.
  • Dataset Preparation CLI Entry Point (tensorrt_llm/bench/dataset/prepare_dataset.py): Introduces the Click group prepare_dataset with CLI options (output, random-seed, task-id, rand-task-id, lora-dir, log-level, trust-remote-code). Defines the Pydantic model RootArgs for storing parsed arguments, including tokenizer validation and loading. Registers subcommands for real and synthetic (normal- and uniform-distribution) dataset generation.
  • Real Dataset Preparation (tensorrt_llm/bench/dataset/prepare_real_data.py): Adds the real-dataset CLI command with a DatasetConfig model for HuggingFace dataset configuration. Provides utilities to load datasets, handle text-only and multimodal (image) inputs, support configurable output length distributions, and serialize to JSON. Includes validation for prompt key/source specifications.
  • Synthetic Dataset Generation (tensorrt_llm/bench/dataset/prepare_synthetic_data.py): Implements the token_norm_dist and token_unif_dist CLI commands to generate synthetic token datasets using normal and uniform distributions, respectively. Handles task ID generation, LoRA configuration, and structured dataset output with metadata.
  • Dataset Utilities & Models (tensorrt_llm/bench/dataset/utils.py): Defines the TextSample, MultimodalSample, and Workload Pydantic models. Provides serialization (text_dataset_dump, multimodal_dataset_dump), JSON streaming (print_text_dataset, print_multimodal_dataset), distribution sampling (get_norm_dist_lengths, get_unif_dist_lengths, get_exponential_dist_delays), and random-token generation utilities with deterministic seeding.
  • CLI Registration (tensorrt_llm/commands/bench.py): Imports and registers prepare_dataset as a subcommand in the trtllm-bench CLI group.
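As context for the CLI wiring summarized above, here is a minimal, self-contained sketch of how a Click group with options and a subcommand can be registered on a parent group. All names, options, and defaults below are illustrative stand-ins, not the actual TensorRT-LLM code.

```python
import click


@click.group(name="prepare-dataset")
@click.option("--output", default="dataset.json", help="Output JSON path.")
@click.option("--random-seed", default=420, type=int, help="RNG seed.")
@click.pass_context
def prepare_dataset(ctx, output, random_seed):
    """Prepare a benchmarking dataset (hypothetical sketch)."""
    # Stash parsed group-level options for subcommands to consume.
    ctx.obj = {"output": output, "random_seed": random_seed}


@prepare_dataset.command(name="token-norm-dist")
@click.option("--input-mean", required=True, type=int)
@click.pass_obj
def token_norm_dist(root, input_mean):
    # Subcommand reads the group-level context object.
    click.echo(f"seed={root['random_seed']} input_mean={input_mean}")


@click.group()
def bench():
    """Top-level CLI group, standing in for trtllm-bench."""


bench.add_command(prepare_dataset)
```

Invoked as `bench prepare-dataset --random-seed 7 token-norm-dist --input-mean 128`; group options must precede the subcommand name.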

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PrepareDataset as prepare_dataset<br/>(CLI Group)
    participant RootArgs as RootArgs<br/>(Validation)
    participant Subcommands as Subcommands<br/>(real-dataset,<br/>token_norm_dist,<br/>token_unif_dist)
    participant Utils as Dataset Utils<br/>(Serialization &<br/>Sampling)
    participant Output as Output File

    User->>PrepareDataset: Run CLI with options
    PrepareDataset->>RootArgs: Parse CLI args & build model
    activate RootArgs
    RootArgs->>RootArgs: Load & validate tokenizer
    RootArgs-->>PrepareDataset: Return validated args
    deactivate RootArgs
    
    PrepareDataset->>Subcommands: Invoke subcommand<br/>(real-dataset or synthetic)
    activate Subcommands
    alt Real Dataset Path
        Subcommands->>Subcommands: Load HF dataset<br/>& DatasetConfig
        Subcommands->>Utils: Generate samples<br/>(text or multimodal)
    else Synthetic Dataset Path
        Subcommands->>Utils: Sample lengths from<br/>distribution
        Subcommands->>Utils: Generate random tokens
    end
    Subcommands->>Utils: Serialize Workload<br/>(TextSample/MultimodalSample)
    Utils->>Output: Write JSON to file
    Subcommands-->>User: Complete
    deactivate Subcommands

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • prepare_real_data.py: HuggingFace dataset loading and multimodal (image) handling logic with PIL Image to file conversion; validation of dataset configuration options
  • utils.py: Deterministic randomization logic across multiple distribution functions; JSON serialization via Pydantic models; handling of both text and multimodal sample paths
  • prepare_synthetic_data.py: Token generation logic ensuring EOS token exclusion; task ID and LoRA configuration derivation
  • prepare_dataset.py: Tokenizer loading and validation in Pydantic model hook; context object wiring across CLI subcommands
  • Cross-file integration: Interaction between RootArgs, subcommands, and utility functions; correct threading of metadata, seeds, and configuration through the pipeline
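The deterministic length-sampling helpers flagged above can be sketched with the standard library alone (the real utils.py implementations likely use numpy; the bodies below are assumptions that only mirror the described behavior and names):

```python
import random


def get_norm_dist_lengths(mean, stdev, num_reqs, seed):
    # Deterministic seeding via a private Random instance, per the review notes.
    rng = random.Random(seed)
    # Clamp to at least 1 token; a normal draw can be zero or negative.
    return [max(1, round(rng.gauss(mean, stdev))) for _ in range(num_reqs)]


def get_unif_dist_lengths(low, high, num_reqs, seed):
    rng = random.Random(seed)
    # randint is inclusive on both ends, matching the "(inclusive)" help text.
    return [rng.randint(low, high) for _ in range(num_reqs)]
```

Reusing the same seed reproduces the same length list, which is what makes benchmark datasets repeatable.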

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ⚠️ Warning: The PR description is entirely empty; required sections (title format, description, test coverage) are not filled in, only the template structure remains. Fill in the PR title following the template (e.g., [TRTLLM-9089][feat] Port prepare_dataset into trtllm-bench), add a clear description of changes, and document test coverage for the new dataset preparation features.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly describes the main change: porting the prepare_dataset functionality into the trtllm-bench CLI command, which aligns with the file additions and integrations shown in the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (8)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)

34-49: Improve exception chaining for better debugging.

The exception handling would benefit from using raise ... from e to preserve the original traceback context.

Apply this diff:

         except EnvironmentError as e:
-            raise ValueError(
+            raise ValueError(
                 "Cannot find a tokenizer from the given string because of "
                 f"{e}\nPlease set tokenizer to the directory that contains "
                 "the tokenizer, or set to a model name in HuggingFace."
-            )
+            ) from e
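A standalone illustration of why `raise ... from e` matters: the original exception survives as `__cause__`, so tracebacks show both the root failure and the re-raised error. The function name and messages here are illustrative, not the actual prepare_dataset code:

```python
def load_tokenizer(name):
    # Hypothetical stand-in for a tokenizer lookup that fails.
    try:
        raise EnvironmentError(f"no tokenizer files for {name!r}")
    except EnvironmentError as e:
        # Chaining preserves the original traceback on the new exception.
        raise ValueError(
            "Cannot find a tokenizer from the given string. Please set "
            "tokenizer to a directory or a HuggingFace model name."
        ) from e
```

Without `from e`, the original EnvironmentError is shown as "another exception occurred during handling", which reads as a bug in the error handler rather than the root cause.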
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (3)

15-38: Add stacklevel to warning for better debugging.

The warning should include stacklevel=2 to point to the caller's code rather than this helper function.

Apply this diff:

     if use_task_ids and not use_lora:
         warnings.warn(
-            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.", UserWarning
+            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.", 
+            UserWarning, 
+            stacklevel=2
         )
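A runnable illustration of the `stacklevel` suggestion, using an assumed plain-function version of the check (the real code lives in a Click command; this shape is hypothetical):

```python
import warnings


def check_task_ids(use_task_ids, use_lora):
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this helper, which makes the warning actionable.
    if use_task_ids and not use_lora:
        warnings.warn(
            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.",
            UserWarning,
            stacklevel=2,
        )
```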

50-52: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state.

Apply this diff:

-    input_ids = []
-    input_lens = []
-    output_lens = []
-
     input_lens = get_norm_dist_lengths(

110-112: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state.

Apply this diff:

-    input_ids = []
-    input_lens = []
-    output_lens = []
-
     input_lens = get_unif_dist_lengths(
tensorrt_llm/bench/dataset/prepare_real_data.py (2)

21-32: Use Click's BadParameter exception for better error reporting.

Using click.BadParameter provides better integration with Click's error handling and gives users clearer feedback about which parameter failed.

Apply this diff:

 def validate_output_len_dist(ctx, param, value):
     """Validate the --output-len-dist option."""
     if value is None:
         return value
     m = re.match(r"(\d+),(\d+)", value)
     if m:
         return int(m.group(1)), int(m.group(2))
     else:
-        raise AssertionError(
+        raise click.BadParameter(
             "Incorrect specification for --output-len-dist. Correct format: "
             "--output-len-dist <output_len_mean>,<output_len_stdev>"
         )
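A plain-function sketch of the same parsing logic, runnable outside Click. Note the real callback uses `re.match`; `re.fullmatch` below is a small assumed improvement that also rejects trailing garbage like "12,34x":

```python
import re


def parse_output_len_dist(value):
    """Parse "<mean>,<stdev>" into an int pair, or pass None through."""
    if value is None:
        return None
    m = re.fullmatch(r"(\d+),(\d+)", value)
    if not m:
        raise ValueError(
            "Incorrect specification for --output-len-dist. Correct format: "
            "--output-len-dist <output_len_mean>,<output_len_stdev>"
        )
    return int(m.group(1)), int(m.group(2))
```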

55-61: Use ValueError instead of AssertionError in validator.

Validators should raise ValueError for invalid data rather than AssertionError, which is typically reserved for internal invariants.

Apply this diff:

     @model_validator(mode="after")
     def check_prompt(self) -> "DatasetConfig":
         if self.prompt_key and self.prompt:
-            raise AssertionError("--prompt-key and --prompt cannot be set at the same time.")
+            raise ValueError("--prompt-key and --prompt cannot be set at the same time.")
         if (not self.prompt_key) and (not self.prompt):
-            raise AssertionError("Either --prompt-key or --prompt must be set.")
+            raise ValueError("Either --prompt-key or --prompt must be set.")
         return self
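The validator logic as a plain function shows why `ValueError` is preferable here: `assert` statements are stripped entirely under `python -O`, silently disabling the check, while a raised `ValueError` always fires. This standalone shape is illustrative, not the Pydantic model itself:

```python
def check_prompt(prompt_key, prompt):
    # Exactly one of the two sources must be provided.
    if prompt_key and prompt:
        raise ValueError("--prompt-key and --prompt cannot be set at the same time.")
    if (not prompt_key) and (not prompt):
        raise ValueError("Either --prompt-key or --prompt must be set.")
    return prompt_key or prompt
```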
tensorrt_llm/bench/dataset/utils.py (2)

29-40: Consider using Pydantic's standard patterns.

The custom __init__ could be replaced with a @model_validator(mode='after') or model_post_init hook for better alignment with Pydantic conventions.

Example using @model_validator:

 class Workload(BaseModel):
     metadata: dict
     samples: List[Union[TextSample, MultimodalSample]] = []
 
-    def __init__(self, **kwargs) -> None:
-        super().__init__(**kwargs)
-        self.setup_workload_name()
-
-    def setup_workload_name(self):
+    @model_validator(mode='after')
+    def setup_workload_name(self):
         # Keys to ignore
         ignore_keys = ["tokenizer"]
         # Create a string by concatenating keys and values with "__"
         workload_name = "__".join(
             f"{key}:{value}" for key, value in self.metadata.items() if key not in ignore_keys
         )
         self.metadata.setdefault("workload_name", workload_name)
+        return self

96-104: Add strict parameter to zip for safety.

Without strict=True, mismatched list lengths would be silently truncated, potentially hiding bugs.

Apply this diff:

 def print_multimodal_dataset(multimodal_texts, multimodal_image_paths, output_lens):
-    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths)):
+    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths, strict=True)):
         d = {
             "task_id": i,
             "prompt": text,
             "media_paths": image_paths,
             "output_tokens": output_lens[i],
         }
         yield json.dumps(d, separators=(",", ":"), ensure_ascii=False)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96cfdd8 and 5910dfe.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_dataset.py (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_real_data.py (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1 hunks)
  • tensorrt_llm/bench/dataset/utils.py (1 hunks)
  • tensorrt_llm/commands/bench.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tensorrt_llm/bench/dataset/prepare_dataset.py
  • tensorrt_llm/bench/dataset/utils.py
  • tensorrt_llm/bench/dataset/prepare_synthetic_data.py
🧬 Code graph analysis (4)
tensorrt_llm/commands/bench.py (1)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)
  • prepare_dataset (73-84)
tensorrt_llm/bench/dataset/prepare_dataset.py (4)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
  • real_dataset (209-314)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (2)
  • token_norm_dist (48-90)
  • token_unif_dist (108-150)
tensorrt_llm/builder.py (1)
  • default (45-50)
tensorrt_llm/bench/benchmark/__init__.py (1)
  • checkpoint_path (75-76)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
tensorrt_llm/bench/dataset/utils.py (3)
  • get_norm_dist_lengths (122-126)
  • print_multimodal_dataset (96-104)
  • print_text_dataset (79-93)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1)
tensorrt_llm/bench/dataset/utils.py (5)
  • gen_random_tokens (136-157)
  • get_norm_dist_lengths (122-126)
  • get_unif_dist_lengths (129-133)
  • print_text_dataset (79-93)
  • text_dataset_dump (43-57)
🪛 Ruff (0.14.5)
tensorrt_llm/bench/dataset/prepare_dataset.py

41-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


41-45: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/bench/dataset/utils.py

97-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


142-142: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

tensorrt_llm/bench/dataset/prepare_real_data.py

21-21: Unused function argument: ctx

(ARG001)


21-21: Unused function argument: param

(ARG001)


29-32: Avoid specifying long messages outside the exception class

(TRY003)


58-58: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


109-115: Avoid specifying long messages outside the exception class

(TRY003)


150-150: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


247-247: Avoid specifying long messages outside the exception class

(TRY003)


268-268: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

tensorrt_llm/bench/dataset/prepare_synthetic_data.py

21-21: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


30-30: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
.gitignore (1)

88-90: ✓ Sensible addition to ignore macOS system files.

Adding .DS_Store to .gitignore is a standard best practice that prevents macOS Finder metadata from being committed. The comment clearly delineates the macOS-specific section, and the placement is logical.

tensorrt_llm/commands/bench.py (1)

9-9: LGTM! Clean integration of the new dataset preparation command.

The import and command registration follow the existing pattern consistently.

Also applies to: 61-61

tensorrt_llm/bench/dataset/prepare_dataset.py (2)

25-32: LGTM! Well-structured data model.

The RootArgs model cleanly captures all necessary configuration parameters for dataset preparation.


87-89: LGTM! Command registration follows Click conventions.

tensorrt_llm/bench/dataset/utils.py (2)

43-76: LGTM! Clean serialization functions.

Both dump functions properly handle directory creation and JSON serialization.


136-157: LGTM! Robust random token generation.

The function correctly handles EOS token exclusion and uses deterministic seeding for reproducibility. Based on learnings, this aligns with TensorRT-LLM's approach of prioritizing sequence length accuracy over semantic correctness for benchmarking.
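The EOS-exclusion idea can be sketched with the standard library; the actual `gen_random_tokens` signature and sampling strategy are assumptions here, kept only to show re-drawing around a forbidden token id with deterministic seeding:

```python
import random


def gen_random_tokens(num_tokens, vocab_size, eos_id, seed):
    # Sample ids uniformly from [0, vocab_size); re-draw any hit on the EOS
    # id so the generated sequence cannot terminate early. A fixed seed
    # makes the output reproducible across runs.
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < num_tokens:
        tok = rng.randrange(vocab_size)
        if tok != eos_id:
            tokens.append(tok)
    return tokens
```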

Comment on lines 73 to 84
def prepare_dataset(ctx, **kwargs):
    """Prepare dataset for benchmarking with trtllm-bench."""
    model = ctx.obj.model or ctx.obj.checkpoint_path
    ctx.obj = RootArgs(
        tokenizer=model,
        output=kwargs["output"],
        random_seed=kwargs["random_seed"],
        task_id=kwargs["task_id"],
        rand_task_id=kwargs["rand_task_id"],
        lora_dir=kwargs["lora_dir"],
        trust_remote_code=kwargs["trust_remote_code"],
    )

⚠️ Potential issue | 🔴 Critical

Critical: Missing std_out field in RootArgs model.

The prepare_synthetic_data.py commands reference root_args.std_out (lines 70 and 130), but the RootArgs model doesn't define this field. This will cause an AttributeError at runtime.

Add the missing field to the RootArgs model:

 class RootArgs(BaseModel):
     tokenizer: str
     output: str
     random_seed: int
     task_id: int
     trust_remote_code: bool = False
     rand_task_id: Optional[Tuple[int, int]]
     lora_dir: Optional[str] = None
+    std_out: bool = False

And add the corresponding CLI option:

 @click.option(
     "--trust-remote-code",
     is_flag=True,
     default=False,
     envvar="TRUST_REMOTE_CODE",
     help="Trust remote code.",
 )
+@click.option(
+    "--std-out",
+    is_flag=True,
+    default=False,
+    help="Output to stdout instead of file.",
+)
 @click.pass_context
 def prepare_dataset(ctx, **kwargs):

And pass it to RootArgs:

     ctx.obj = RootArgs(
         tokenizer=model,
         output=kwargs["output"],
         random_seed=kwargs["random_seed"],
         task_id=kwargs["task_id"],
         rand_task_id=kwargs["rand_task_id"],
         lora_dir=kwargs["lora_dir"],
         trust_remote_code=kwargs["trust_remote_code"],
+        std_out=kwargs["std_out"],
     )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_dataset.py around lines 73-84,
prepare_dataset constructs RootArgs but omits the std_out field referenced
elsewhere, causing AttributeError; add a std_out attribute to the RootArgs model
definition (where RootArgs/dataclass is declared), add a corresponding CLI
option (e.g., --std-out / std_out) to the CLI that builds kwargs so it is
captured from user input, and pass that value into RootArgs here
(std_out=kwargs["std_out"]) so root_args.std_out is available to
prepare_synthetic_data.py.

Comment on lines +145 to +150
    except ValueError as e:
        if "Config" in e:
            e += "\n Please add the config name to the dataset config yaml."
        elif "split" in e:
            e += "\n Please specify supported split in the dataset config yaml."
        raise ValueError(e)

⚠️ Potential issue | 🔴 Critical

Critical: Fix exception handling logic.

The code attempts to concatenate strings to an exception object, which will fail. The error message should be constructed properly and re-raised with proper exception chaining.

Apply this diff:

     except ValueError as e:
+        error_msg = str(e)
         if "Config" in str(e):
-            e += "\n Please add the config name to the dataset config yaml."
+            error_msg += "\n Please add the config name to the dataset config yaml."
         elif "split" in str(e):
-            e += "\n Please specify supported split in the dataset config yaml."
-        raise ValueError(e)
+            error_msg += "\n Please specify supported split in the dataset config yaml."
+        raise ValueError(error_msg) from e
🧰 Tools
🪛 Ruff (0.14.5)

150-150: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_real_data.py around lines 145 to 150, the
except block incorrectly tries to concatenate strings to the exception object (e
+= ...), which will raise a TypeError; instead, build a new message string from
the original exception text (e.g., msg = str(e) + " ..."), append the
appropriate guidance for "Config" or "split", then raise a new ValueError(msg)
while preserving exception chaining using "raise ValueError(msg) from e".

Comment on lines +224 to +230
        if any(key in req for key in ["image", "image_1", "video"]):
            # multimodal input
            if "video" in req and req["video"] is not None:
                assert "Not supported yet"
            assert kwargs["output_len_dist"] is not None, (
                "Output length distribution must be set for multimodal requests."
            )

⚠️ Potential issue | 🟠 Major

Replace assertions with proper exceptions.

Using bare assert statements with string messages and for input validation is incorrect. These should be proper exceptions.

Apply this diff:

         if any(key in req for key in ["image", "image_1", "video"]):
             # multimodal input
             if "video" in req and req["video"] is not None:
-                assert "Not supported yet"
-            assert kwargs["output_len_dist"] is not None, (
-                "Output length distribution must be set for multimodal requests."
-            )
+                raise NotImplementedError("Video modality is not supported yet")
+            if kwargs["output_len_dist"] is None:
+                raise ValueError("Output length distribution must be set for multimodal requests.")
🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_real_data.py around lines 224 to 230,
replace the bare assert statements used for input validation with proper
exceptions: when "video" is present and not supported, raise
NotImplementedError("Video input not supported yet") instead of assert "Not
supported yet"; and when kwargs["output_len_dist"] is required but missing or
None, raise ValueError("Output length distribution must be set for multimodal
requests.") (also guard with kwargs.get("output_len_dist") is None to avoid
KeyError). Ensure the exceptions provide the same messages and stop execution
cleanly.

Comment on lines +95 to +106
@click.option(
    "--input-min", required=True, type=int, help="uniform dist (inclusive) min for input tokens"
)
@click.option(
    "--input-max", required=True, type=int, help="normal dist (inclusive) max for input tokens"
)
@click.option(
    "--output-min", required=True, type=int, help="normal dist (inclusive) min for output tokens"
)
@click.option(
    "--output-max", required=True, type=int, help="normal dist (inclusive) max for output tokens"
)

⚠️ Potential issue | 🟡 Minor

Fix copy-paste errors in help text.

The help text incorrectly says "normal dist" instead of "uniform dist" for the uniform distribution command.

Apply this diff:

 @click.option(
     "--input-min", required=True, type=int, help="uniform dist (inclusive) min for input tokens"
 )
 @click.option(
-    "--input-max", required=True, type=int, help="normal dist (inclusive) max for input tokens"
+    "--input-max", required=True, type=int, help="uniform dist (inclusive) max for input tokens"
 )
 @click.option(
-    "--output-min", required=True, type=int, help="normal dist (inclusive) min for output tokens"
+    "--output-min", required=True, type=int, help="uniform dist (inclusive) min for output tokens"
 )
 @click.option(
-    "--output-max", required=True, type=int, help="normal dist (inclusive) max for output tokens"
+    "--output-max", required=True, type=int, help="uniform dist (inclusive) max for output tokens"
 )
🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_synthetic_data.py around lines 95 to 106,
the click.option help strings were copy-pasted and incorrectly label some
uniform-distribution flags as "normal dist"; update the help text for
--input-max, --output-min, and --output-max to say "uniform dist (inclusive)
..." matching --input-min (use "uniform dist (inclusive) max for input tokens"
for --input-max and "uniform dist (inclusive) min/max for output tokens" for
--output-min and --output-max respectively) so all four flags consistently
describe a uniform distribution.

Comment on lines +107 to +113
def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
    if delay_dist == "constant":
        delays = [mean_time_bet_reqs] * num_reqs
    elif delay_dist == "exponential_dist":
        delays = get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)

    return delays

⚠️ Potential issue | 🟠 Major

Handle unsupported delay_dist values.

If delay_dist is neither "constant" nor "exponential_dist", the function returns an undefined variable, causing a runtime error.

Apply this diff:

 def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
     if delay_dist == "constant":
         delays = [mean_time_bet_reqs] * num_reqs
     elif delay_dist == "exponential_dist":
         delays = get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)
+    else:
+        raise ValueError(f"Unsupported delay_dist: {delay_dist}. Must be 'constant' or 'exponential_dist'.")
 
     return delays
🤖 Prompt for AI Agents
tensorrt_llm/bench/dataset/utils.py around lines 107 to 113: the function can
leave 'delays' undefined if delay_dist is neither "constant" nor
"exponential_dist"; add an explicit else branch that raises a clear ValueError
(or TypeError) listing supported values (e.g., "constant", "exponential_dist")
so callers get a meaningful error instead of an UnboundLocalError; ensure the
error message includes the received delay_dist value for easier debugging.
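The fixed helper can be sketched with the standard library; `get_exponential_dist_delays` is assumed here to draw inter-arrival gaps from an exponential distribution with the given mean (the real utils.py version likely uses numpy). The explicit `else` branch is what prevents the `UnboundLocalError` described above:

```python
import random


def get_exponential_dist_delays(mean_gap, num_reqs, seed):
    # expovariate takes a rate (1/mean), so the draws average to mean_gap.
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap) for _ in range(num_reqs)]


def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
    if delay_dist == "constant":
        return [mean_time_bet_reqs] * num_reqs
    if delay_dist == "exponential_dist":
        return get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)
    # Fail loudly on unknown values instead of returning an undefined variable.
    raise ValueError(
        f"Unsupported delay_dist: {delay_dist!r}. "
        "Must be 'constant' or 'exponential_dist'."
    )
```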

@FrankD412 FrankD412 changed the title Draft: [TRTLLM-9089] Port prepare_dataset into trtllm-bench Draft: [TRTLLM-9089] Chore: Port prepare_dataset into trtllm-bench Nov 18, 2025
@FrankD412 FrankD412 changed the title Draft: [TRTLLM-9089] Chore: Port prepare_dataset into trtllm-bench Draft: [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench Nov 18, 2025
@FrankD412 FrankD412 requested a review from a team as a code owner November 19, 2025 22:13
@FrankD412 FrankD412 requested review from QiJune and chzblych November 19, 2025 22:13
@FrankD412 FrankD412 requested review from a team as code owners November 19, 2025 23:55
