
Conversation


@FrankD412 FrankD412 commented Nov 18, 2025

Summary by CodeRabbit

  • New Features

    • Added dataset preparation CLI command with configurable tokenizer, output path, and random seed options.
    • New real dataset tool for extracting and preparing HuggingFace datasets with text and multimodal support.
    • New synthetic dataset generators: token lengths drawn from normal and uniform distributions.
    • Added support for LoRA configuration and randomized task IDs across datasets.
  • Chores

    • Updated .gitignore to exclude macOS metadata files.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

Signed-off-by: Frank Di Natale <[email protected]>
@FrankD412 FrankD412 self-assigned this Nov 18, 2025
@FrankD412 FrankD412 requested a review from a team as a code owner November 18, 2025 06:35
@FrankD412 FrankD412 requested a review from nv-yilinf November 18, 2025 06:35

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

This pull request adds dataset preparation infrastructure to the TensorRT LLM benchmarking suite, introducing CLI commands to generate synthetic datasets and prepare real HuggingFace datasets, along with supporting data models and utilities. It also updates the main bench CLI to register the new prepare-dataset command and adds .DS_Store to .gitignore.

Changes

  • Configuration (.gitignore): Adds .DS_Store to ignore macOS Finder metadata files.
  • Dataset Preparation CLI Entry Point (tensorrt_llm/bench/dataset/prepare_dataset.py): Introduces the Click group prepare_dataset with CLI options (output, random-seed, task-id, rand-task-id, lora-dir, log-level, trust-remote-code). Defines the Pydantic model RootArgs for storing parsed arguments, including tokenizer validation and loading. Registers subcommands for real and synthetic (normal- and uniform-distribution) dataset generation.
  • Real Dataset Preparation (tensorrt_llm/bench/dataset/prepare_real_data.py): Adds the real-dataset CLI command with a DatasetConfig model for HuggingFace dataset configuration. Provides utilities to load datasets, handle text-only and multimodal (image) inputs, support configurable output length distributions, and serialize to JSON. Includes validation for prompt key/source specifications.
  • Synthetic Dataset Generation (tensorrt_llm/bench/dataset/prepare_synthetic_data.py): Implements the token_norm_dist and token_unif_dist CLI commands to generate synthetic token datasets using normal and uniform distributions, respectively. Handles task ID generation, LoRA configuration, and structured dataset output with metadata.
  • Dataset Utilities & Models (tensorrt_llm/bench/dataset/utils.py): Defines the TextSample, MultimodalSample, and Workload Pydantic models. Provides serialization (text_dataset_dump, multimodal_dataset_dump), JSON streaming (print_text_dataset, print_multimodal_dataset), distribution sampling (get_norm_dist_lengths, get_unif_dist_lengths, get_exponential_dist_delays), and random-token generation utilities with deterministic seeding.
  • CLI Registration (tensorrt_llm/commands/bench.py): Imports and registers prepare_dataset as a subcommand in the trtllm-bench CLI group.
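As context for the CLI wiring summarized above, here is a minimal, self-contained sketch of how a Click group with options and a subcommand can be registered on a parent group. All names, options, and defaults below are illustrative stand-ins, not the actual TensorRT-LLM code.

```python
import click


@click.group(name="prepare-dataset")
@click.option("--output", default="dataset.json", help="Output JSON path.")
@click.option("--random-seed", default=420, type=int, help="RNG seed.")
@click.pass_context
def prepare_dataset(ctx, output, random_seed):
    """Prepare a benchmarking dataset (hypothetical sketch)."""
    # Stash parsed group-level options for subcommands to consume.
    ctx.obj = {"output": output, "random_seed": random_seed}


@prepare_dataset.command(name="token-norm-dist")
@click.option("--input-mean", required=True, type=int)
@click.pass_obj
def token_norm_dist(root, input_mean):
    # Subcommand reads the group-level context object.
    click.echo(f"seed={root['random_seed']} input_mean={input_mean}")


@click.group()
def bench():
    """Top-level CLI group, standing in for trtllm-bench."""


bench.add_command(prepare_dataset)
```

Invoked as `bench prepare-dataset --random-seed 7 token-norm-dist --input-mean 128`; group options must precede the subcommand name.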

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PrepareDataset as prepare_dataset<br/>(CLI Group)
    participant RootArgs as RootArgs<br/>(Validation)
    participant Subcommands as Subcommands<br/>(real-dataset,<br/>token_norm_dist,<br/>token_unif_dist)
    participant Utils as Dataset Utils<br/>(Serialization &<br/>Sampling)
    participant Output as Output File

    User->>PrepareDataset: Run CLI with options
    PrepareDataset->>RootArgs: Parse CLI args & build model
    activate RootArgs
    RootArgs->>RootArgs: Load & validate tokenizer
    RootArgs-->>PrepareDataset: Return validated args
    deactivate RootArgs
    
    PrepareDataset->>Subcommands: Invoke subcommand<br/>(real-dataset or synthetic)
    activate Subcommands
    alt Real Dataset Path
        Subcommands->>Subcommands: Load HF dataset<br/>& DatasetConfig
        Subcommands->>Utils: Generate samples<br/>(text or multimodal)
    else Synthetic Dataset Path
        Subcommands->>Utils: Sample lengths from<br/>distribution
        Subcommands->>Utils: Generate random tokens
    end
    Subcommands->>Utils: Serialize Workload<br/>(TextSample/MultimodalSample)
    Utils->>Output: Write JSON to file
    Subcommands-->>User: Complete
    deactivate Subcommands

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • prepare_real_data.py: HuggingFace dataset loading and multimodal (image) handling logic with PIL Image to file conversion; validation of dataset configuration options
  • utils.py: Deterministic randomization logic across multiple distribution functions; JSON serialization via Pydantic models; handling of both text and multimodal sample paths
  • prepare_synthetic_data.py: Token generation logic ensuring EOS token exclusion; task ID and LoRA configuration derivation
  • prepare_dataset.py: Tokenizer loading and validation in Pydantic model hook; context object wiring across CLI subcommands
  • Cross-file integration: Interaction between RootArgs, subcommands, and utility functions; correct threading of metadata, seeds, and configuration through the pipeline
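The deterministic length-sampling helpers flagged above can be sketched with the standard library alone (the real utils.py implementations likely use numpy; the bodies below are assumptions that only mirror the described behavior and names):

```python
import random


def get_norm_dist_lengths(mean, stdev, num_reqs, seed):
    # Deterministic seeding via a private Random instance, per the review notes.
    rng = random.Random(seed)
    # Clamp to at least 1 token; a normal draw can be zero or negative.
    return [max(1, round(rng.gauss(mean, stdev))) for _ in range(num_reqs)]


def get_unif_dist_lengths(low, high, num_reqs, seed):
    rng = random.Random(seed)
    # randint is inclusive on both ends, matching the "(inclusive)" help text.
    return [rng.randint(low, high) for _ in range(num_reqs)]
```

Reusing the same seed reproduces the same length list, which is what makes benchmark datasets repeatable.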

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ⚠️ Warning: The PR description is entirely empty; required sections (title format, description, test coverage) are not filled in, only the template structure remains. Fill in the PR title following the template (e.g., [TRTLLM-9089][feat] Port prepare_dataset into trtllm-bench), add a clear description of changes, and document test coverage for the new dataset preparation features.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly describes the main change: porting the prepare_dataset functionality into the trtllm-bench CLI command, which aligns with the file additions and integrations shown in the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (8)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)

34-49: Improve exception chaining for better debugging.

The exception handling would benefit from using raise ... from e to preserve the original traceback context.

Apply this diff:

         except EnvironmentError as e:
-            raise ValueError(
+            raise ValueError(
                 "Cannot find a tokenizer from the given string because of "
                 f"{e}\nPlease set tokenizer to the directory that contains "
                 "the tokenizer, or set to a model name in HuggingFace."
-            )
+            ) from e
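A standalone illustration of why `raise ... from e` matters: the original exception survives as `__cause__`, so tracebacks show both the root failure and the re-raised error. The function name and messages here are illustrative, not the actual prepare_dataset code:

```python
def load_tokenizer(name):
    # Hypothetical stand-in for a tokenizer lookup that fails.
    try:
        raise EnvironmentError(f"no tokenizer files for {name!r}")
    except EnvironmentError as e:
        # Chaining preserves the original traceback on the new exception.
        raise ValueError(
            "Cannot find a tokenizer from the given string. Please set "
            "tokenizer to a directory or a HuggingFace model name."
        ) from e
```

Without `from e`, the original EnvironmentError is shown as "another exception occurred during handling", which reads as a bug in the error handler rather than the root cause.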
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (3)

15-38: Add stacklevel to warning for better debugging.

The warning should include stacklevel=2 to point to the caller's code rather than this helper function.

Apply this diff:

     if use_task_ids and not use_lora:
         warnings.warn(
-            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.", UserWarning
+            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.", 
+            UserWarning, 
+            stacklevel=2
         )
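A runnable illustration of the `stacklevel` suggestion, using an assumed plain-function version of the check (the real code lives in a Click command; this shape is hypothetical):

```python
import warnings


def check_task_ids(use_task_ids, use_lora):
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this helper, which makes the warning actionable.
    if use_task_ids and not use_lora:
        warnings.warn(
            "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.",
            UserWarning,
            stacklevel=2,
        )
```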

50-52: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state.

Apply this diff:

-    input_ids = []
-    input_lens = []
-    output_lens = []
-
     input_lens = get_norm_dist_lengths(

110-112: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state.

Apply this diff:

-    input_ids = []
-    input_lens = []
-    output_lens = []
-
     input_lens = get_unif_dist_lengths(
tensorrt_llm/bench/dataset/prepare_real_data.py (2)

21-32: Use Click's BadParameter exception for better error reporting.

Using click.BadParameter provides better integration with Click's error handling and gives users clearer feedback about which parameter failed.

Apply this diff:

 def validate_output_len_dist(ctx, param, value):
     """Validate the --output-len-dist option."""
     if value is None:
         return value
     m = re.match(r"(\d+),(\d+)", value)
     if m:
         return int(m.group(1)), int(m.group(2))
     else:
-        raise AssertionError(
+        raise click.BadParameter(
             "Incorrect specification for --output-len-dist. Correct format: "
             "--output-len-dist <output_len_mean>,<output_len_stdev>"
         )
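A plain-function sketch of the same parsing logic, runnable outside Click. Note the real callback uses `re.match`; `re.fullmatch` below is a small assumed improvement that also rejects trailing garbage like "12,34x":

```python
import re


def parse_output_len_dist(value):
    """Parse "<mean>,<stdev>" into an int pair, or pass None through."""
    if value is None:
        return None
    m = re.fullmatch(r"(\d+),(\d+)", value)
    if not m:
        raise ValueError(
            "Incorrect specification for --output-len-dist. Correct format: "
            "--output-len-dist <output_len_mean>,<output_len_stdev>"
        )
    return int(m.group(1)), int(m.group(2))
```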

55-61: Use ValueError instead of AssertionError in validator.

Validators should raise ValueError for invalid data rather than AssertionError, which is typically reserved for internal invariants.

Apply this diff:

     @model_validator(mode="after")
     def check_prompt(self) -> "DatasetConfig":
         if self.prompt_key and self.prompt:
-            raise AssertionError("--prompt-key and --prompt cannot be set at the same time.")
+            raise ValueError("--prompt-key and --prompt cannot be set at the same time.")
         if (not self.prompt_key) and (not self.prompt):
-            raise AssertionError("Either --prompt-key or --prompt must be set.")
+            raise ValueError("Either --prompt-key or --prompt must be set.")
         return self
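The validator logic as a plain function shows why `ValueError` is preferable here: `assert` statements are stripped entirely under `python -O`, silently disabling the check, while a raised `ValueError` always fires. This standalone shape is illustrative, not the Pydantic model itself:

```python
def check_prompt(prompt_key, prompt):
    # Exactly one of the two sources must be provided.
    if prompt_key and prompt:
        raise ValueError("--prompt-key and --prompt cannot be set at the same time.")
    if (not prompt_key) and (not prompt):
        raise ValueError("Either --prompt-key or --prompt must be set.")
    return prompt_key or prompt
```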
tensorrt_llm/bench/dataset/utils.py (2)

29-40: Consider using Pydantic's standard patterns.

The custom __init__ could be replaced with a @model_validator(mode='after') or model_post_init hook for better alignment with Pydantic conventions.

Example using @model_validator:

 class Workload(BaseModel):
     metadata: dict
     samples: List[Union[TextSample, MultimodalSample]] = []
 
-    def __init__(self, **kwargs) -> None:
-        super().__init__(**kwargs)
-        self.setup_workload_name()
-
-    def setup_workload_name(self):
+    @model_validator(mode='after')
+    def setup_workload_name(self):
         # Keys to ignore
         ignore_keys = ["tokenizer"]
         # Create a string by concatenating keys and values with "__"
         workload_name = "__".join(
             f"{key}:{value}" for key, value in self.metadata.items() if key not in ignore_keys
         )
         self.metadata.setdefault("workload_name", workload_name)
+        return self

96-104: Add strict parameter to zip for safety.

Without strict=True, mismatched list lengths would be silently truncated, potentially hiding bugs.

Apply this diff:

 def print_multimodal_dataset(multimodal_texts, multimodal_image_paths, output_lens):
-    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths)):
+    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths, strict=True)):
         d = {
             "task_id": i,
             "prompt": text,
             "media_paths": image_paths,
             "output_tokens": output_lens[i],
         }
         yield json.dumps(d, separators=(",", ":"), ensure_ascii=False)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96cfdd8 and 5910dfe.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_dataset.py (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_real_data.py (1 hunks)
  • tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1 hunks)
  • tensorrt_llm/bench/dataset/utils.py (1 hunks)
  • tensorrt_llm/commands/bench.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tensorrt_llm/bench/dataset/prepare_dataset.py
  • tensorrt_llm/bench/dataset/utils.py
  • tensorrt_llm/bench/dataset/prepare_synthetic_data.py
🧬 Code graph analysis (4)
tensorrt_llm/commands/bench.py (1)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)
  • prepare_dataset (73-84)
tensorrt_llm/bench/dataset/prepare_dataset.py (4)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
  • real_dataset (209-314)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (2)
  • token_norm_dist (48-90)
  • token_unif_dist (108-150)
tensorrt_llm/builder.py (1)
  • default (45-50)
tensorrt_llm/bench/benchmark/__init__.py (1)
  • checkpoint_path (75-76)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
tensorrt_llm/bench/dataset/utils.py (3)
  • get_norm_dist_lengths (122-126)
  • print_multimodal_dataset (96-104)
  • print_text_dataset (79-93)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1)
tensorrt_llm/bench/dataset/utils.py (5)
  • gen_random_tokens (136-157)
  • get_norm_dist_lengths (122-126)
  • get_unif_dist_lengths (129-133)
  • print_text_dataset (79-93)
  • text_dataset_dump (43-57)
🪛 Ruff (0.14.5)
tensorrt_llm/bench/dataset/prepare_dataset.py

41-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


41-45: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/bench/dataset/utils.py

97-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


142-142: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

tensorrt_llm/bench/dataset/prepare_real_data.py

21-21: Unused function argument: ctx

(ARG001)


21-21: Unused function argument: param

(ARG001)


29-32: Avoid specifying long messages outside the exception class

(TRY003)


58-58: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


109-115: Avoid specifying long messages outside the exception class

(TRY003)


150-150: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


247-247: Avoid specifying long messages outside the exception class

(TRY003)


268-268: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

tensorrt_llm/bench/dataset/prepare_synthetic_data.py

21-21: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


30-30: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
.gitignore (1)

88-90: ✓ Sensible addition to ignore macOS system files.

Adding .DS_Store to .gitignore is a standard best practice that prevents macOS Finder metadata from being committed. The comment clearly delineates the macOS-specific section, and the placement is logical.

tensorrt_llm/commands/bench.py (1)

9-9: LGTM! Clean integration of the new dataset preparation command.

The import and command registration follow the existing pattern consistently.

Also applies to: 61-61

tensorrt_llm/bench/dataset/prepare_dataset.py (2)

25-32: LGTM! Well-structured data model.

The RootArgs model cleanly captures all necessary configuration parameters for dataset preparation.


87-89: LGTM! Command registration follows Click conventions.

tensorrt_llm/bench/dataset/utils.py (2)

43-76: LGTM! Clean serialization functions.

Both dump functions properly handle directory creation and JSON serialization.


136-157: LGTM! Robust random token generation.

The function correctly handles EOS token exclusion and uses deterministic seeding for reproducibility. Based on learnings, this aligns with TensorRT-LLM's approach of prioritizing sequence length accuracy over semantic correctness for benchmarking.
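The EOS-exclusion idea can be sketched with the standard library; the actual `gen_random_tokens` signature and sampling strategy are assumptions here, kept only to show re-drawing around a forbidden token id with deterministic seeding:

```python
import random


def gen_random_tokens(num_tokens, vocab_size, eos_id, seed):
    # Sample ids uniformly from [0, vocab_size); re-draw any hit on the EOS
    # id so the generated sequence cannot terminate early. A fixed seed
    # makes the output reproducible across runs.
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < num_tokens:
        tok = rng.randrange(vocab_size)
        if tok != eos_id:
            tokens.append(tok)
    return tokens
```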

Comment on lines 73 to 84
def prepare_dataset(ctx, **kwargs):
    """Prepare dataset for benchmarking with trtllm-bench."""
    model = ctx.obj.model or ctx.obj.checkpoint_path
    ctx.obj = RootArgs(
        tokenizer=model,
        output=kwargs["output"],
        random_seed=kwargs["random_seed"],
        task_id=kwargs["task_id"],
        rand_task_id=kwargs["rand_task_id"],
        lora_dir=kwargs["lora_dir"],
        trust_remote_code=kwargs["trust_remote_code"],
    )

⚠️ Potential issue | 🔴 Critical

Critical: Missing std_out field in RootArgs model.

The prepare_synthetic_data.py commands reference root_args.std_out (lines 70 and 130), but the RootArgs model doesn't define this field. This will cause an AttributeError at runtime.

Add the missing field to the RootArgs model:

 class RootArgs(BaseModel):
     tokenizer: str
     output: str
     random_seed: int
     task_id: int
     trust_remote_code: bool = False
     rand_task_id: Optional[Tuple[int, int]]
     lora_dir: Optional[str] = None
+    std_out: bool = False

And add the corresponding CLI option:

 @click.option(
     "--trust-remote-code",
     is_flag=True,
     default=False,
     envvar="TRUST_REMOTE_CODE",
     help="Trust remote code.",
 )
+@click.option(
+    "--std-out",
+    is_flag=True,
+    default=False,
+    help="Output to stdout instead of file.",
+)
 @click.pass_context
 def prepare_dataset(ctx, **kwargs):

And pass it to RootArgs:

     ctx.obj = RootArgs(
         tokenizer=model,
         output=kwargs["output"],
         random_seed=kwargs["random_seed"],
         task_id=kwargs["task_id"],
         rand_task_id=kwargs["rand_task_id"],
         lora_dir=kwargs["lora_dir"],
         trust_remote_code=kwargs["trust_remote_code"],
+        std_out=kwargs["std_out"],
     )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_dataset.py around lines 73-84,
prepare_dataset constructs RootArgs but omits the std_out field referenced
elsewhere, causing AttributeError; add a std_out attribute to the RootArgs model
definition (where RootArgs/dataclass is declared), add a corresponding CLI
option (e.g., --std-out / std_out) to the CLI that builds kwargs so it is
captured from user input, and pass that value into RootArgs here
(std_out=kwargs["std_out"]) so root_args.std_out is available to
prepare_synthetic_data.py.

Comment on lines +145 to +150
    except ValueError as e:
        if "Config" in e:
            e += "\n Please add the config name to the dataset config yaml."
        elif "split" in e:
            e += "\n Please specify supported split in the dataset config yaml."
        raise ValueError(e)

⚠️ Potential issue | 🔴 Critical

Critical: Fix exception handling logic.

The code attempts to concatenate strings to an exception object, which will fail. The error message should be constructed properly and re-raised with proper exception chaining.

Apply this diff:

     except ValueError as e:
+        error_msg = str(e)
         if "Config" in str(e):
-            e += "\n Please add the config name to the dataset config yaml."
+            error_msg += "\n Please add the config name to the dataset config yaml."
         elif "split" in str(e):
-            e += "\n Please specify supported split in the dataset config yaml."
-        raise ValueError(e)
+            error_msg += "\n Please specify supported split in the dataset config yaml."
+        raise ValueError(error_msg) from e
🧰 Tools
🪛 Ruff (0.14.5)

150-150: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_real_data.py around lines 145 to 150, the
except block incorrectly tries to concatenate strings to the exception object (e
+= ...), which will raise a TypeError; instead, build a new message string from
the original exception text (e.g., msg = str(e) + " ..."), append the
appropriate guidance for "Config" or "split", then raise a new ValueError(msg)
while preserving exception chaining using "raise ValueError(msg) from e".

Comment on lines +224 to +230
        if any(key in req for key in ["image", "image_1", "video"]):
            # multimodal input
            if "video" in req and req["video"] is not None:
                assert "Not supported yet"
            assert kwargs["output_len_dist"] is not None, (
                "Output length distribution must be set for multimodal requests."
            )

⚠️ Potential issue | 🟠 Major

Replace assertions with proper exceptions.

Using bare assert statements with string messages and for input validation is incorrect. These should be proper exceptions.

Apply this diff:

         if any(key in req for key in ["image", "image_1", "video"]):
             # multimodal input
             if "video" in req and req["video"] is not None:
-                assert "Not supported yet"
-            assert kwargs["output_len_dist"] is not None, (
-                "Output length distribution must be set for multimodal requests."
-            )
+                raise NotImplementedError("Video modality is not supported yet")
+            if kwargs["output_len_dist"] is None:
+                raise ValueError("Output length distribution must be set for multimodal requests.")
🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_real_data.py around lines 224 to 230,
replace the bare assert statements used for input validation with proper
exceptions: when "video" is present and not supported, raise
NotImplementedError("Video input not supported yet") instead of assert "Not
supported yet"; and when kwargs["output_len_dist"] is required but missing or
None, raise ValueError("Output length distribution must be set for multimodal
requests.") (also guard with kwargs.get("output_len_dist") is None to avoid
KeyError). Ensure the exceptions provide the same messages and stop execution
cleanly.

Comment on lines +95 to +106
@click.option(
    "--input-min", required=True, type=int, help="uniform dist (inclusive) min for input tokens"
)
@click.option(
    "--input-max", required=True, type=int, help="normal dist (inclusive) max for input tokens"
)
@click.option(
    "--output-min", required=True, type=int, help="normal dist (inclusive) min for output tokens"
)
@click.option(
    "--output-max", required=True, type=int, help="normal dist (inclusive) max for output tokens"
)

⚠️ Potential issue | 🟡 Minor

Fix copy-paste errors in help text.

The help text incorrectly says "normal dist" instead of "uniform dist" for the uniform distribution command.

Apply this diff:

 @click.option(
     "--input-min", required=True, type=int, help="uniform dist (inclusive) min for input tokens"
 )
 @click.option(
-    "--input-max", required=True, type=int, help="normal dist (inclusive) max for input tokens"
+    "--input-max", required=True, type=int, help="uniform dist (inclusive) max for input tokens"
 )
 @click.option(
-    "--output-min", required=True, type=int, help="normal dist (inclusive) min for output tokens"
+    "--output-min", required=True, type=int, help="uniform dist (inclusive) min for output tokens"
 )
 @click.option(
-    "--output-max", required=True, type=int, help="normal dist (inclusive) max for output tokens"
+    "--output-max", required=True, type=int, help="uniform dist (inclusive) max for output tokens"
 )
🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataset/prepare_synthetic_data.py around lines 95 to 106,
the click.option help strings were copy-pasted and incorrectly label some
uniform-distribution flags as "normal dist"; update the help text for
--input-max, --output-min, and --output-max to say "uniform dist (inclusive)
..." matching --input-min (use "uniform dist (inclusive) max for input tokens"
for --input-max and "uniform dist (inclusive) min/max for output tokens" for
--output-min and --output-max respectively) so all four flags consistently
describe a uniform distribution.

Comment on lines +107 to +113
def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
    if delay_dist == "constant":
        delays = [mean_time_bet_reqs] * num_reqs
    elif delay_dist == "exponential_dist":
        delays = get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)

    return delays

⚠️ Potential issue | 🟠 Major

Handle unsupported delay_dist values.

If delay_dist is neither "constant" nor "exponential_dist", the function returns an undefined variable, causing a runtime error.

Apply this diff:

 def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
     if delay_dist == "constant":
         delays = [mean_time_bet_reqs] * num_reqs
     elif delay_dist == "exponential_dist":
         delays = get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)
+    else:
+        raise ValueError(f"Unsupported delay_dist: {delay_dist}. Must be 'constant' or 'exponential_dist'.")
 
     return delays
🤖 Prompt for AI Agents
tensorrt_llm/bench/dataset/utils.py around lines 107 to 113: the function can
leave 'delays' undefined if delay_dist is neither "constant" nor
"exponential_dist"; add an explicit else branch that raises a clear ValueError
(or TypeError) listing supported values (e.g., "constant", "exponential_dist")
so callers get a meaningful error instead of an UnboundLocalError; ensure the
error message includes the received delay_dist value for easier debugging.
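The fixed helper can be sketched with the standard library; `get_exponential_dist_delays` is assumed here to draw inter-arrival gaps from an exponential distribution with the given mean (the real utils.py version likely uses numpy). The explicit `else` branch is what prevents the `UnboundLocalError` described above:

```python
import random


def get_exponential_dist_delays(mean_gap, num_reqs, seed):
    # expovariate takes a rate (1/mean), so the draws average to mean_gap.
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap) for _ in range(num_reqs)]


def get_list_of_delays(delay_dist, mean_time_bet_reqs, num_reqs, random_seed):
    if delay_dist == "constant":
        return [mean_time_bet_reqs] * num_reqs
    if delay_dist == "exponential_dist":
        return get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed)
    # Fail loudly on unknown values instead of returning an undefined variable.
    raise ValueError(
        f"Unsupported delay_dist: {delay_dist!r}. "
        "Must be 'constant' or 'exponential_dist'."
    )
```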

@FrankD412 FrankD412 changed the title Draft: [TRTLLM-9089] Port prepare_dataset into trtllm-bench Draft: [TRTLLM-9089] Chore: Port prepare_dataset into trtllm-bench Nov 18, 2025
@FrankD412 FrankD412 changed the title Draft: [TRTLLM-9089] Chore: Port prepare_dataset into trtllm-bench Draft: [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench Nov 18, 2025
@FrankD412 FrankD412 requested a review from a team as a code owner November 19, 2025 22:13
@FrankD412 FrankD412 requested review from QiJune and chzblych November 19, 2025 22:13
@FrankD412 FrankD412 requested review from a team as code owners November 19, 2025 23:55
