
Conversation

@galagam
Collaborator

@galagam galagam commented Oct 22, 2025

Summary by CodeRabbit

  • New Features
    • Benchmark reports now include MACHINE DETAILS with primary GPU info (index, name, total memory, memory clock when available).
    • Console/log output embeds machine details before other statistics for clearer diagnostics.
    • Automatic, graceful fallback: reports still generate and explicitly indicate when GPU details cannot be retrieved.

Description

Add printing of device information to the trtllm-bench report.
Prints GPU name, memory size (GB), and memory clock (GHz).
Assumes a homogeneous setup and queries only the first device.
e.g.

===========================================================
= MACHINE DETAILS 
===========================================================
NVIDIA H100 80GB HBM3, memory 79.11 GB, 2.62 GHz

@galagam galagam self-assigned this Oct 22, 2025
@galagam galagam requested a review from a team as a code owner October 22, 2025 12:36
@galagam galagam requested a review from FrankD412 October 22, 2025 12:36
@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

📝 Walkthrough

Walkthrough

Adds optional GPU querying to reporting: new _query_gpu_info() static methods in StatsKeeper and ReportUtility collect first-GPU details (name, memory.total, memory clock) using CUDA_VISIBLE_DEVICES, torch, and pynvml when available, with safe fallbacks and non-fatal per-GPU error handling. Machine info is added to the statistics output.

Changes

Cohort / File(s) Change Summary
GPU Information Querying & Reporting
tensorrt_llm/bench/dataclasses/reporting.py
Added _query_gpu_info() static methods to StatsKeeper and ReportUtility to obtain first-GPU details (index, name, memory.total, memory clocks) using guarded imports of torch and optional pynvml, honoring CUDA_VISIBLE_DEVICES with fallback to index 0. Integrated machine info into get_statistics_dict() and report_statistics() output; per-GPU errors are caught and mapped to missing values. Updated imports and logging composition to include machine block.
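The "guarded imports" mentioned in this summary typically follow a standard optional-dependency pattern. A minimal sketch (the exact form in the PR may differ):

```python
# Minimal sketch of the guarded optional-import pattern described above.
# pynvml is treated as optional; if absent, clock queries are simply skipped.
try:
    import pynvml  # optional NVML bindings, used only for memory-clock info
except ImportError:
    pynvml = None  # callers gate NVML calls on `if pynvml:`
```

Downstream code then checks `if pynvml:` before touching NVML, so a missing dependency degrades to a None clock value rather than raising at import time.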

Sequence Diagram(s)

sequenceDiagram
  participant Reporter as report_statistics
  participant Stats as get_statistics_dict
  participant Query as _query_gpu_info
  participant Env as CUDA_VISIBLE_DEVICES / os
  participant Torch as torch (optional)
  participant NVML as pynvml (optional)

  Reporter->>Stats: request statistics
  Stats->>Query: query machine/gpu info
  Query->>Env: read CUDA_VISIBLE_DEVICES (select GPU index)
  alt torch available
    Query->>Torch: query device count / properties
  else
    Query-->>Query: skip torch checks
  end
  alt pynvml available
    Query->>NVML: init & read memory clocks
    NVML-->>Query: memory clock value
  else
    Query-->>Query: no NVML, skip clocks
  end
  Query-->>Stats: return machine_info (or None)
  Stats-->>Reporter: include machine_info in statistics dict
  Reporter-->>Reporter: format and append machine block to log output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to guarded imports and their unit-testability.
  • Verify correctness of CUDA_VISIBLE_DEVICES parsing and GPU index selection.
  • Confirm NVML initialization/cleanup and error handling do not leak resources.
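For the second point above, the CUDA_VISIBLE_DEVICES parsing and index selection can be sketched as a pure function. The helper name first_visible_gpu_index is hypothetical, chosen for illustration:

```python
import os


def first_visible_gpu_index(env=None):
    """Return the physical index of the first visible GPU.

    Falls back to index 0 when CUDA_VISIBLE_DEVICES is unset or empty.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    return int(visible.split(",")[0].strip()) if visible else 0
```

Note that int() raises ValueError on non-numeric entries (CUDA also accepts GPU UUIDs in this variable), which is one reason to wrap the whole query in a try/except as the PR does.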

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description includes a well-written "Description" section that clearly explains what was added (device information printing), what details are captured (GPU name, memory size, memory clock), and documents the assumptions (homogeneous setup, first device only) with a concrete example. However, the description is missing the "Test Coverage" section, which should list relevant test(s) that safeguard the changes, and the "PR Checklist" section shows no items checked off. While the technical content under "Description" is complete and informative, these two template sections are important for understanding validation and ensuring the author reviewed critical items before submission. Add the missing "Test Coverage" section to document what tests safeguard these changes (new GPU info querying methods and reporting integration), and complete the "PR Checklist" by reviewing and checking off the applicable items to confirm adherence to coding guidelines, test coverage, dependencies, and other review requirements. This ensures the PR has appropriate validation and the author has verified the necessary pre-submission items.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "[None][chore] Print device info in trtllm-bench report" follows the required template format with [None] as the ticket reference and [chore] as the type. The summary is concise, clear, and directly describes the main change of the PR—adding device information printing to the trtllm-bench report. The title matches well with the raw summary showing additions to GPU information querying and machine details in the reporting module. The phrasing is readable and specific enough for a developer scanning history to understand the primary objective.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

1-22: Add the required NVIDIA Apache-2.0 copyright header with current year (2025) to the top of the file.

Per coding guidelines, all Python source files must include the NVIDIA Apache-2.0 copyright header with the current year at the top. The file currently starts directly with imports, missing this required header.

Add the following header as the first lines of the file:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 910e6b9 and d901997.

📒 Files selected for processing (1)
  • tensorrt_llm/bench/dataclasses/reporting.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
🧬 Code graph analysis (1)
tensorrt_llm/bench/dataclasses/reporting.py (2)
tensorrt_llm/profiler.py (1)
  • PyNVMLContext (113-120)
tensorrt_llm/bench/dataclasses/configuration.py (1)
  • world_size (194-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/bench/dataclasses/reporting.py (3)

206-263: LGTM! Robust error handling for GPU querying.

The implementation correctly:

  • Handles missing pynvml gracefully
  • Uses PyNVMLContext for proper NVML lifecycle management
  • Implements per-GPU error handling to avoid failing the entire query if one GPU is inaccessible
  • Handles byte string decoding for GPU names
  • Returns structured data with clear error notes

753-761: LGTM! Machine info properly integrated into the report.

The machine information is correctly added to the logging output in the appropriate position.


65-65: LGTM! Documentation improvements.

The updated comments and docstrings improve clarity.

Also applies to: 123-124

@galagam galagam requested a review from nzmora-nvidia October 22, 2025 12:46
@nzmora-nvidia
Copy link
Collaborator

nzmora-nvidia commented Oct 22, 2025

Hi @galagam ,

Using the torch.cuda API will simplify the code

>>> torch.cuda.get_device_properties()
_CudaDeviceProperties(name='NVIDIA H100 80GB HBM3', major=9, minor=0, total_memory=81007MB, multi_processor_count=132, uuid=c37bded2-9879-50b8-e3a8-77b5c60d9a22, pci_bus_id=25, pci_device_id=0, pci_domain_id=0, L2_cache_size=50MB)
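The suggested API could be used roughly as follows. This is a sketch, not the PR's implementation; describe_gpu is a hypothetical helper name, and both the import and CUDA availability are guarded so the function degrades to None on machines without a GPU:

```python
def describe_gpu(index=0):
    """Return name and total memory (GB) for one GPU, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    return {
        "name": props.name,
        "memory.total": props.total_memory / (1024.0 ** 3),  # bytes -> GB
    }
```

One limitation relevant to this thread: torch does not expose the memory clock, so NVML is still needed for that field.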

@galagam galagam marked this pull request as draft October 22, 2025 17:08
@galagam galagam force-pushed the user/galagam/bench-report-add-devices-info branch from d901997 to c721f03 Compare October 29, 2025 16:54
@galagam galagam marked this pull request as ready for review October 29, 2025 16:54
@galagam galagam force-pushed the user/galagam/bench-report-add-devices-info branch from d0743c7 to bb3e851 Compare November 2, 2025 09:21
@galagam galagam changed the title [None][chore] Print devices info in trtllm-bench report [None][chore] Print device info in trtllm-bench report Nov 2, 2025
@galagam
Collaborator Author

galagam commented Nov 2, 2025

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Nov 2, 2025

✅ Actions performed

Full review triggered.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tensorrt_llm/bench/dataclasses/reporting.py (3)

209-236: Consider adding debug logging for GPU query failures.

The method silently returns None on exceptions, which may complicate debugging when GPU information is unavailable. While the broad exception handling is acceptable for non-critical functionality, adding a debug-level log message would help operators understand why machine details are missing.

Apply this diff to add logging:

+from tensorrt_llm.logger import logger
+
 @staticmethod
 def _query_gpu_info() -> Dict[str, Any]:
     """Query first GPU info (all GPUs must be identical for TRT-LLM)."""
     if not torch.cuda.is_available():
+        logger.debug("CUDA not available, skipping GPU info query")
         return None
 
     try:
         cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").strip()
         physical_idx = int(
             cuda_visible.split(",")[0].strip()) if cuda_visible else 0
 
         props = torch.cuda.get_device_properties(physical_idx)
         gpu_info = {
             "name":
             getattr(props, "name", "Unknown"),
             "memory.total":
             float(getattr(props, "total_memory", 0.0)) / (1024.0**3),
             "clocks.mem":
             None,
         }
         if pynvml:
             # Memory clock information is not reported by torch, using NVML instead
             handle = pynvml.nvmlDeviceGetHandleByIndex(physical_idx)
             gpu_info["clocks.mem"] = pynvml.nvmlDeviceGetMaxClockInfo(
                 handle, pynvml.NVML_CLOCK_MEM) / 1000.0
         return gpu_info
     except (RuntimeError, AssertionError):
+        logger.debug("Failed to query GPU info", exc_info=True)
         return None

229-236: Optional: Restructure exception handling per static analysis hint.

The static analysis tool suggests moving the success-path return statement into an else block to make the control flow more explicit. This is a stylistic preference that can improve code clarity.

Apply this diff:

         if pynvml:
             # Memory clock information is not reported by torch, using NVML instead
             handle = pynvml.nvmlDeviceGetHandleByIndex(physical_idx)
             gpu_info["clocks.mem"] = pynvml.nvmlDeviceGetMaxClockInfo(
                 handle, pynvml.NVML_CLOCK_MEM) / 1000.0
-        return gpu_info
-    except (RuntimeError, AssertionError):
-        return None
+    except (RuntimeError, AssertionError):
+        return None
+    else:
+        return gpu_info

211-211: Consider adding validation or enhancing documentation for the GPU homogeneity assumption.

The docstring documents that "all GPUs must be identical for TRT-LLM," but the code only queries the first GPU without validating that other GPUs match. For heterogeneous setups, consider either:

  1. Enhancing the docstring to explicitly warn this is a hard requirement (with implications if violated), or
  2. Adding a lightweight check that validates all visible GPUs match and logs a warning if they don't

No GPU homogeneity validation was found elsewhere in the codebase, so this reporting method is the only place documenting the assumption to users.
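Option 2 could be as lightweight as comparing device names across visible GPUs. A pure-function sketch, with the hypothetical name is_homogeneous:

```python
def is_homogeneous(gpu_names):
    """True when every visible GPU reports the same name.

    `gpu_names` would be collected from each visible device's reported
    name; an empty list is treated as trivially homogeneous.
    """
    return len(set(gpu_names)) <= 1
```

The reporting path could then log a warning when this returns False instead of silently describing only device 0.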

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da73410 and bb3e851.

📒 Files selected for processing (1)
  • tensorrt_llm/bench/dataclasses/reporting.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
🧠 Learnings (6)
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.14.2)
tensorrt_llm/bench/dataclasses/reporting.py

234-234: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (3)
tensorrt_llm/bench/dataclasses/reporting.py (3)

4-4: LGTM: Import additions are well-structured.

The unconditional torch import is appropriate since the method checks torch.cuda.is_available() at runtime. The optional pynvml import with graceful fallback is a good pattern for non-critical functionality.

Also applies to: 8-8, 10-13


313-314: LGTM: Clean integration of machine details.

The machine details are properly integrated into the statistics dictionary with a clear comment explaining the single-GPU query approach.


570-582: LGTM: Proper handling of None values in formatting.

The conditional formatting correctly handles None values for memory and clock fields, avoiding the TypeError that would occur if .2f were applied to a string. This addresses the previous review feedback.
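The None-safe formatting pattern described here can be sketched as follows. The helper name is hypothetical, and the field labels follow the example output in the PR description:

```python
def format_machine_line(name, mem_gb, clock_ghz):
    """Build the MACHINE DETAILS line, tolerating missing fields."""
    mem = f"memory {mem_gb:.2f} GB" if mem_gb is not None else "memory N/A"
    clk = f"{clock_ghz:.2f} GHz" if clock_ghz is not None else "clock N/A"
    return f"{name}, {mem}, {clk}"
```

Applying `:.2f` only when the value is a number is what avoids the TypeError the review refers to.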

@galagam
Collaborator Author

galagam commented Nov 6, 2025

@FrankD412 can you please review?

@galagam galagam force-pushed the user/galagam/bench-report-add-devices-info branch from bb3e851 to ac45a24 Compare November 17, 2025 17:47
@galagam galagam enabled auto-merge (squash) November 17, 2025 17:47
@galagam galagam force-pushed the user/galagam/bench-report-add-devices-info branch from ac45a24 to dcfaf04 Compare November 18, 2025 07:24
@galagam
Collaborator Author

galagam commented Nov 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24864 [ run ] triggered by Bot. Commit: dcfaf04

@tensorrt-cicd
Collaborator

PR_Github #24864 [ run ] completed with state SUCCESS. Commit: dcfaf04
/LLM/main/L0_MergeRequest_PR pipeline #18770 completed with status: 'FAILURE'

@galagam
Collaborator Author

galagam commented Nov 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24907 [ run ] triggered by Bot. Commit: dcfaf04

@tensorrt-cicd
Collaborator

PR_Github #24907 [ run ] completed with state SUCCESS. Commit: dcfaf04
/LLM/main/L0_MergeRequest_PR pipeline #18808 completed with status: 'SUCCESS'

@galagam galagam merged commit 36d3d8f into NVIDIA:main Nov 18, 2025
5 checks passed