fix: add hack to support --gpu-telemetry without any arguments #329

ajcasagrande · 2025-10-07T03:14:06Z

Summary by CodeRabbit

New Features
- CLI now auto-fills a default GPU telemetry endpoint when the flag is provided without a value, improving usability.
Bug Fixes
- More reliable behavior when GPU telemetry collectors fail: telemetry is cleanly disabled and the run stops predictably.
Chores
- Updated the command entry point to route through a new launcher; no change to how you invoke the tool.
Style
- Minor formatting cleanup.
Tests
- Added coverage for CLI argument handling and GPU telemetry configuration scenarios.

- Add comprehensive GPU telemetry data collection and monitoring system - Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics - Create GPU telemetry console exporter for CLI display of telemetry data - Add telemetry results post-processor for data analysis and aggregation - Implement comprehensive unit test coverage for all telemetry components - Add integration tests for end-to-end telemetry functionality - Update system controller to integrate GPU telemetry with existing infrastructure - Extend existing exporters to support telemetry data formats - Add comprehensive error handling and async callback support - Include code cleanup, linting fixes, and documentation improvements

…otten

… to records manager

Signed-off-by: ilana-n <[email protected]>

coderabbitai · 2025-10-07T03:14:16Z

Walkthrough

Introduces a new CLI entry point aiperf.main.main that augments argv for --gpu-telemetry when value-less, then delegates to app(). Updates pyproject script to target the new entry. Fixes telemetry_manager start flow by awaiting disable when no collectors start. Adds tests for argv augmentation and telemetry config. Minor CLI whitespace cleanup.

Changes

Cohort / File(s)	Summary
CLI entry point and packaging `aiperf/__main__.py`, `pyproject.toml`	Added main() that injects DEFAULT_DCGM_ENDPOINT when --gpu-telemetry lacks a value and then calls app(); updated [project.scripts] aiperf to point to aiperf.main:main.
Telemetry manager start-flow fix `aiperf/gpu_telemetry/telemetry_manager.py`	Imported CommAddress; corrected _on_start_profiling to await self._disable_telemetry_and_stop(...) when started_count == 0.
CLI minor cleanup `aiperf/cli.py`	Removed a stray blank line; no logic change.
Tests for main entry behavior `tests/test_main.py`	New tests covering --gpu-telemetry argv augmentation and propagation into user_config; patches run_system_controller to inspect inputs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Shell as CLI (aiperf)
  participant Main as aiperf.__main__.main
  participant CLI as aiperf.cli.app
  participant Runner as cli_runner.run_system_controller

  User->>Shell: Invoke "aiperf [args]"
  Shell->>Main: __main__.py main(argv)
  Main->>Main: Inspect argv for --gpu-telemetry
  alt flag present without value or followed by another option
    Main->>Main: Inject DEFAULT_DCGM_ENDPOINT into argv
    note right of Main: Augmented argv
  else flag absent or has value
    Main->>Main: Leave argv unchanged
  end
  Main->>CLI: app(argv)
  CLI->>Runner: run_system_controller(user_config)
  Runner-->>CLI: Exit code
  CLI-->>Main: Exit code
  Main-->>Shell: sys.exit(code)

sequenceDiagram
  autonumber
  participant TM as TelemetryManager
  participant Collectors as GPU Collectors
  participant Stopper as _disable_telemetry_and_stop

  TM->>Collectors: Start all collectors
  Collectors-->>TM: started_count
  alt started_count == 0
    TM->>Stopper: await disable and stop("all collectors failed to start")
    Stopper-->>TM: done
  else started_count > 0
    TM->>TM: proceed with profiling
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I nudge the flags with whiskered care,
A default endpoint fills the air.
If GPUs stay mum, I halt the show—
Await, then stop, and let you know.
New paths to run, tests hop in line,
A tidy burrow, crisp and fine. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The provided title clearly summarizes the primary change by stating that it fixes support for the --gpu-telemetry flag when no argument is given, directly reflecting the main purpose of the patch.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch ajc/gpu-flag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

ajcasagrande · 2025-10-07T03:14:36Z

@CodeRabbit review

coderabbitai · 2025-10-07T03:15:02Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

codecov · 2025-10-07T03:16:02Z

Codecov Report

❌ Patch coverage is 74.15730% with 46 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
aiperf/gpu_telemetry/telemetry_manager.py	12.50%	21 Missing ⚠️
aiperf/records/records_manager.py	11.11%	16 Missing ⚠️
aiperf/controller/system_controller.py	0.00%	4 Missing ⚠️
aiperf/__main__.py	81.81%	1 Missing and 1 partial ⚠️
aiperf/exporters/gpu_telemetry_console_exporter.py	97.10%	1 Missing and 1 partial ⚠️
aiperf/exporters/csv_exporter.py	90.90%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5320199 and e61deb6.

📒 Files selected for processing (5)

aiperf/__main__.py (1 hunks)
aiperf/cli.py (0 hunks)
aiperf/gpu_telemetry/telemetry_manager.py (2 hunks)
pyproject.toml (1 hunks)
tests/test_main.py (1 hunks)

💤 Files with no reviewable changes (1)

aiperf/cli.py

🧰 Additional context used

🪛 Ruff (0.13.3)

tests/test_main.py

65-65: Unused function argument: service_config

(ARG001)

🔇 Additional comments (8)

aiperf/gpu_telemetry/telemetry_manager.py (2)

11-11: LGTM! Missing import now added.

The CommAddress import was already being used at Line 72 but was missing from the imports. This change corrects that omission.

199-199: Critical fix: Missing await added.

The _disable_telemetry_and_stop() method is async (defined at Line 226), so this call must be awaited. Without the await, the coroutine would not execute, leaving telemetry improperly enabled and the service running when it should stop.

pyproject.toml (1)

55-55: LGTM! Entry point updated to support argument augmentation.

The entry point now targets the new main() function in __main__.py, which prepares the CLI invocation by augmenting --gpu-telemetry arguments before delegating to app().

aiperf/__main__.py (1)

11-14: TODO reminder: Remove hack after cyclopts v4 upgrade.

The TODO comment indicates this is a temporary workaround. Ensure this is tracked and removed once cyclopts v4 is adopted.

Do you want me to open a new issue to track this technical debt and the cyclopts v4 upgrade?

tests/test_main.py (4)

15-20: LGTM! Proper test isolation.

The fixture correctly preserves and restores sys.argv after each test, ensuring test isolation and preventing side effects.

22-56: LGTM! Comprehensive test coverage.

The parametrized test cases thoroughly cover all expected scenarios:

Flag at end without value

Flag followed by single/double dash options

Flag with custom value

Flag absent

65-65: False positive: Parameter required by signature.

The service_config parameter is intentionally unused in the mock but must match the signature of run_system_controller to satisfy the patch. This is a false positive from the static analysis tool.

57-78: LGTM! Test correctly verifies augmentation and propagation.

The test properly:

Constructs sys.argv with test parameters

Mocks run_system_controller to capture user_config

Verifies both sys.argv augmentation and user_config.gpu_telemetry propagation

aiperf/__main__.py

ajcasagrande · 2025-10-07T17:44:20Z

commited directly into base branch

ilana-n and others added 11 commits September 30, 2025 18:19

feat: changed cli argument to --gpu-telemetry

bbb69dc

feat: added parsing logic so that /metrics is appended to url if forg…

0a51a0a

…otten

fix: address @ajcasagrande comments

4eca609

fix: unit tests for gpu telemetry and add docs

03feb69

update docs

547ce95

remove docs from this pr

06f758f

feat: add new metrics and address push records from telemetry manager…

dd0168a

… to records manager

fix: address nitpicks

708f9e2

Merge branch 'feature/gpu-telemetry' into gpu-telemetry-cli

5320199

Signed-off-by: ilana-n <[email protected]>

fix: add hack to support --gpu-telemetry without any arguments

e61deb6

ajcasagrande requested a review from ilana-n October 7, 2025 03:14

github-actions bot added the fix label Oct 7, 2025

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

aiperf/__main__.py Show resolved Hide resolved

ilana-n approved these changes Oct 7, 2025

View reviewed changes

Base automatically changed from gpu-telemetry-cli to feature/gpu-telemetry October 7, 2025 15:42

ajcasagrande closed this Oct 7, 2025

ajcasagrande deleted the ajc/gpu-flag branch October 7, 2025 23:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add hack to support --gpu-telemetry without any arguments #329

fix: add hack to support --gpu-telemetry without any arguments #329

Uh oh!

ajcasagrande commented Oct 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

ajcasagrande commented Oct 7, 2025

Uh oh!

coderabbitai bot commented Oct 7, 2025

Uh oh!

codecov bot commented Oct 7, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

ajcasagrande commented Oct 7, 2025

Uh oh!

Uh oh!

fix: add hack to support --gpu-telemetry without any arguments #329

fix: add hack to support --gpu-telemetry without any arguments #329

Uh oh!

Conversation

ajcasagrande commented Oct 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

ajcasagrande commented Oct 7, 2025

Uh oh!

coderabbitai bot commented Oct 7, 2025

Uh oh!

codecov bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ajcasagrande commented Oct 7, 2025

Uh oh!

Uh oh!

ajcasagrande commented Oct 7, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 7, 2025 •

edited

Loading

codecov bot commented Oct 7, 2025 •

edited

Loading