Skip to content

Conversation

ajcasagrande
Copy link
Contributor

@ajcasagrande ajcasagrande commented Oct 7, 2025

Summary by CodeRabbit

  • New Features
    • CLI now auto-fills a default GPU telemetry endpoint when the flag is provided without a value, improving usability.
  • Bug Fixes
    • More reliable behavior when GPU telemetry collectors fail: telemetry is cleanly disabled and the run stops predictably.
  • Chores
    • Updated the command entry point to route through a new launcher; no change to how you invoke the tool.
  • Style
    • Minor formatting cleanup.
  • Tests
    • Added coverage for CLI argument handling and GPU telemetry configuration scenarios.

ilana-n and others added 11 commits September 30, 2025 18:19
- Add comprehensive GPU telemetry data collection and monitoring system
- Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics
- Create GPU telemetry console exporter for CLI display of telemetry data
- Add telemetry results post-processor for data analysis and aggregation
- Implement comprehensive unit test coverage for all telemetry components
- Add integration tests for end-to-end telemetry functionality
- Update system controller to integrate GPU telemetry with existing infrastructure
- Extend existing exporters to support telemetry data formats
- Add comprehensive error handling and async callback support
- Include code cleanup, linting fixes, and documentation improvements
@ajcasagrande ajcasagrande requested a review from ilana-n October 7, 2025 03:14
Copy link

coderabbitai bot commented Oct 7, 2025

Walkthrough

Introduces a new CLI entry point aiperf.main.main that augments argv for --gpu-telemetry when value-less, then delegates to app(). Updates pyproject script to target the new entry. Fixes telemetry_manager start flow by awaiting disable when no collectors start. Adds tests for argv augmentation and telemetry config. Minor CLI whitespace cleanup.

Changes

Cohort / File(s) Summary
CLI entry point and packaging
aiperf/__main__.py, pyproject.toml
Added main() that injects DEFAULT_DCGM_ENDPOINT when --gpu-telemetry lacks a value and then calls app(); updated [project.scripts] aiperf to point to aiperf.main:main.
Telemetry manager start-flow fix
aiperf/gpu_telemetry/telemetry_manager.py
Imported CommAddress; corrected _on_start_profiling to await self._disable_telemetry_and_stop(...) when started_count == 0.
CLI minor cleanup
aiperf/cli.py
Removed a stray blank line; no logic change.
Tests for main entry behavior
tests/test_main.py
New tests covering --gpu-telemetry argv augmentation and propagation into user_config; patches run_system_controller to inspect inputs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Shell as CLI (aiperf)
  participant Main as aiperf.__main__.main
  participant CLI as aiperf.cli.app
  participant Runner as cli_runner.run_system_controller

  User->>Shell: Invoke "aiperf [args]"
  Shell->>Main: __main__.py main(argv)
  Main->>Main: Inspect argv for --gpu-telemetry
  alt flag present without value or followed by another option
    Main->>Main: Inject DEFAULT_DCGM_ENDPOINT into argv
    note right of Main: Augmented argv
  else flag absent or has value
    Main->>Main: Leave argv unchanged
  end
  Main->>CLI: app(argv)
  CLI->>Runner: run_system_controller(user_config)
  Runner-->>CLI: Exit code
  CLI-->>Main: Exit code
  Main-->>Shell: sys.exit(code)
Loading
sequenceDiagram
  autonumber
  participant TM as TelemetryManager
  participant Collectors as GPU Collectors
  participant Stopper as _disable_telemetry_and_stop

  TM->>Collectors: Start all collectors
  Collectors-->>TM: started_count
  alt started_count == 0
    TM->>Stopper: await disable and stop("all collectors failed to start")
    Stopper-->>TM: done
  else started_count > 0
    TM->>TM: proceed with profiling
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I nudge the flags with whiskered care,
A default endpoint fills the air.
If GPUs stay mum, I halt the show—
Await, then stop, and let you know.
New paths to run, tests hop in line,
A tidy burrow, crisp and fine. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The provided title clearly summarizes the primary change by stating that it fixes support for the --gpu-telemetry flag when no argument is given, directly reflecting the main purpose of the patch.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ajc/gpu-flag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the fix label Oct 7, 2025
@ajcasagrande
Copy link
Contributor Author

@CodeRabbit review

Copy link

coderabbitai bot commented Oct 7, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

codecov bot commented Oct 7, 2025

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5320199 and e61deb6.

📒 Files selected for processing (5)
  • aiperf/__main__.py (1 hunks)
  • aiperf/cli.py (0 hunks)
  • aiperf/gpu_telemetry/telemetry_manager.py (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_main.py (1 hunks)
💤 Files with no reviewable changes (1)
  • aiperf/cli.py
🧰 Additional context used
🪛 Ruff (0.13.3)
tests/test_main.py

65-65: Unused function argument: service_config

(ARG001)

🔇 Additional comments (8)
aiperf/gpu_telemetry/telemetry_manager.py (2)

11-11: LGTM! Missing import now added.

The CommAddress import was already being used at Line 72 but was missing from the imports. This change corrects that omission.


199-199: Critical fix: Missing await added.

The _disable_telemetry_and_stop() method is async (defined at Line 226), so this call must be awaited. Without the await, the coroutine would not execute, leaving telemetry improperly enabled and the service running when it should stop.

pyproject.toml (1)

55-55: LGTM! Entry point updated to support argument augmentation.

The entry point now targets the new main() function in __main__.py, which prepares the CLI invocation by augmenting --gpu-telemetry arguments before delegating to app().

aiperf/__main__.py (1)

11-14: TODO reminder: Remove hack after cyclopts v4 upgrade.

The TODO comment indicates this is a temporary workaround. Ensure this is tracked and removed once cyclopts v4 is adopted.

Do you want me to open a new issue to track this technical debt and the cyclopts v4 upgrade?

tests/test_main.py (4)

15-20: LGTM! Proper test isolation.

The fixture correctly preserves and restores sys.argv after each test, ensuring test isolation and preventing side effects.


22-56: LGTM! Comprehensive test coverage.

The parametrized test cases thoroughly cover all expected scenarios:

  • Flag at end without value
  • Flag followed by single/double dash options
  • Flag with custom value
  • Flag absent

65-65: False positive: Parameter required by signature.

The service_config parameter is intentionally unused in the mock but must match the signature of run_system_controller to satisfy the patch. This is a false positive from the static analysis tool.


57-78: LGTM! Test correctly verifies augmentation and propagation.

The test properly:

  • Constructs sys.argv with test parameters
  • Mocks run_system_controller to capture user_config
  • Verifies both sys.argv augmentation and user_config.gpu_telemetry propagation

Base automatically changed from gpu-telemetry-cli to feature/gpu-telemetry October 7, 2025 15:42
@ajcasagrande
Copy link
Contributor Author

commited directly into base branch

@ajcasagrande ajcasagrande deleted the ajc/gpu-flag branch October 7, 2025 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants