fix: add hack to support --gpu-telemetry without any arguments #329
- Add comprehensive GPU telemetry data collection and monitoring system
- Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics
- Create GPU telemetry console exporter for CLI display of telemetry data
- Add telemetry results post-processor for data analysis and aggregation
- Implement comprehensive unit test coverage for all telemetry components
- Add integration tests for end-to-end telemetry functionality
- Update system controller to integrate GPU telemetry with existing infrastructure
- Extend existing exporters to support telemetry data formats
- Add comprehensive error handling and async callback support
- Include code cleanup, linting fixes, and documentation improvements
… to records manager
Signed-off-by: ilana-n <[email protected]>
Walkthrough
Introduces a new CLI entry point, aiperf.__main__.main, that augments argv when --gpu-telemetry is passed without a value and then delegates to app(). Updates the pyproject script to target the new entry point. Fixes the telemetry_manager start flow by awaiting the disable call when no collectors start. Adds tests for argv augmentation and telemetry config. Minor CLI whitespace cleanup.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant Shell as CLI (aiperf)
participant Main as aiperf.__main__.main
participant CLI as aiperf.cli.app
participant Runner as cli_runner.run_system_controller
User->>Shell: Invoke "aiperf [args]"
Shell->>Main: __main__.py main(argv)
Main->>Main: Inspect argv for --gpu-telemetry
alt flag present without value or followed by another option
Main->>Main: Inject DEFAULT_DCGM_ENDPOINT into argv
note right of Main: Augmented argv
else flag absent or has value
Main->>Main: Leave argv unchanged
end
Main->>CLI: app(argv)
CLI->>Runner: run_system_controller(user_config)
Runner-->>CLI: Exit code
CLI-->>Main: Exit code
Main-->>Shell: sys.exit(code)
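The augmentation step in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name augment_gpu_telemetry_args is hypothetical, and the value assigned to DEFAULT_DCGM_ENDPOINT is a placeholder assumption, since the real constant is defined elsewhere in aiperf.

```python
# Placeholder value (assumption); the real DEFAULT_DCGM_ENDPOINT lives in aiperf.
DEFAULT_DCGM_ENDPOINT = "http://localhost:9401/metrics"


def augment_gpu_telemetry_args(argv):
    """Return a copy of argv with a default endpoint after a value-less --gpu-telemetry.

    Hypothetical helper: a value is considered missing when the flag is the
    last argument or is immediately followed by another option ("-x" or "--xyz").
    """
    result = []
    for i, arg in enumerate(argv):
        result.append(arg)
        if arg != "--gpu-telemetry":
            continue
        next_arg = argv[i + 1] if i + 1 < len(argv) else None
        if next_arg is None or next_arg.startswith("-"):
            # Flag present without a value: inject the default endpoint.
            result.append(DEFAULT_DCGM_ENDPOINT)
    return result
```

Returning a new list rather than mutating sys.argv in place keeps the helper easy to test; an entry point along these lines would assign the result back to sys.argv before calling app().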
sequenceDiagram
autonumber
participant TM as TelemetryManager
participant Collectors as GPU Collectors
participant Stopper as _disable_telemetry_and_stop
TM->>Collectors: Start all collectors
Collectors-->>TM: started_count
alt started_count == 0
TM->>Stopper: await disable and stop("all collectors failed to start")
Stopper-->>TM: done
else started_count > 0
TM->>TM: proceed with profiling
end
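To make the effect of the missing await concrete, here is a small self-contained sketch. The class TelemetryManagerSketch and its start_* methods are hypothetical stand-ins, not the real TelemetryManager; only the method name _disable_telemetry_and_stop is taken from the review.

```python
import asyncio


class TelemetryManagerSketch:
    """Hypothetical stand-in for the manager; only the await behaviour is modelled."""

    def __init__(self):
        self.enabled = True

    async def _disable_telemetry_and_stop(self, reason):
        self.enabled = False

    async def start_without_await(self, started_count):
        if started_count == 0:
            # Bug: calling an async method only creates a coroutine object;
            # its body never runs unless it is awaited.
            coro = self._disable_telemetry_and_stop("all collectors failed to start")
            coro.close()  # closed here only to avoid a "never awaited" warning in the demo

    async def start_with_await(self, started_count):
        if started_count == 0:
            # Fix: awaiting actually executes the disable-and-stop logic.
            await self._disable_telemetry_and_stop("all collectors failed to start")


def run_demo():
    buggy, fixed = TelemetryManagerSketch(), TelemetryManagerSketch()
    asyncio.run(buggy.start_without_await(0))
    asyncio.run(fixed.start_with_await(0))
    return buggy.enabled, fixed.enabled
```

In this sketch, telemetry stays enabled on the un-awaited path and is disabled on the awaited path, which mirrors the behaviour the fix addresses.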
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
@CodeRabbit review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- aiperf/__main__.py (1 hunks)
- aiperf/cli.py (0 hunks)
- aiperf/gpu_telemetry/telemetry_manager.py (2 hunks)
- pyproject.toml (1 hunks)
- tests/test_main.py (1 hunks)
💤 Files with no reviewable changes (1)
- aiperf/cli.py
🧰 Additional context used
🪛 Ruff (0.13.3)
tests/test_main.py
65-65: Unused function argument: service_config (ARG001)
🔇 Additional comments (8)
aiperf/gpu_telemetry/telemetry_manager.py (2)
11-11: LGTM! Missing import now added.
The CommAddress import was already being used at Line 72 but was missing from the imports. This change corrects that omission.
199-199: Critical fix: Missing await added.
The _disable_telemetry_and_stop() method is async (defined at Line 226), so this call must be awaited. Without the await, the coroutine would not execute, leaving telemetry improperly enabled and the service running when it should stop.
pyproject.toml (1)
55-55: LGTM! Entry point updated to support argument augmentation.
The entry point now targets the new main() function in __main__.py, which prepares the CLI invocation by augmenting --gpu-telemetry arguments before delegating to app().
aiperf/__main__.py (1)
11-14: TODO reminder: Remove hack after cyclopts v4 upgrade.
The TODO comment indicates this is a temporary workaround. Ensure this is tracked and removed once cyclopts v4 is adopted.
Do you want me to open a new issue to track this technical debt and the cyclopts v4 upgrade?
tests/test_main.py (4)
15-20: LGTM! Proper test isolation.
The fixture correctly preserves and restores sys.argv after each test, ensuring test isolation and preventing side effects.
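The save-and-restore pattern that such a fixture relies on can be sketched with a plain context manager. The name preserved_argv is hypothetical and this is not the fixture from tests/test_main.py; in pytest the same body would typically sit in a fixture that yields between the setup and the restore.

```python
import sys
from contextlib import contextmanager


@contextmanager
def preserved_argv(temporary_argv):
    """Temporarily replace sys.argv and restore the original list afterwards."""
    original = sys.argv
    sys.argv = list(temporary_argv)
    try:
        yield sys.argv
    finally:
        # Runs even if the body raises, so later tests see the original argv.
        sys.argv = original
```

Restoring in a finally block matters here because a failing assertion inside the test body would otherwise leak the modified argv into later tests.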
22-56: LGTM! Comprehensive test coverage.
The parametrized test cases thoroughly cover all expected scenarios:
- Flag at end without value
- Flag followed by single/double dash options
- Flag with custom value
- Flag absent
65-65: False positive: Parameter required by signature.
The service_config parameter is intentionally unused in the mock but must match the signature of run_system_controller to satisfy the patch. This is a false positive from the static analysis tool.
57-78: LGTM! Test correctly verifies augmentation and propagation.
The test properly:
- Constructs sys.argv with test parameters
- Mocks run_system_controller to capture user_config
- Verifies both sys.argv augmentation and user_config.gpu_telemetry propagation
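The capture-by-mock approach described in the last two comments might look roughly like the sketch below. Everything here is a hypothetical stand-in built from the standard library: cli_runner is a SimpleNamespace rather than the real module, app is a toy CLI, and gpu_telemetry is modelled as a single string, which may not match the real config type.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the module that owns run_system_controller.
cli_runner = SimpleNamespace(
    run_system_controller=lambda user_config, service_config: 0
)


def app(argv):
    """Toy CLI (assumption): builds a config from argv and calls the runner."""
    endpoint = None
    if "--gpu-telemetry" in argv:
        endpoint = argv[argv.index("--gpu-telemetry") + 1]
    user_config = SimpleNamespace(gpu_telemetry=endpoint)
    return cli_runner.run_system_controller(user_config, SimpleNamespace())


def capture_user_config(argv):
    """Run app() with the runner patched out and return the config it received."""
    captured = {}

    def fake_run(user_config, service_config):
        # service_config is unused, but the parameter must stay so the
        # replacement accepts the same call as the real function.
        captured["user_config"] = user_config
        return 0

    with patch.object(cli_runner, "run_system_controller", side_effect=fake_run):
        app(argv)
    return captured["user_config"]
```

This also shows why a linter flags the unused argument: the replacement never reads service_config, yet dropping it would make the patched call fail with a TypeError.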
Committed directly into base branch
Summary by CodeRabbit