Add validation tests for amd-smi CLI output#3805
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new CI test to validate amd-smi list output across human/JSON/CSV modes (including --file output), and wires it into the GitHub Actions test matrix so it runs on Linux.
Changes:
- Add
test_amdsmi_cli.pypytest suite to validate required GPU fields inamd-smi listoutput across multiple output formats. - Register a new
amdsmi_clijob in the GitHub Actions test configuration matrix to execute the new pytest test on Linux.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| build_tools/github_actions/test_executable_scripts/test_amdsmi_cli.py | New pytest-based end-to-end validation for amd-smi list output modes (stdout/file, human/JSON/CSV). |
| build_tools/github_actions/fetch_test_configurations.py | Adds new amdsmi_cli test job entry to CI test matrix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| th = os.getenv("THEROCK_BIN_DIR") | ||
| if not th: | ||
| pytest.skip("THEROCK_BIN_DIR not set; skipping amdsmi tests") | ||
| p = Path(th) / "amd-smi" |
There was a problem hiding this comment.
THEROCK_BIN_DIR can be a relative path in CI (e.g. ./build/bin). Using Path(th) / 'amd-smi' without resolve() makes this test depend on the current working directory; running the test from a different cwd can fail to find the binary. Consider resolving THEROCK_BIN_DIR (and/or constructing THEROCK_DIR like other test scripts) before checking existence.
| p = Path(th) / "amd-smi" | |
| bin_dir = Path(th).resolve() | |
| p = bin_dir / "amd-smi" |
| cmd = [str(amd_smi), "list"] + args | ||
| proc = subprocess.run(cmd, capture_output=True, text=True) | ||
| return proc.returncode, proc.stdout, proc.stderr |
There was a problem hiding this comment.
_run_amd_smi runs the binary without setting cwd (and without explicitly propagating any environment tweaks). Since CI commonly sets THEROCK_BIN_DIR as a relative path, calling amd-smi from a different working directory can break. Recommend running with an explicit cwd (e.g. repo root like other scripts) and/or resolving the binary path to an absolute path before invoking subprocess.
| "windows": 1, | ||
| }, | ||
| }, | ||
|
|
There was a problem hiding this comment.
There is trailing whitespace on the blank line before the new amdsmi_cli entry. Please remove it to avoid lint/noise in future diffs.
| th = os.getenv("THEROCK_BIN_DIR") | ||
| if not th: | ||
| pytest.skip("THEROCK_BIN_DIR not set; skipping amdsmi tests") | ||
| p = Path(th) / "amd-smi" |
There was a problem hiding this comment.
Can we use meaningful names instead of short names
| return blocks | ||
|
|
||
|
|
||
| def _validate_human_block(block_text: str) -> list[str]: |
There was a problem hiding this comment.
_validate_human_block??
human block is misleading, instead can we use _validate_gpu_block() ?? or something
|
It will be good if we add the PR description as well |
There was a problem hiding this comment.
identical to this?
TheRock/tests/test_rocm_sanity.py
Line 167 in af5cb3a
although, i am fine with this being separate as it can be used in other repos
cc: @jayhawk-commits
There was a problem hiding this comment.
I checked the script you mentioned, that's lib testing of amdsmi, this is for cli testing.
There was a problem hiding this comment.
meant to link this one:
TheRock/tests/test_rocm_sanity.py
Line 167 in 0478f9e
even so, i would think CLI testing is a sanity check and should be added there
| Skips the test via pytest if `THEROCK_BIN_DIR` is not set. Asserts that | ||
| the expected `amd-smi` binary exists at the resolved path. | ||
|
|
||
| Args: | ||
| None | ||
|
|
||
| Returns: | ||
| pathlib.Path: Path to the `amd-smi` binary. |
There was a problem hiding this comment.
looks like claude-generated docs. i think this function is self explanatory and can remove these comments
|
|
||
| The function invokes the binary via subprocess.run and captures text | ||
| output for assertions in the tests. | ||
|
|
||
| Args: | ||
| amd_smi_path (pathlib.Path): Path to the `amd-smi` binary. | ||
| modifiers (list[str]): Arguments to pass after `amd-smi list`. | ||
|
|
||
| Returns: | ||
| tuple[int, str, str]: Return code, stdout text, stderr text. |
There was a problem hiding this comment.
same here, looks like claude generated code
this can be removed as function is self explanatory
| }, | ||
| "amdsmi_cli": { | ||
| "job_name": "amdsmi_cli", | ||
| "fetch_artifact_args": "--tests", |
There was a problem hiding this comment.
let's add --base-only as we only need base packages
There was a problem hiding this comment.
this might be a good opportunity to combine with amdsmi tests, since this just validates output, might as well try to utilize GPU for amdsmi tests in parallel
There was a problem hiding this comment.
@HRISHIKESHTHULA-AMD we currently run amdsmi tests in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/test_executable_scripts/test_amdsmi.py, is there an opportunity to combine? as this script will also install identical artifacts. no need for two separate jobs to do identical artifact extraction
There was a problem hiding this comment.
I went through the script you mentioned, the script's description suggests "this script must be run
manually by developers inside a privileged ROCm environment or container" , so it seems to be difficult combining manually triggered script and CI triggered script. Please share your thoughts on this.
There was a problem hiding this comment.
meant to link this one:
TheRock/tests/test_rocm_sanity.py
Line 167 in 0478f9e
There was a problem hiding this comment.
Addressed in this comment.
As recently added test_sanity.py calls different components' sanity tests, I've integrated that with amd-smi cli sanity tests .
non-sanity cli tests part of the same amd-smi cli script won't be executed by test_sanity.py, but will be executed by amdsmi_cli job.
Please review and share your thoughts.
There was a problem hiding this comment.
please review latest update as per latest review comments
… and refactor test_amdsmi_cli.py for improved clarity and functionality
| "amdsmi_cli": { | ||
| "job_name": "amdsmi_cli", | ||
| "fetch_artifact_args": "--base-only", | ||
| "timeout_minutes": 15, | ||
| "test_script": "pytest tests/test_amdsmi_cli.py -m not_sanity -o log_cli=true --log-cli-level=INFO", | ||
| "platform": ["linux"], | ||
| "total_shards_dict": { | ||
| "linux": 1, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
this can be removed! as now sanity checks test this, we don't need to run these tests twice
| def _run_pytest( | ||
| cmd: list[str], *, cwd: Path, env: dict[str, str], check: bool | ||
| ) -> subprocess.CompletedProcess[str]: | ||
| logging.info("++ Exec [%s]$ %s", cwd, " ".join(cmd)) | ||
| return subprocess.run(cmd, cwd=cwd, env=env, check=check, text=True) | ||
|
|
||
|
|
There was a problem hiding this comment.
this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems
| # Default sanity behavior: run everything except tests marked as not_sanity. | ||
| phase_cmd = cmd + ["-m", "not not_sanity"] | ||
| _run_pytest(phase_cmd, cwd=THEROCK_DIR, env=env, check=True) |
There was a problem hiding this comment.
this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems
| assert gpu_blocks, "No GPU blocks found in amd-smi output" | ||
|
|
||
|
|
||
| @pytest.mark.not_sanity |
There was a problem hiding this comment.
this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems
…handling in ROCm sanity tests
…ctly executing pytest command
geomin12
left a comment
There was a problem hiding this comment.
lgtm but we do not have any linux signal right now. we will wait until machines are back
geomin12
left a comment
There was a problem hiding this comment.
i added a label, so this will trigger gfx94X tests (and check sanity). once signal is proven, this can be landed
#4004 is raised recently than this and having changes of this branch plus addition of more smi tests. So, both are valid PRs. |
|
@HRISHIKESHTHULA-AMD, I noticed there is an workflow for running both elevated/non-elevated tests of amd-smi. |
@madkasul , as confirmed by @phani544 , Dev are not testing cli yet |
geomin12
left a comment
There was a problem hiding this comment.
much cleaner, thanks for the updates and great work
Motivation
Adding amd-smi list test
Technical Details
amd-smi list with different options are tested
Test Plan
--json, --csv, --file and no option is being tested
Test Result
6 tests passed. https://github.com/ROCm/TheRock/actions/runs/22840058512
Submission Checklist