Add validation tests for amd-smi CLI output #3805

Closed
HRISHIKESHTHULA-AMD wants to merge 27 commits into main from users/hrithula/smi_test

Conversation

HRISHIKESHTHULA-AMD commented Mar 6, 2026

Motivation

Add validation tests for the output of amd-smi list.

Technical Details

amd-smi list is tested with different output options.

Test Plan

The --json, --csv, and --file options are tested, along with the default invocation with no option.
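The format coverage described in this test plan could be checked along the following lines. This is a minimal sketch, not the PR's actual test code: the field names in `REQUIRED_FIELDS` and the sample payloads are assumptions made for illustration, and the real `amd-smi list` schema may differ.

```python
import csv
import io
import json

# Hypothetical required fields; the real `amd-smi list` schema may differ.
REQUIRED_FIELDS = ("gpu", "bdf", "uuid")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in one GPU record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def parse_list_output(text: str, fmt: str) -> list[dict]:
    """Parse JSON or CSV text into a list of per-GPU dictionaries."""
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text.strip())))
    raise ValueError(f"unsupported format: {fmt}")


def validate(text: str, fmt: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    records = parse_list_output(text, fmt)
    if not records:
        return ["no GPU records found"]
    problems = []
    for index, record in enumerate(records):
        for field in missing_fields(record):
            problems.append(f"record {index}: missing {field}")
    return problems


# Made-up sample payloads standing in for real CLI output.
SAMPLE_JSON = '[{"gpu": 0, "bdf": "0000:03:00.0", "uuid": "abc"}]'
SAMPLE_CSV = "gpu,bdf,uuid\n0,0000:03:00.0,abc\n1,0000:04:00.0,\n"
```

Returning a list of problems rather than asserting inside the helper lets one test report every missing field at once. The same helper can be reused for --file output by reading the file's text first.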

Test Result

6 tests passed. https://github.com/ROCm/TheRock/actions/runs/22840058512

Submission Checklist

Contributor

Copilot AI left a comment

Pull request overview

Adds a new CI test to validate amd-smi list output across human/JSON/CSV modes (including --file output), and wires it into the GitHub Actions test matrix so it runs on Linux.

Changes:

  • Add test_amdsmi_cli.py pytest suite to validate required GPU fields in amd-smi list output across multiple output formats.
  • Register a new amdsmi_cli job in the GitHub Actions test configuration matrix to execute the new pytest test on Linux.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| build_tools/github_actions/test_executable_scripts/test_amdsmi_cli.py | New pytest-based end-to-end validation for amd-smi list output modes (stdout/file, human/JSON/CSV). |
| build_tools/github_actions/fetch_test_configurations.py | Adds new amdsmi_cli test job entry to CI test matrix. |

th = os.getenv("THEROCK_BIN_DIR")
if not th:
    pytest.skip("THEROCK_BIN_DIR not set; skipping amdsmi tests")
p = Path(th) / "amd-smi"
Copilot AI Mar 6, 2026

THEROCK_BIN_DIR can be a relative path in CI (e.g. ./build/bin). Using Path(th) / 'amd-smi' without resolve() makes this test depend on the current working directory; running the test from a different cwd can fail to find the binary. Consider resolving THEROCK_BIN_DIR (and/or constructing THEROCK_DIR like other test scripts) before checking existence.

Suggested change
- p = Path(th) / "amd-smi"
+ bin_dir = Path(th).resolve()
+ p = bin_dir / "amd-smi"
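
A slightly different take on the same idea, as a minimal sketch: instead of resolving against the current working directory, a relative THEROCK_BIN_DIR is anchored to a known base directory. The helper name `locate_amd_smi` and the `base_dir` parameter are hypothetical and not part of the PR.

```python
from pathlib import Path
from typing import Optional


def locate_amd_smi(bin_dir: Optional[str], base_dir: Path) -> Optional[Path]:
    """Return an absolute path to amd-smi, or None if bin_dir is unset.

    A relative bin_dir is anchored to base_dir (an assumed repo root) rather
    than to whatever the current working directory happens to be.
    """
    if not bin_dir:
        return None
    candidate = Path(bin_dir)
    if not candidate.is_absolute():
        candidate = base_dir / candidate
    return candidate.resolve() / "amd-smi"
```

A caller would typically pass `os.getenv("THEROCK_BIN_DIR")` as `bin_dir`, skip the test when the result is `None`, and assert that the returned path exists otherwise.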

Comment on lines +55 to +57
cmd = [str(amd_smi), "list"] + args
proc = subprocess.run(cmd, capture_output=True, text=True)
return proc.returncode, proc.stdout, proc.stderr
Copilot AI Mar 6, 2026

_run_amd_smi runs the binary without setting cwd (and without explicitly propagating any environment tweaks). Since CI commonly sets THEROCK_BIN_DIR as a relative path, calling amd-smi from a different working directory can break. Recommend running with an explicit cwd (e.g. repo root like other scripts) and/or resolving the binary path to an absolute path before invoking subprocess.
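
One way to follow this recommendation, as a minimal sketch: make the binary path absolute before invoking it and pass an explicit `cwd`. The helper name `run_cli` and its signature are hypothetical; the PR's `_run_amd_smi` may be structured differently.

```python
import subprocess
import sys
from pathlib import Path


def run_cli(binary: Path, args: list[str], cwd: Path) -> tuple[int, str, str]:
    """Run a CLI via an absolute binary path and an explicit working directory."""
    # absolute() anchors a relative path to the caller's cwd without
    # following symlinks, so the child process no longer depends on it.
    cmd = [str(binary.absolute()), *args]
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in invocation: the Python interpreter plays the role of amd-smi here.
if __name__ == "__main__":
    code, out, err = run_cli(Path(sys.executable), ["--version"], Path.cwd())
    print(code, out.strip() or err.strip())
```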

"windows": 1,
},
},

Copilot AI Mar 6, 2026

There is trailing whitespace on the blank line before the new amdsmi_cli entry. Please remove it to avoid lint/noise in future diffs.

th = os.getenv("THEROCK_BIN_DIR")
if not th:
    pytest.skip("THEROCK_BIN_DIR not set; skipping amdsmi tests")
p = Path(th) / "amd-smi"

Can we use meaningful names instead of short names?

Author

Done

return blocks


def _validate_human_block(block_text: str) -> list[str]:

_validate_human_block?
The name "human block" is misleading. Can we use _validate_gpu_block() or something similar instead?

Author

Done

sriasapu commented Mar 6, 2026

It would be good to add a PR description as well.

Contributor

identical to this?

def test_amdsmi_suite(self):

although, i am fine with this being separate as it can be used in other repos
cc: @jayhawk-commits

Author

I checked the script you mentioned. That one tests the amdsmi library; this PR tests the CLI.

Contributor

meant to link this one:

def test_amdsmi_suite(self):

even so, i would think CLI testing is a sanity check and should be added there

Comment on lines +29 to +36
Skips the test via pytest if `THEROCK_BIN_DIR` is not set. Asserts that
the expected `amd-smi` binary exists at the resolved path.

Args:
    None

Returns:
    pathlib.Path: Path to the `amd-smi` binary.
Contributor

looks like claude-generated docs. i think this function is self-explanatory, so these comments can be removed

Author

Done

Comment thread build_tools/github_actions/test_executable_scripts/test_amdsmi_cli.py Outdated
Comment on lines +55 to +64

The function invokes the binary via subprocess.run and captures text
output for assertions in the tests.

Args:
    amd_smi_path (pathlib.Path): Path to the `amd-smi` binary.
    modifiers (list[str]): Arguments to pass after `amd-smi list`.

Returns:
    tuple[int, str, str]: Return code, stdout text, stderr text.
Contributor

same here, looks like claude generated code

this can be removed as function is self explanatory

Author

Done

},
"amdsmi_cli": {
"job_name": "amdsmi_cli",
"fetch_artifact_args": "--tests",
Contributor

let's add --base-only as we only need base packages

Author

Done

Contributor

this might be a good opportunity to combine with amdsmi tests, since this just validates output, might as well try to utilize GPU for amdsmi tests in parallel

Contributor

@HRISHIKESHTHULA-AMD we currently run amdsmi tests in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/test_executable_scripts/test_amdsmi.py, is there an opportunity to combine? as this script will also install identical artifacts. no need for two separate jobs to do identical artifact extraction

Author

I went through the script you mentioned. Its description says "this script must be run manually by developers inside a privileged ROCm environment or container", so it seems difficult to combine a manually triggered script with a CI-triggered one. Please share your thoughts on this.

Contributor

meant to link this one:

def test_amdsmi_suite(self):

Author

Addressed in this comment.
Since the recently added test_sanity.py calls the sanity tests of different components, I've integrated the amd-smi CLI sanity tests into it.

The non-sanity CLI tests in the same amd-smi CLI script won't be executed by test_sanity.py, but will be executed by the amdsmi_cli job.
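
The split described here relies on pytest's `-m` marker expressions. A minimal sketch of how the two invocations could be composed, assuming a custom marker named `not_sanity` as used at this stage of the PR; the helper `build_pytest_cmd` is hypothetical:

```python
def build_pytest_cmd(test_path: str, *, sanity: bool) -> list[str]:
    """Build a pytest command that selects one side of the not_sanity marker."""
    # "not not_sanity" selects every test that lacks the marker (sanity phase);
    # "not_sanity" selects only the tests that carry it (dedicated job).
    marker_expr = "not not_sanity" if sanity else "not_sanity"
    return ["pytest", test_path, "-m", marker_expr]


sanity_cmd = build_pytest_cmd("tests/test_amdsmi_cli.py", sanity=True)
job_cmd = build_pytest_cmd("tests/test_amdsmi_cli.py", sanity=False)
```

Passing the expression as its own list element avoids shell quoting problems with the space in "not not_sanity".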

Please review and share your thoughts.

Author

Please review the latest update, which addresses the latest review comments.

Comment thread conftest.py Outdated
Contributor

not sure this is needed?

Author

Removed

Comment on lines +171 to +180
"amdsmi_cli": {
    "job_name": "amdsmi_cli",
    "fetch_artifact_args": "--base-only",
    "timeout_minutes": 15,
    "test_script": "pytest tests/test_amdsmi_cli.py -m not_sanity -o log_cli=true --log-cli-level=INFO",
    "platform": ["linux"],
    "total_shards_dict": {
        "linux": 1,
    },
},
Contributor

this can be removed! as now sanity checks test this, we don't need to run these tests twice

Author

Done

Comment on lines +16 to +22
def _run_pytest(
    cmd: list[str], *, cwd: Path, env: dict[str, str], check: bool
) -> subprocess.CompletedProcess[str]:
    logging.info("++ Exec [%s]$ %s", cwd, " ".join(cmd))
    return subprocess.run(cmd, cwd=cwd, env=env, check=check, text=True)


Contributor

this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems

Author

Done

Comment on lines +50 to +52
# Default sanity behavior: run everything except tests marked as not_sanity.
phase_cmd = cmd + ["-m", "not not_sanity"]
_run_pytest(phase_cmd, cwd=THEROCK_DIR, env=env, check=True)
Contributor

this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems

Author

Done

Comment thread tests/test_amdsmi_cli.py Outdated
assert gpu_blocks, "No GPU blocks found in amd-smi output"


@pytest.mark.not_sanity
Contributor

this can be removed. i would consider checking amdsmi_cli as a "sanity check". if amdsmi failed, i would see bigger problems

Author

Done

Contributor

geomin12 left a comment

lgtm but we do not have any linux signal right now. we will wait until machines are back

geomin12 added the test:hipcub label (for pull requests, runs full tests for only hipcub and other labeled projects) on Mar 20, 2026
Contributor

geomin12 left a comment

i added a label, so this will trigger gfx94X tests (and check sanity). once signal is proven, this can be landed

Contributor

geomin12 left a comment

is this the PR to review ? or #4004 ? we should close the PR that is no longer needed

Author

HRISHIKESHTHULA-AMD commented Mar 25, 2026

is this the PR to review ? or #4004 ? we should close the PR that is no longer needed

#4004 was raised more recently than this one; it contains the changes from this branch plus additional smi tests. So both are valid PRs.
Since this PR has already been reviewed, can we merge it once approved, so that its tests at least cover smi sanity while #4004 is being reviewed and approved?

@madkasul
Contributor

@HRISHIKESHTHULA-AMD, I noticed there is a workflow for running both elevated and non-elevated tests of amd-smi.
Please cross-check that your script does not duplicate that effort.
#ref: rocm-systems/.github/workflows/amdsmi-build.yml

@HRISHIKESHTHULA-AMD
Author

@HRISHIKESHTHULA-AMD, I noticed there is a workflow for running both elevated and non-elevated tests of amd-smi. Please cross-check that your script does not duplicate that effort. #ref: rocm-systems/.github/workflows/amdsmi-build.yml

@madkasul, as confirmed by @phani544, the dev team is not testing the CLI yet.

Contributor

geomin12 left a comment

much cleaner, thanks for the updates and great work

@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Apr 9, 2026
@HRISHIKESHTHULA-AMD HRISHIKESHTHULA-AMD deleted the users/hrithula/smi_test branch April 9, 2026 05:46