Skip to content

Use the newly added runner to test gfx110x-dgpu on windows OS for pre-submit, post-submit and nightly#2061

Closed
dezhiAmd wants to merge 7 commits into
ROCm:mainfrom
dezhiAmd:test_new_runner
Closed

Use the newly added runner to test gfx110x-dgpu on windows OS for pre-submit, post-submit and nightly#2061
dezhiAmd wants to merge 7 commits into
ROCm:mainfrom
dezhiAmd:test_new_runner

Conversation

@dezhiAmd
Copy link
Copy Markdown
Contributor

@dezhiAmd dezhiAmd commented Nov 7, 2025

Motivation

Use the newly added runner to test gfx110x-dgpu nightly, pre-submit and post-submit. A follow-up PR will be enabling test gfx110x-dgpu nightly only

Technical Details

Add gfx110x test to amdgpu_family_matrix.py nightly section. Currently there is bug that the nightly build overwrites the presubmit version in amdgpu_family_matrix.py. It will be fixed in a follow-up PR.

Test Plan

Enable Configure test matrix and all tests using this PR and observe all tests passing with this link

Test Result

Test pass

Submission Checklist

@dezhiAmd dezhiAmd force-pushed the test_new_runner branch 2 times, most recently from 41aaee3 to 59b85ac Compare November 11, 2025 17:29
@dezhiAmd dezhiAmd marked this pull request as ready for review November 11, 2025 17:45
@dezhiAmd dezhiAmd requested a review from ScottTodd November 11, 2025 17:46
@dezhiAmd dezhiAmd changed the title draft to test new runner Use the newly added runner to test gfx110x-dgpu nightly Nov 11, 2025
Comment on lines 52 to 53
"windows": {
"test-runs-on": "",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this code currently supports having a GPU family in both amdgpu_family_info_matrix_presubmit and amdgpu_family_info_matrix_nightly.

There are also currently unit test failures shown at https://github.com/ROCm/TheRock/actions/runs/19273842239/job/55108689779?pr=2061#step:9:77

Do we have enough runner capacity to add windows-gfx110X-gpu-rocm here to presubmit? If not, we may need to make some changes to build_tools/github_actions/configure_ci.py like I tried in #1350

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. For context, this test has been added to the nightly CI rather than the pre-submit pipeline due to its current execution time, which is relatively long. This approach helps maintain efficiency in the main CI runs. Once we improve the test performance and reduce runtime, we can revisit the idea of including it in the pre-submit workflow.

  2. The failure from this link appears to be not related to this PR. I ran the unit test locally and it passes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those unit tests are for the code this PR modifies.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs after the latest push show that this is buggy (as expected): https://github.com/ROCm/TheRock/actions/runs/19283347174/job/55138893201?pr=2061#step:4:52

The presubmit run on this pull request should not be using "windows_variants": "[{\"test-runs-on\": \"windows-gfx110X-gpu-rocm\", \"family\": \"gfx110X-dgpu\" if you want this to only run nightly.

The code in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/configure_ci.py and https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py will need more changes.

See #1097 and #1350 for some ideas. The unit tests are there for easier testing. You'll want to extend them to cover this case of presubmit and nightly having some overlap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current fail reason when clicking "Explain Errors" button:
The job failed due to AWS S3 upload errors: "AccessDenied" when attempting to upload artifacts using the aws s3 cp command. This indicates that the AWS credentials used by the job do not have the required permissions to upload objects to the s3://therock-ci-artifacts-external/ROCm-TheRock/ bucket.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still has the bug I mentioned, see logs at https://github.com/ROCm/TheRock/actions/runs/19285007469/job/55202935915?pr=2061#step:4:59 (again - you will need to edit the files I linked and run the unit tests locally to fix that issue, the code is not currently equipped to have a target family in both presubmit and nightly lists)

Upload errors are unrelated to the changes here. I guess my changes in #2046 didn't set the right permissions for PRs from forks to upload to therock-ci-artifacts-external using the therock-ci role (cc @marbre )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Permissions for the upload failures should be fixed now, see #2099 (comment) and #2046 (comment))

Comment thread build_tools/github_actions/amdgpu_family_matrix.py
Signed-off-by: dezhliao <dezhliao@amd.com>
@dezhiAmd dezhiAmd requested a review from ScottTodd November 12, 2025 01:31
dezhiAmd and others added 5 commits November 11, 2025 17:36
Signed-off-by: dezhliao <dezhliao@amd.com>
Signed-off-by: dezhliao <dezhi.liao@amd.com>
Signed-off-by: dezhliao <dezhi.liao@amd.com>
Signed-off-by: dezhliao <dezhliao@amd.com>
Comment thread docs/rfcs/RFC0007-rdc-therock-integration.md
Comment on lines 52 to 53
"windows": {
"test-runs-on": "",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still has the bug I mentioned, see logs at https://github.com/ROCm/TheRock/actions/runs/19285007469/job/55202935915?pr=2061#step:4:59 (again - you will need to edit the files I linked and run the unit tests locally to fix that issue, the code is not currently equipped to have a target family in both presubmit and nightly lists)

Upload errors are unrelated to the changes here. I guess my changes in #2046 didn't set the right permissions for PRs from forks to upload to therock-ci-artifacts-external using the therock-ci role (cc @marbre )

@dezhiAmd dezhiAmd requested a review from ScottTodd November 12, 2025 21:44
Copy link
Copy Markdown
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still enables the tests on presubmit and postsubmit builds, in contrast to the PR title and description.

@dezhiAmd dezhiAmd changed the title Use the newly added runner to test gfx110x-dgpu nightly Use the newly added runner to test gfx110x-dgpu on windows OS Nov 12, 2025
@dezhiAmd dezhiAmd changed the title Use the newly added runner to test gfx110x-dgpu on windows OS Use the newly added runner to test gfx110x-dgpu on windows OS for pre-submit, post-submit and nightly Nov 12, 2025
@dezhiAmd
Copy link
Copy Markdown
Contributor Author

This still enables the tests on presubmit and postsubmit builds, in contrast to the PR title and description.

Title is changed

@amd-justchen
Copy link
Copy Markdown
Contributor

amd-justchen commented Nov 13, 2025

Cross posting discussion summary:

  • Our original goal was to try out newly added cloud windows gfx110x runner functionality, and ideally on nightlies only
  • Initial test matrix changes only added gfx110x to nightlies section
  • However, bug with test matrix script causes tests to run on pre, post, and nightlies which doesn't correspond to intent of nightlies only commit, and the original PR title

With that said the next steps would either be:

implement granular per test stage functionality in a new PR, then merge this as is (so that since gfx110x is only added to nightlies then it will actually run it in nightlies on CI)
Reflect in current PR a title that implies all stages of testing are run, and also add gfx110x in code to all stages, then make the change in a new PR to add the capability to exclusively run in nightlies.

I believe either way allows the committed changes to correspond to actual behavior and a path to granular testing.

  1. would lead to lower overall impact of runner utilization (given that we only have one so far) leading to less blocking of signal for runs overall while

  2. would allow tests to begin running immediately on gfx110x despite being in all stages, ideally quickly following up with the granular test stages PR 

@dezhiAmd dezhiAmd deleted the test_new_runner branch November 14, 2025 00:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants