Use the newly added runner to test gfx110x-dgpu on windows OS for pre-submit, post-submit and nightly #2061
dezhiAmd wants to merge 7 commits into
41aaee3 to 59b85ac
"windows": {
    "test-runs-on": "",
I'm not sure if this code currently supports having a GPU family in both amdgpu_family_info_matrix_presubmit and amdgpu_family_info_matrix_nightly.
There are also currently unit test failures shown at https://github.com/ROCm/TheRock/actions/runs/19273842239/job/55108689779?pr=2061#step:9:77
Do we have enough runner capacity to add windows-gfx110X-gpu-rocm here to presubmit? If not, we may need to make some changes to build_tools/github_actions/configure_ci.py like I tried in #1350
- For context, this test has been added to the nightly CI rather than the pre-submit pipeline because its execution time is currently relatively long. This keeps the main CI runs efficient. Once we improve the test performance and reduce the runtime, we can revisit including it in the pre-submit workflow.
- The failure at this link appears to be unrelated to this PR. I ran the unit test locally and it passes.
Those unit tests are for the code this PR modifies.
Logs after the latest push show that this is buggy (as expected): https://github.com/ROCm/TheRock/actions/runs/19283347174/job/55138893201?pr=2061#step:4:52
The presubmit run on this pull request should not be using "windows_variants": "[{\"test-runs-on\": \"windows-gfx110X-gpu-rocm\", \"family\": \"gfx110X-dgpu\" if you want this to only run nightly.
The code in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/configure_ci.py and https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py will need more changes.
See #1097 and #1350 for some ideas. The unit tests are there for easier testing. You'll want to extend them to cover this case of presubmit and nightly having some overlap.
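As a starting point for such a test, here is a minimal sketch of a unit test for the overlap case. The dictionaries, the "gfx110x" key, the empty presubmit runner, and the `select_matrix` helper are hypothetical stand-ins, not the actual contents of amdgpu_family_matrix.py or the API of configure_ci.py. Only the runner label and the family name are taken from the logs above.

```python
import unittest

# Hypothetical stand-ins for the presubmit and nightly family matrices.
# Assumption: the presubmit entry leaves "test-runs-on" empty (no test runner),
# while the nightly entry names the new Windows runner.
PRESUBMIT_MATRIX = {
    "gfx110x": {
        "windows": {"test-runs-on": "", "family": "gfx110X-dgpu"},
    },
}
NIGHTLY_MATRIX = {
    "gfx110x": {
        "windows": {
            "test-runs-on": "windows-gfx110X-gpu-rocm",
            "family": "gfx110X-dgpu",
        },
    },
}


def select_matrix(is_nightly: bool) -> dict:
    """Hypothetical helper: pick one matrix per trigger instead of merging."""
    return NIGHTLY_MATRIX if is_nightly else PRESUBMIT_MATRIX


class OverlappingFamilyTest(unittest.TestCase):
    """Covers a family that appears in both the presubmit and nightly lists."""

    def test_presubmit_does_not_pick_up_nightly_runner(self):
        entry = select_matrix(is_nightly=False)["gfx110x"]["windows"]
        self.assertEqual(entry["test-runs-on"], "")

    def test_nightly_uses_the_new_runner(self):
        entry = select_matrix(is_nightly=True)["gfx110x"]["windows"]
        self.assertEqual(entry["test-runs-on"], "windows-gfx110X-gpu-rocm")
```

Saved to a file, this can be run locally with `python -m unittest <file>`. The real tests would call the existing functions in configure_ci.py rather than a stand-in helper.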
Current failure reason shown when clicking the "Explain Errors" button:
The job failed due to AWS S3 upload errors: "AccessDenied" when attempting to upload artifacts using the aws s3 cp command. This indicates that the AWS credentials used by the job do not have the required permissions to upload objects to the s3://therock-ci-artifacts-external/ROCm-TheRock/ bucket.
This still has the bug I mentioned, see logs at https://github.com/ROCm/TheRock/actions/runs/19285007469/job/55202935915?pr=2061#step:4:59 (again - you will need to edit the files I linked and run the unit tests locally to fix that issue, the code is not currently equipped to have a target family in both presubmit and nightly lists)
Upload errors are unrelated to the changes here. I guess my changes in #2046 didn't set the right permissions for PRs from forks to upload to therock-ci-artifacts-external using the therock-ci role (cc @marbre)
(Permissions for the upload failures should be fixed now, see #2099 (comment) and #2046 (comment))
8aca2f3 to 6299866
Signed-off-by: dezhliao <dezhliao@amd.com>
Signed-off-by: dezhliao <dezhliao@amd.com>
Signed-off-by: dezhliao <dezhi.liao@amd.com>
Signed-off-by: dezhliao <dezhi.liao@amd.com>
Signed-off-by: dezhliao <dezhliao@amd.com>
ScottTodd left a comment
This still enables the tests on presubmit and postsubmit builds, in contrast to the PR title and description.
Title is changed.
Cross-posting discussion summary:
Motivation
Use the newly added runner to test gfx110x-dgpu on nightly, pre-submit and post-submit runs. A follow-up PR will restrict the gfx110x-dgpu tests to nightly only.
Technical Details
Add the gfx110x test to the nightly section of amdgpu_family_matrix.py. There is currently a bug where the nightly build overwrites the presubmit version in amdgpu_family_matrix.py. It will be fixed in a follow-up PR.
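One way such an overwrite can arise is sketched below, assuming the two matrices are plain dictionaries keyed by target family and combined with a flat merge. The variable names, the "gfx110x" key, the empty presubmit runner, and the `matrix_for_trigger` helper are illustrative assumptions, not the actual code in amdgpu_family_matrix.py or configure_ci.py.

```python
import copy

# Illustrative matrices (assumed shape): the same family key appears in both.
presubmit = {
    "gfx110x": {"windows": {"test-runs-on": "", "family": "gfx110X-dgpu"}},
}
nightly = {
    "gfx110x": {
        "windows": {
            "test-runs-on": "windows-gfx110X-gpu-rocm",
            "family": "gfx110X-dgpu",
        }
    },
}

# A flat merge keeps only the last value for a duplicate key, so the
# nightly entry silently replaces the presubmit entry for "gfx110x".
merged = {**presubmit, **nightly}


def matrix_for_trigger(trigger: str) -> dict:
    """Hypothetical fix: layer the nightly entries on top only for nightly runs."""
    layers = [presubmit]
    if trigger == "nightly":
        layers.append(nightly)
    result = {}
    for layer in layers:
        # Deep copy so callers cannot mutate the module-level matrices.
        result.update(copy.deepcopy(layer))
    return result
```

With this layering, a presubmit run sees the presubmit entry unchanged, and only a nightly run picks up the new runner for the overlapping family.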
Test Plan
Enable "Configure test matrix" and all tests using this PR, and observe all tests passing with this link
Test Result
Tests pass
Submission Checklist