Merged
20 changes: 10 additions & 10 deletions build_tools/github_actions/amdgpu_family_matrix.py
@@ -25,16 +25,6 @@
             "bypass_tests_for_releases": True,
         },
     },
-}
-
-# The 'postsubmit' matrix runs on 'push' triggers (for every commit to the default branch).
-amdgpu_family_info_matrix_postsubmit = {
-    "gfx950": {
-        "linux": {
-            "test-runs-on": "linux-mi355-1gpu-ossci-rocm",
-            "family": "gfx950-dcgpu",
-        }
-    },
     "gfx115x": {
Member Author

@geomin12 The rocsparse tests are taking 1h20m on Windows gfx1151. Should we trim that set down to a more manageable size (<30 minutes) before moving these up from postsubmit to presubmit?

Sample logs: https://github.com/ROCm/TheRock/actions/runs/17123441166/job/48575971832

That's running ./build/bin/rocsparse-test '--gtest_filter=*quick*' --matrices-dir ./build/clients/matrices/

2025-08-21T12:51:19.3856792Z [----------] Global test environment tear-down
2025-08-21T12:51:19.4065339Z [==========] 123725 tests from 130 test suites ran. (4558051 ms total)
2025-08-21T12:51:19.4065998Z [  PASSED  ] 123725 tests.

I'm not seeing any huge outliers in the logs... it just looks like 100,000+ of anything adds up to being quite slow. That's a mean of 36.8 ms per test. Are these running in parallel or serially? Linux gfx942 takes ~15 minutes and gfx950 takes ~16 minutes for the same total number of tests.

Here are the totals per test suite: https://gist.github.com/ScottTodd/2a14451a47a321cf87d25ac3362b75ef. We could further compute the mean test time for each of those if we wanted to prune.
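A minimal sketch of how those per-suite means could be computed from a saved log, assuming the standard gtest summary lines (`[----------] N tests from Suite (M ms total)` and the final `[==========]` line). The function names and the suite names in the usage note are hypothetical; this is not the script that produced the gist.

```python
import re

# Per-suite summary line printed by gtest when a suite finishes, e.g.
#   [----------] 12 tests from quick/some_suite (3456 ms total)
# search() is unanchored, so a leading CI timestamp on each line is fine.
SUITE_LINE = re.compile(r"\[-{10}\] (\d+) tests? from (\S+) \((\d+) ms total\)")

# Final summary line, e.g.
#   [==========] 123725 tests from 130 test suites ran. (4558051 ms total)
TOTAL_LINE = re.compile(
    r"\[={10}\] (\d+) tests? from \d+ test suites? ran\. \((\d+) ms total\)"
)


def suite_means(log_text):
    """Return {suite_name: (test_count, total_ms, mean_ms)}."""
    stats = {}
    for match in SUITE_LINE.finditer(log_text):
        tests = int(match.group(1))
        total_ms = int(match.group(3))
        if tests:  # guard against division by zero
            stats[match.group(2)] = (tests, total_ms, total_ms / tests)
    return stats


def overall_mean_ms(log_text):
    """Mean milliseconds per test from the final summary line, or None."""
    match = TOTAL_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(2)) / int(match.group(1))


def slowest_suites(stats, limit=10):
    """Suites sorted by total time, largest first: candidates for pruning."""
    ranked = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    return ranked[:limit]
```

Sorting by total time rather than mean time surfaces the suites whose removal would shorten the job the most; sorting by `mean_ms` instead would surface suites with unusually slow individual tests.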

Contributor

Ah, I wonder if this is just a GPU capability issue. But yes, I believe they are running serially (or however gtest runs them by default). I can take a look at this on Tuesday. 1h 20m is quite long...

Member Author

@geomin12 have you had time to look into optimizing the rocsparse tests on gfx1151 somehow (pruning to a smaller set, increasing parallelism, etc.)? Should we proceed with this PR even without optimizations there? I think we might have runner capacity to support the increased load. We could also make the rocsparse tests run at a reduced frequency to still get coverage for the other subproject tests.
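One way "increasing parallelism" could look, as a rough sketch: gtest's built-in sharding splits a test binary's cases across processes through the `GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables. The helper names below are hypothetical, and whether several shards sharing one gfx1151 GPU would actually finish faster is an untested assumption; the same environment variables could instead be set on separate CI jobs.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def shard_env(index, total, base_env=None):
    """Copy the environment and add gtest's sharding variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["GTEST_TOTAL_SHARDS"] = str(total)
    env["GTEST_SHARD_INDEX"] = str(index)
    return env


def run_sharded(cmd, total_shards):
    """Run `cmd` once per shard concurrently; return exit codes by shard index."""

    def run_one(index):
        return subprocess.run(cmd, env=shard_env(index, total_shards)).returncode

    # Threads only wait on child processes here, so a thread pool is sufficient.
    with ThreadPoolExecutor(max_workers=total_shards) as pool:
        return list(pool.map(run_one, range(total_shards)))
```

For example, `run_sharded(["./build/bin/rocsparse-test", "--gtest_filter=*quick*", "--matrices-dir", "./build/clients/matrices/"], 4)` would launch four processes, each running roughly a quarter of the filtered tests.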

Contributor

I'm asking Madhu about the flags/stats he used to optimize! We could attempt smoke tests, or increase the timeout (although 1h 20m is quite a while).

We now have runner capacity to handle this, so this should be okay to land and optimize imo.

Member Author

Okay, I'm leaning towards "land and optimize". Having this coverage may have caught a regression in the MIOpen build: #1248 (comment)

Contributor

sgtm, that's been my thinking too, to help increase coverage. "it's better than nothing!"

         "linux": {
             "test-runs-on": "",
@@ -46,6 +36,16 @@
             "family": "gfx1151",
         },
     },
+}
+
+# The 'postsubmit' matrix runs on 'push' triggers (for every commit to the default branch).
+amdgpu_family_info_matrix_postsubmit = {
+    "gfx950": {
+        "linux": {
+            "test-runs-on": "linux-mi355-1gpu-ossci-rocm",
+            "family": "gfx950-dcgpu",
+        }
+    },
     "gfx120x": {
         "linux": {
             "test-runs-on": "", # removed due to machine issues, label is "linux-rx9070-gpu-rocm"