[ci] Move gfx1151 builds and tests to presubmit. #1349
Good to know! I was wondering why the tests were being skipped, and I now understand why we added the gfx1151 label to that PR manually at the time.
"family": "gfx950-dcgpu",
}
},
"gfx115x": {
@geomin12 The rocsparse tests are taking 1h20m on Windows gfx1151. Should we trim that set down to a more manageable size (<30 minutes) before moving these up from postsubmit to presubmit?
Sample logs: https://github.com/ROCm/TheRock/actions/runs/17123441166/job/48575971832
That's running ./build/bin/rocsparse-test '--gtest_filter=*quick*' --matrices-dir ./build/clients/matrices/
2025-08-21T12:51:19.3856792Z [----------] Global test environment tear-down
2025-08-21T12:51:19.4065339Z [==========] 123725 tests from 130 test suites ran. (4558051 ms total)
2025-08-21T12:51:19.4065998Z [ PASSED ] 123725 tests.
I'm not seeing any huge outliers in the logs... it just looks like 100,000+ of anything being quite slow. That's a mean of 36.8 ms per test. Are these running in parallel or serially? Linux gfx942 takes ~15 minutes and gfx950 takes ~16 minutes for the same total number of tests.
Here are the totals per test suite: https://gist.github.com/ScottTodd/2a14451a47a321cf87d25ac3362b75ef. We could further compute the mean test time for each of those if we wanted to prune.
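For reference, the per-suite means mentioned above could be computed from a downloaded job log with something like the sketch below. It assumes the log contains the standard gtest per-suite summary lines (`[----------] N tests from Suite (M ms total)`), optionally prefixed by the GitHub Actions timestamp. The function names are hypothetical and not part of TheRock.

```python
import re

# gtest prints one summary line per suite when it finishes, for example:
#   [----------] 12 tests from SuiteName (345 ms total)
# search() is used so a leading CI timestamp on the line is ignored.
SUITE_RE = re.compile(r"\[-{10}\] (\d+) tests? from (.+?) \((\d+) ms total\)")


def overall_mean_ms(num_tests, total_ms):
    """Mean wall time per test in milliseconds."""
    return total_ms / num_tests


def suite_means(log_text):
    """Return {suite: (num_tests, total_ms, mean_ms)} parsed from gtest log text."""
    stats = {}
    for line in log_text.splitlines():
        match = SUITE_RE.search(line)
        if match is None:
            continue
        num_tests = int(match.group(1))
        total_ms = int(match.group(3))
        mean_ms = total_ms / num_tests if num_tests else 0.0
        stats[match.group(2)] = (num_tests, total_ms, mean_ms)
    return stats


def slowest_suites(stats, count=10):
    """Suites sorted by total time, largest first (candidates for pruning)."""
    ranked = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    return ranked[:count]


# Totals from the Windows gfx1151 run quoted above: 123725 tests, 4558051 ms.
print(f"{overall_mean_ms(123725, 4558051):.1f} ms per test")
```

Sorting by total time rather than by mean is a judgment call: a suite with many cheap tests can cost more wall time than a suite with a few slow ones.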
Ah, I wonder if this is just a GPU capability issue. But yes, I believe they are running serially (or however gtest runs them by default). I can take a look on Tuesday. 1h 20m is quite long...
@geomin12 have you had time to look into optimizing the rocsparse tests on gfx1151 somehow (pruning to a smaller set, increasing parallelism, etc.)? Should we proceed with this PR even without optimizations there? I think we might have runner capacity to support the increased load. We could also make the rocsparse tests run at a reduced frequency to still get coverage for the other subproject tests.
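On the parallelism option: one possible approach is gtest's built-in sharding, which splits a test binary's cases across processes using the `GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables. The sketch below is only an illustration; `shard_envs` and `run_sharded` are hypothetical helpers, and whether running several shards concurrently on a single gfx1151 GPU actually reduces wall time is an untested assumption. The same environment variables could instead be set per job to spread shards across runners.

```python
import os
import subprocess


def shard_envs(total_shards, base_env=None):
    """Build one environment per shard using gtest's sharding variables."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for index in range(total_shards):
        env = dict(base)
        env["GTEST_TOTAL_SHARDS"] = str(total_shards)
        env["GTEST_SHARD_INDEX"] = str(index)
        envs.append(env)
    return envs


def run_sharded(cmd, total_shards):
    """Start every shard at once, wait for all, and return their exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for env in shard_envs(total_shards)]
    return [proc.wait() for proc in procs]


# Example with the command from the sample logs (shard count is arbitrary):
# run_sharded(
#     ["./build/bin/rocsparse-test", "--gtest_filter=*quick*",
#      "--matrices-dir", "./build/clients/matrices/"],
#     total_shards=4,
# )
```

A nonzero entry in the returned list means that shard failed, so a caller would fail the job if any exit code is not 0.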
I'm asking Madhu about the flags/stats he used to optimize! We could attempt smoke tests or increase the timeout (although 1h 20m is quite a while).
We now have runner capacity to handle this, so this should be okay to land and optimize imo.
Okay, I'm leaning towards "land and optimize". Having this coverage may have caught a regression in the MIOpen build: #1248 (comment)
sgtm, that's been my thinking too, to help increase coverage. "It's better than nothing!"
This would have helped catch #1347 earlier. Without a change like this, we don't run any Windows tests on physical GPUs unless a PR opts in via labels.