
[ci] Move gfx1151 builds and tests to presubmit. #1349

Merged
ScottTodd merged 4 commits into ROCm:main from ScottTodd:presubmit-gfx1151
Sep 4, 2025

@ScottTodd
Member

This would have helped catch #1347 earlier. Without a change like this, we don't run any Windows tests on physical GPUs unless a PR opts in via labels.

@amd-justchen
Contributor

Good to know! I was wondering why the tests were being skipped, and I now understand why we added the gfx1151 label to that PR manually at the time.

"family": "gfx950-dcgpu",
}
},
"gfx115x": {
Member Author

@geomin12 The rocsparse tests are taking 1h20m on Windows gfx1151. Should we trim that set down to a more manageable size (<30 minutes) before moving these up from postsubmit to presubmit?

Sample logs: https://github.com/ROCm/TheRock/actions/runs/17123441166/job/48575971832

That's running ./build/bin/rocsparse-test '--gtest_filter=*quick*' --matrices-dir ./build/clients/matrices/

2025-08-21T12:51:19.3856792Z [----------] Global test environment tear-down
2025-08-21T12:51:19.4065339Z [==========] 123725 tests from 130 test suites ran. (4558051 ms total)
2025-08-21T12:51:19.4065998Z [  PASSED  ] 123725 tests.

I'm not seeing any huge outliers in the logs... just looks like 100000+ of anything being quite slow. That's a mean of 36.8ms per test. Are these running in parallel or serial? Linux gfx942 takes ~15 minutes and gfx950 takes ~16 minutes for the same total number of tests.
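As a quick check of the arithmetic above, here is a minimal Python sketch that reads the test count and total time from the gtest summary line quoted in the log and computes the mean. The regular expression and the `summarize` helper are illustrative assumptions, not part of TheRock's tooling.

```python
import re

# Summary line copied from the log excerpt above (timestamp prefix dropped).
SUMMARY = "[==========] 123725 tests from 130 test suites ran. (4558051 ms total)"

# Hypothetical pattern for the gtest end-of-run summary line.
SUMMARY_RE = re.compile(
    r"\[=+\] (\d+) tests? from (\d+) test suites? ran\. \((\d+) ms total\)"
)


def summarize(line):
    """Return (test_count, total_minutes, mean_ms_per_test) for a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a gtest summary line")
    tests, _suites, total_ms = (int(group) for group in match.groups())
    return tests, total_ms / 60_000, total_ms / tests


tests, minutes, mean_ms = summarize(SUMMARY)
print(f"{tests} tests, {minutes:.1f} min total, {mean_ms:.1f} ms per test")
```

This reproduces the 36.8 ms mean and shows that gtest itself accounts for roughly 76 minutes of the job.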

Here are the totals per test suite: https://gist.github.com/ScottTodd/2a14451a47a321cf87d25ac3362b75ef. We could further compute the mean test time for each of those if we wanted to prune.
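If we did want the per-suite means, something like the following could work on a downloaded job log. It is a rough sketch that assumes the log contains gtest's usual per-suite footer lines of the form `[----------] N tests from <suite> (M ms total)`. The `suite_means` helper and the sample lines are made up for illustration and are not taken from the real logs or the gist.

```python
import re

# gtest prints a footer per suite, e.g.
# "[----------] 12 tests from some_suite (345 ms total)".
# The header line for a suite has no "(... ms total)" part, so it is skipped.
SUITE_RE = re.compile(r"\[-+\] (\d+) tests? from (\S+) \((\d+) ms total\)")


def suite_means(log_lines):
    """Return (suite, tests, total_ms, mean_ms) rows, slowest suite first."""
    rows = []
    for line in log_lines:
        match = SUITE_RE.search(line)
        if match is None:
            continue
        tests = int(match.group(1))
        total_ms = int(match.group(3))
        if tests == 0:
            continue  # avoid dividing by zero on empty suites
        rows.append((match.group(2), tests, total_ms, total_ms / tests))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample lines for illustration only.
SAMPLE = [
    "2025-01-01T00:00:00Z [----------] 200 tests from quick/example_a",
    "2025-01-01T00:00:05Z [----------] 200 tests from quick/example_a (5000 ms total)",
    "2025-01-01T00:00:45Z [----------] 1000 tests from quick/example_b (40000 ms total)",
]

for suite, tests, total_ms, mean_ms in suite_means(SAMPLE):
    print(f"{suite}: {tests} tests, {total_ms} ms total, {mean_ms:.1f} ms per test")
```

Sorting by total time puts the best pruning candidates at the top, while the mean column separates suites that are slow per test from suites that simply have many tests.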

Contributor

Ah, I wonder if this is just a GPU capability issue. But yes, I believe they are running serially (or however gtest runs them by default). I can take a look at this on Tuesday. 1h 20m is quite long...

Member Author

@geomin12 have you had time to look into optimizing the rocsparse tests on gfx1151 somehow (pruning to a smaller set, increasing parallelism, etc.)? Should we proceed with this PR even without optimizations there? I think we might have runner capacity to support the increased load. We could also make the rocsparse tests run at a reduced frequency to still get coverage for the other subproject tests.
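For the "increasing parallelism" option, one possibility is GoogleTest's built-in sharding, which is controlled by the `GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables. The sketch below is only an illustration: the `shard_env` and `run_sharded` helpers are hypothetical, and nothing in this thread establishes that several rocsparse-test processes can safely share one GPU concurrently, so that part is an untested assumption.

```python
import os
import subprocess

# Command quoted earlier in this thread.
CMD = [
    "./build/bin/rocsparse-test",
    "--gtest_filter=*quick*",
    "--matrices-dir",
    "./build/clients/matrices/",
]


def shard_env(index, total, base_env=None):
    """Copy the environment and add GoogleTest's sharding variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["GTEST_TOTAL_SHARDS"] = str(total)
    env["GTEST_SHARD_INDEX"] = str(index)
    return env


def run_sharded(cmd, total):
    """Start one process per shard concurrently and return their exit codes.

    Assumption: the shards can run side by side on the same machine/GPU.
    """
    procs = [subprocess.Popen(cmd, env=shard_env(i, total)) for i in range(total)]
    return [proc.wait() for proc in procs]
```

Calling `run_sharded(CMD, total=4)` would start four processes that each run a disjoint subset of the filtered tests. The same two variables could instead be set per job in a CI matrix so that the shards land on separate runners.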

Contributor

I'm asking Madhu about the flags/stats he used to optimize! We could attempt smoke tests, or increase the timeout (although 1h 20m is quite a while).

We now have runner capacity to handle this, so this should be okay to land and optimize imo.

Member Author

Okay, I'm leaning towards "land and optimize". Having this coverage may have caught a regression in the MIOpen build: #1248 (comment)

Contributor

sgtm, that's been my thinking too, to help increase coverage. "It's better than nothing!"

@ScottTodd ScottTodd marked this pull request as ready for review September 4, 2025 17:40
@ScottTodd ScottTodd merged commit 18d4f6a into ROCm:main Sep 4, 2025
26 checks passed
@ScottTodd ScottTodd deleted the presubmit-gfx1151 branch September 4, 2025 22:38
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Sep 4, 2025