Introducing a new, hierarchical amdgpu_family_matrix layout#3248
Conversation
|
To review this you need to compare it with: build_tools/github_actions/amdgpu_family_matrix.py I added some more reviewers just for awareness of these changes. |
b016770 to
614a797
Compare
jayhawk-commits
left a comment
There was a problem hiding this comment.
Going through the new matrix format and I am liking many of the changes.
I like adding new boolean fields for control of things like test enablement, so we don't have to delete runner group names temporarily for disabling tests.
Clear separation of definitions between build, test, and release is great.
The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.
Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.
|
What will become of the |
geomin12
left a comment
There was a problem hiding this comment.
overall looks good! just a few housekeeping comments
| Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake | ||
| """ | ||
|
|
||
| # blueprint = { |
There was a problem hiding this comment.
probably redundant as Data layout comment covers this
There was a problem hiding this comment.
i left it as easy copy and paste if you need to add new archs. could maybe put it at the end of the file and add a note to the front?
There was a problem hiding this comment.
IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.
There was a problem hiding this comment.
(So I would suggest deleting this comment)
| }, | ||
| } | ||
| }, | ||
| "gfx115x": { |
There was a problem hiding this comment.
there is some logic to make these keys lowercase (less mistake prone when declaring gpu families). should we make all these keys consistent and lowercase?
There was a problem hiding this comment.
I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake
| } | ||
|
|
||
|
|
||
| amdgpu_family_info_matrix_all = { |
There was a problem hiding this comment.
super nit and this is just an opinion (don't need to implement): can we order these to match amdgpu_family_predefined_groups's order?
There was a problem hiding this comment.
it started like this in the beginning... if moving i would rather move it into numeric ordering? what do you think?
TheRock/build_tools/github_actions/configure_amdgpu_matrix.py Lines 470 to 474 in 89c5e71
|
yes the generator will have tests for it. see for current version ( #1732 ): https://github.com/ROCm/TheRock/blob/89c5e71a4ea42b537fe76572035ecd7f719a46b8/build_tools/github_actions/tests/configure_amdgpu_matrix_test.py
yeah a better overview would be nice. but definitely needs another PR for it. i also thought about splitting the file per group of families in case we get more archs. |
Testing being done in the generator sounds good to me. Testing can add coverage for some of the nits from this PR like lower-case and ordering of items, which I don't have a strong opinion on. |
|
i guess we can ignore those CI failures as the files added here are not used anywhere yet? |
Agreed |
| Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake | ||
| """ | ||
|
|
||
| # blueprint = { |
There was a problem hiding this comment.
IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.
| amdgpu_family_predefined_groups = { | ||
| # The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs). | ||
| "amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"], |
There was a problem hiding this comment.
typos here? - and consistent uppercase or lowercase of X?
| amdgpu_family_predefined_groups = { | |
| # The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs). | |
| "amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"], | |
| amdgpu_family_predefined_groups = { | |
| # The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs). | |
| "amdgpu_presubmit": ["gfx94X-dcgpu", "gfx110x-all", "gfx1151", "gfx120X-all"], |
There was a problem hiding this comment.
yeah... i had it capitalized and then let myself convince from geo that lower case is also not a bad choice. i switch it back to "X". matching to the cmake file is definitely what i had on my mind in the original design ... which i then forgot over the last couple of months :(
| }, | ||
| } | ||
| }, | ||
| "gfx115x": { |
There was a problem hiding this comment.
I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake
| "gfx103x": { | ||
| "dgpu": { |
There was a problem hiding this comment.
I don't understand the nesting here. Is this planning towards having e.g.
"gfx103x": {
"all": { },
"dgpu": { },
"igpu": { }
}?
If we only ever have one at a time, it would be simpler to collapse down a level to just
"gfx103X-all": { },
"gfx103X-dgpu": { },
"gfx103X-igpu": { }There there wouldn't an extra level of nesting between
"gfx115x": {
"gfx1153": { // <--- single target
"linux": {
"build": {
"expect_failure": True,
"build_variants": ["release"],
},
"gfx90x": {
"dcgpu": { // <--- family grouping
"linux": {
"build": {
"build_variants": ["release"],There was a problem hiding this comment.
Comments like this and my previous thought on converting the contents of the matrix in an easier to read format has me thinking that maybe it is worth exploring the reverse. We could define the matrix in format (.yml or ,json for example) that is easier to read and maintain, and then script to consume and test its contents for things we want to standardize like capitalization.
| amdgpu_family_info_matrix_all = { | ||
| "gfx94x": { |
There was a problem hiding this comment.
I wonder if this would be easier to maintain and edit if we refactored like so:
amdgpu_family_gfx94X = {
"dcgpu": { ... }
}
amdgpu_family_gfx110X = {
"dcgpu": { ... }
}
amdgpu_family_info_matrix_all = {
"gfx94X": amdgpu_family_gfx94X,
"gfx110X": amdgpu_family_gfx110X,
...
}It would be possible to add a family and then forget to "register" it in the _all list though.
There was a problem hiding this comment.
I don't understand the nesting here.
my idea of the nesting was that everything below "gfx103x" can either be group or arch that are part of "gfx103". just anything matching "gfx103" and adding on demand when needed for the CI.
if we would want to run "igpu", or just the single gpu target "gfx1030" or "gfx1032" we would have the structure like this:
"gfx103x": {
"all": { },
"dgpu": { },
"igpu": { },
"gfx1030": { }.
"gfx1032": { }
}
There was a problem hiding this comment.
thinking about it having separate files for better editing:
we could have
amdgpu_family_matrix.py
amdgpu_family_gfx103x.py
amdgpu_family_gfx950.py
amdgpu_family_gfx94x.py
...
in which we do the same trick as we do for the pytorch skip testing:
- all
amdgpu_family_gfx<family>.pyhave the same dictionary name:amdgpu_family_info amdgpu_family_matrix.pybuildsamdgpu_family_info_matrix_allby importing allamdgpu_family_gfx*.pyfiles and adding the specificamdgpu_family_infotoamdgpu_family_info_matrix_all
structure within amdgpu_family_gfx<family>.py would be:
amdgpu_family_info = {
"gfx103x": {
"all": { },
"dgpu": { },
"igpu": { },
"gfx1030": { }.
"gfx1032": { }
}
}
There was a problem hiding this comment.
maybe having a short call about it would be good?
There was a problem hiding this comment.
just as an addition. this is the current plan in my head:
- new structure for amdgpu matrix to know what belongs to build, test, release and be independent of if arch belongs to presubmit, postsubmit or nightly_ci
- new configure_ci ( configure_amdgpu_matrix.py ) that allows for default values for each configuration step and removes entire matrices that are not required (e.g. no testing wanted = no test section in the matrix returned)
- new setup.yml that propagates the container image to only have to set the image at a single place
- add weekly ci
- starting to move over the other cis to use the new layout (should be easy as the only the toplevel workflow elements need to be changed)
- add multi arch generator in configure_amdgpu_matrix.py
- looking then if and how the rest of the configurators (fetch_.. , configure_...,) etc can be integrated into configure_amdgpu_matrix.py
- finish moving all cis to the new format
…ed in cmake/therock_amdgpu_targets.cmake
| Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake | ||
| """ | ||
|
|
||
| # blueprint = { |
There was a problem hiding this comment.
(So I would suggest deleting this comment)
| }, | ||
| "sanity_check_only_for_family": True, | ||
| }, | ||
| "release": {"push_on_success": True, "bypass_tests_for_releases": True}, |
There was a problem hiding this comment.
nit: add a trailing comma after bypass_tests_for_releases so this wraps like the other entries
| "release": {"push_on_success": True, "bypass_tests_for_releases": True}, | |
| "release": { | |
| "push_on_success": True, | |
| "bypass_tests_for_releases": True, | |
| }, |
There was a problem hiding this comment.
i blame pre commit hook :-P
This PR is part of enabling CI weekly (big picture PR #1732 ) . For this, refactoring of amdgpu_family is needed for easier selection of a specific gpu, and not having to rely on knowing in which group the specific gpu is part of (presubmit/postsubmit/nightlies).
New layout puts all gpus in a single dictionary
amdgpu_family_info_matrix_all. Choices for pre/postsubmit and nightlies are now done via a list.amdgpu_family_info_matrix_allhas more depth in hierarchy to better define which parameters belong to which step: build, test, release.This new layout is introduced in parallel to the old one. The old one stays unchanged to allow gradual move for the CI to use the new layout.