Enable ROCM in CI #999
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/999
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 1 Pending
As of commit 593fb78 with merge base cedadc7:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
No ciflow labels are configured for this repo.
@atalman I'm not sure the no-sudo flag does anything. I tried a few variants for the value, like true or "true", and got the same result.
@pytorchbot rebase
docker-image: ${{ matrix.gpu-arch-type == 'rocm' && format('pytorch/manylinux2_28-builder:{0}{1}',
                                                           matrix.gpu-arch-type,
                                                           matrix.gpu-arch-version)
                  || 'pytorch/almalinux-builder' }}
pytorch/pytorch#140157
We've migrated to almalinux-builder because CentOS 7 has reached EOL.
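For readers less familiar with GitHub Actions expressions: the `docker-image` value in this thread uses the `cond && A || B` idiom as a ternary, which works here because the `format(...)` result is a non-empty string. A rough Python equivalent of that selection logic, as a sketch only (the function name `select_docker_image` and the version string in the usage lines are hypothetical, not part of the workflow):

```python
def select_docker_image(gpu_arch_type: str, gpu_arch_version: str) -> str:
    """Mirror the workflow expression: ROCm gets a versioned manylinux image,
    every other arch type falls back to the almalinux builder."""
    if gpu_arch_type == "rocm":
        # Equivalent of format('pytorch/manylinux2_28-builder:{0}{1}', type, version)
        return "pytorch/manylinux2_28-builder:{0}{1}".format(
            gpu_arch_type, gpu_arch_version
        )
    # Equivalent of the `|| 'pytorch/almalinux-builder'` fallback
    return "pytorch/almalinux-builder"


if __name__ == "__main__":
    # "6.2" and "12.4" are made-up versions used only for illustration.
    for arch, version in [("rocm", "6.2"), ("cuda", "12.4")]:
        print(arch, version, "->", select_docker_image(arch, version))
```

The point is only that non-ROCm matrix entries all land on the same fallback image, regardless of their version.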
script: |
  conda create -n venv python=3.9 -y
  conda activate venv
  echo "::group::Install newer objcopy that supports --set-section-alignment"
  yum install -y devtoolset-10-binutils
  export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
The gcc toolset is installed through the Dockerfile, as mentioned in this PR: pytorch/pytorch#140157
The same toolset is installed for ROCm in this PR: pytorch/pytorch#141609
with:
timeout: 120
no-sudo: ${{ matrix.gpu-arch-type == 'rocm' }}
rocm: ${{ matrix.gpu-arch-type == 'rocm' }}
continue-on-error: ${{ matrix.gpu-arch-type == 'rocm' }}
This should definitely not be checked in, since it's only there for us to gather a complete list of test failures. @msaroufim Would we merge this PR only after ROCm CI is fully clean? I'd rather get all these infra changes merged so that we run torchao CI on ROCm regularly, and maybe skip any failing tests for ROCm while we work separately to enable them.
It's up to you. The main constraint is that we can't have CI running red per commit or on main, since that just causes confusion and people slowly learn to ignore red. So if you'd like to merge some variant of this PR that doesn't run on commits to main, we can try to merge it more quickly.
Personally, I'd favor merging the test skips as part of this work; then we can enable tests one by one while maintaining a green CI.
@petrex Please note that the torchao team would like this PR to be merged with a clean signal for ROCm, so please skip any failing tests as part of this PR.
Done: #1563, based on the latest ROCm CI run.
Needed for pytorch/test-infra#6003 and pytorch/ao#999
Pull Request resolved: #143590
Approved by: https://github.com/atalman
Co-authored-by: Jithun Nair <[email protected]>
Happy new year @jithunnair-amd @amdfaa! Is this feature/PR ready to deploy?
2 pending items:
The credential is working now. There is a new failure related to chown on the CI job https://github.com/pytorch/ao/actions/runs/12656214677/job/35334719646, but I think that's a separate issue.
f75c6a0 to 17289e7
17289e7 to cb1331d
@msaroufim I don't seem to have access to this branch, so please see #1563 instead.
Hi @amdfaa, it looks like this PR was still failing tests when it landed. It's causing other unrelated PRs to fail the same tests: https://hud.pytorch.org/pr/pytorch/ao/1580#35788489448. Please make sure the tests are passing before landing.
This reverts commit d96c6a7.
* Enable ROCM in CI

---------

Co-authored-by: amdfaa <[email protected]>
This PR skips the unit tests that fail on ROCm and includes the infra changes needed to enable ROCm CI.

**NOTE:** This PR aims to enable the ROCm CI testing for torchao _only for pushes to main branch_. The ROCm tests should start showing up here once this PR is merged: https://hud.pytorch.org/hud/pytorch/ao/main/1?per_page=50&name_filter=regression

Torchao PRs can also trigger the ROCm CI runs using the `ciflow/rocm` PR label (#1749). Enabling ROCm CI testing on *all* torchao PRs will be done in a follow-up PR.

This pull request introduces the `skip_if_rocm` decorator across various test files to skip tests that are not yet supported on ROCm. The changes ensure that tests are conditionally skipped if ROCm is detected, improving the test suite's compatibility with different environments.

# Key changes include:

### Cherry-pick ROCm CI infra changes from #999

### Configure workflow to trigger ROCm CI only for pushes to main branch, OR on PRs with the `ciflow/rocm` label

### Introduction of `skip_if_rocm` decorator:
* Added `skip_if_rocm` import in multiple test files to conditionally skip tests not supported on ROCm. (`test/dtypes/test_affine_quantized.py`, `test/dtypes/test_floatx.py`, `test/float8/test_base.py`, `test/hqq/test_hqq_affine.py`, `test/integration/test_integration.py`, `test/kernel/test_galore_downproj.py`, `test/prototype/test_awq.py`, `test/prototype/test_low_bit_optim.py`, `test/prototype/test_splitk.py`, `test/quantization/test_galore_quant.py`, `test/quantization/test_marlin_qqq.py`, `test/sparsity/test_marlin.py`, `test/test_ops.py`, `test/test_s8s4_linear_cutlass.py`, `torchao/utils.py`)

### Application of `skip_if_rocm` decorator:
* Applied `@skip_if_rocm("ROCm development in progress")` to multiple test functions to skip them when running on ROCm. (`test/dtypes/test_affine_quantized.py`, `test/dtypes/test_floatx.py`, `test/float8/test_base.py`, `test/hqq/test_hqq_affine.py`, `test/integration/test_integration.py`, `test/kernel/test_galore_downproj.py`, `test/prototype/test_awq.py`, `test/prototype/test_low_bit_optim.py`, `test/prototype/test_splitk.py`, `test/quantization/test_galore_quant.py`, `test/quantization/test_marlin_qqq.py`, `test/sparsity/test_marlin.py`)

### Module-level skips for ROCm:
* Added module-level skips for ROCm in specific test files to skip all tests within the module if ROCm is detected. (`test/test_ops.py`, `test/test_s8s4_linear_cutlass.py`)
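To make the decorator usage above concrete, here is a minimal sketch of what a `skip_if_rocm`-style decorator could look like. This is not the implementation in `torchao/utils.py`; it assumes ROCm builds of PyTorch can be recognized by `torch.version.hip` being set, it raises `unittest.SkipTest` rather than using a test-framework helper, and the `detect` parameter and `is_rocm_build` helper are hypothetical additions so the sketch can be exercised without a GPU.

```python
import functools
import unittest


def is_rocm_build() -> bool:
    """Best-effort ROCm detection (assumption: torch.version.hip is a
    version string on ROCm builds and None otherwise)."""
    try:
        import torch
    except ImportError:
        return False
    return getattr(torch.version, "hip", None) is not None


def skip_if_rocm(message=None, *, detect=is_rocm_build):
    """Skip the decorated test when running on ROCm.

    `detect` is a hypothetical hook, not part of the torchao API; it lets
    callers substitute the detection logic, e.g. in unit tests.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if detect():
                reason = "Skipping the test in ROCm"
                if message:
                    reason += f": {message}"
                # unittest.SkipTest is honored by unittest and pytest runners.
                raise unittest.SkipTest(reason)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@skip_if_rocm("ROCm development in progress")
def test_example():
    # Placeholder test body for illustration only.
    assert 1 + 1 == 2
```

Checking for ROCm inside the wrapper, rather than at decoration time, keeps test collection cheap and means the skip reason shows up per test in the CI report.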
Salient points:
The above PR shows that we've migrated to almalinux-builder because CentOS 7 has reached EOL. regression_test.yml has been changed accordingly so that it no longer installs devtoolset-10.
torchao/utils.py
in invocation of torch.cuda.get_device_properties()
Needs changes in pytorch/test-infra#6104