104 changes: 104 additions & 0 deletions .github/workflows/test_rocm_wheels.yml
@@ -0,0 +1,104 @@
name: Test ROCm Wheels

on:
  workflow_dispatch:
    inputs:
      amdgpu_family:
        description: GPU family to test (e.g., gfx94X-dcgpu, gfx110X-all)
        required: true
        type: string
        default: "gfx94X-dcgpu"
      test_runs_on:
        description: Runner label to use. The selected runner should have a GPU supported by amdgpu_family
        required: true
        type: string
        default: "linux-mi325-1gpu-ossci-rocm-frac"
      package_index_url:
        description: Base Python package index URL (without GPU family subdir)
        required: true
        type: string
        default: "https://rocm.nightlies.amd.com/v2"
      python_version:
        required: true
        type: string
        default: "3.12"
      rocm_version:
        description: ROCm version to pip install (e.g. "7.10.0a20251124")
        required: true
        type: string

  workflow_call:
    inputs:
      amdgpu_family:
        required: true
        type: string
      test_runs_on:
        required: true
        type: string
      package_index_url:
        required: true
        type: string
      python_version:
        required: true
        type: string
      rocm_version:
        required: true
        type: string
      repository:
        description: "Repository to checkout. Otherwise, defaults to `github.repository`."
        type: string
      ref:
        description: "Branch, tag or SHA to checkout. Defaults to the reference or SHA that triggered the workflow."
        type: string

permissions:
  contents: read

run-name: Test ROCm Wheels (${{ inputs.amdgpu_family }}, ${{ inputs.rocm_version }}, ${{ inputs.test_runs_on }})

Member

Suggested change
run-name: Test ROCm Wheels (${{ inputs.amdgpu_family }}, ${{ inputs.rocm_version }}, ${{ inputs.test_runs_on }})
run-name: Test ROCm Wheels (${{ inputs.amdgpu_family }}, ${{ inputs.rocm_version }})

Not sure we really need the runner name here. If we do want it, we could drop the amdgpu_family instead.

Member Author

Copied from the style used for pytorch wheel tests:

run-name: Test PyTorch (${{ inputs.amdgpu_family }}, ${{ inputs.torch_version}}, ${{ inputs.test_runs_on }})
jobs:
  test_wheels:
    name: Test PyTorch | ${{ inputs.amdgpu_family }}

I think it's fine to include for workflow_dispatch runs.

Once integrated, most uses of this workflow will go through workflow_call.

Member

Well, with the default for test_runs_on we would get linux-mi325-1gpu-ossci-rocm-frac, which already more or less names the GPU (though not the gfx architecture). Fine as is for me, though.


jobs:
  test_wheels:
    name: Test ROCm Wheels | ${{ inputs.amdgpu_family }}
    runs-on: ${{ inputs.test_runs_on }}
    container:
      image: ${{ contains(inputs.test_runs_on, 'linux') && 'ghcr.io/rocm/no_rocm_image_ubuntu24_04@sha256:405945a40deaff9db90b9839c0f41d4cba4a383c1a7459b28627047bf6302a26' || null }}
      options: --ipc host
        --group-add video
        --device /dev/kfd
        --device /dev/dri
        --group-add 110
        --env-file /etc/podinfo/gha-gpu-isolation-settings
        --user 0:0 # Running as root, by recommendation of GitHub: https://docs.github.com/en/actions/reference/workflows-and-actions/dockerfile-support#user
    defaults:
      run:
        shell: bash
    env:
      VENV_DIR: ${{ github.workspace }}/.venv

    steps:
      - name: Checkout
        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
        with:
          repository: ${{ inputs.repository || github.repository }}
          ref: ${{ inputs.ref || '' }}

      - name: Set up Python
        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
        with:
          python-version: ${{ inputs.python_version }}

      - name: Set up virtual environment and install ROCm packages
        run: |
          python build_tools/setup_venv.py ${VENV_DIR} \
            --packages "rocm[libraries,devel]==${{ inputs.rocm_version }}" \
            --index-url=${{ inputs.package_index_url }} \
            --index-subdir=${{ inputs.amdgpu_family }} \
            --activate-in-future-github-actions-steps
Comment on lines +92 to +96

Member

I think this needs more work, since we won't have a full index for artifacts. Fine to address in a follow-up, but we might want to add a TODO here (if it isn't already tracked in the issue).

Member Author

Yep, using this for CI / dev artifacts will need some extra work (changes to setup_venv.py and/or this workflow). I wanted to get the base workflow in place and tested using our existing dev/nightly packages.
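
For context on what would need to change, here is a rough Python sketch of what the install step above amounts to when run against the nightly index. It assumes that setup_venv.py combines --index-url and --index-subdir into `<index-url>/<index-subdir>/`; that layout and both helper names are assumptions for illustration, not code taken from setup_venv.py.

```python
import shlex


def build_index_url(base_url: str, subdir: str) -> str:
    """Join a base index URL and a GPU-family subdir (assumed layout)."""
    return f"{base_url.rstrip('/')}/{subdir.strip('/')}/"


def build_pip_command(base_url: str, family: str, rocm_version: str) -> list[str]:
    """Build a pip command comparable to the workflow's install step.

    Hypothetical helper: the real logic lives in build_tools/setup_venv.py.
    """
    return [
        "python", "-m", "pip", "install",
        f"--index-url={build_index_url(base_url, family)}",
        f"rocm[libraries,devel]=={rocm_version}",
    ]


# Values taken from the workflow_dispatch defaults and examples in this PR.
command = build_pip_command(
    "https://rocm.nightlies.amd.com/v2", "gfx94X-dcgpu", "7.10.0a20251124"
)
print(shlex.join(command))
```

Anything that does not publish a complete per-family index (such as CI artifacts) would not fit this shape as is, which is the follow-up work mentioned above.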


      - name: Show installed packages
        run: |
          pip freeze

      - name: Run rocm-sdk sanity tests
        run: |
          rocm-sdk test
Comment on lines +102 to +104

Member Author

Testing on our self-hosted runners with GPUs now:

Both Linux runs failed with

testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) ... ++ Exec [/__w/TheRock/TheRock]$ /__w/TheRock/TheRock/.venv/bin/python -P -m rocm_sdk path --root
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libgfortran.so.5: cannot open shared object file: No such file or directory
  testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1.0')) ... ERROR
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libgfortran.so.5: cannot open shared object file: No such file or directory
  testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1')) ... ERROR
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libgfortran.so.5: cannot open shared object file: No such file or directory
  testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so')) ... ERROR
======================================================================
ERROR: testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1.0'))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/rocm_sdk/tests/devel_test.py", line 153, in testSharedLibrariesLoad
    subprocess.check_call(
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/__w/TheRock/TheRock/.venv/bin/python', '-P', '-c', 'import ctypes; import sys; ctypes.CDLL(sys.argv[1])', '/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1.0']' returned non-zero exit status 1.
======================================================================
ERROR: testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1'))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/rocm_sdk/tests/devel_test.py", line 153, in testSharedLibrariesLoad
    subprocess.check_call(
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/__w/TheRock/TheRock/.venv/bin/python', '-P', '-c', 'import ctypes; import sys; ctypes.CDLL(sys.argv[1])', '/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so.1']' returned non-zero exit status 1.
======================================================================
ERROR: testSharedLibrariesLoad (rocm_sdk.tests.devel_test.ROCmDevelTest.testSharedLibrariesLoad) [Check shared library loads] (so_path=PosixPath('/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so'))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/rocm_sdk/tests/devel_test.py", line 153, in testSharedLibrariesLoad
    subprocess.check_call(
  File "/__w/_tool/Python/3.12.12/x64/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/__w/TheRock/TheRock/.venv/bin/python', '-P', '-c', 'import ctypes; import sys; ctypes.CDLL(sys.argv[1])', '/__w/TheRock/TheRock/.venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib/libhipsolver_fortran.so']' returned non-zero exit status 1.
----------------------------------------------------------------------
Ran 25 tests in 12.576s
FAILED (errors=3)

Looks similar to #1877 - I thought we fixed that? Is this a real issue or is the test workflow not configured correctly?

Member Author

I can reproduce this locally with our test dockerfile. Will follow up with a new issue.
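
For anyone who wants to check a container image without running the full test suite, the failing check can be approximated with a short script. This is a minimal sketch modeled on the subprocess command visible in the traceback above (`ctypes.CDLL(sys.argv[1])` in a fresh interpreter); the function names are hypothetical, and the `-P` flag used by the real test is omitted so the sketch also runs on Python versions older than 3.11.

```python
import subprocess
import sys
from pathlib import Path


def library_loads(so_path: Path) -> tuple[bool, str]:
    """Try to load so_path in a fresh interpreter; return (ok, stderr)."""
    # A subprocess keeps a failed or crashing load from affecting this process.
    proc = subprocess.run(
        [
            sys.executable,
            "-c",
            "import ctypes; import sys; ctypes.CDLL(sys.argv[1])",
            str(so_path),
        ],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stderr.strip()


def report_unloadable(lib_dir: Path, pattern: str = "*.so*") -> list[Path]:
    """Return the libraries under lib_dir that fail to load, printing the reason."""
    failures = []
    for so_path in sorted(lib_dir.glob(pattern)):
        ok, stderr = library_loads(so_path)
        if not ok:
            # The last traceback line carries the OSError message,
            # e.g. a missing dependency such as libgfortran.so.5.
            reason = stderr.splitlines()[-1] if stderr else "unknown error"
            print(f"FAILED {so_path.name}: {reason}")
            failures.append(so_path)
    return failures
```

Calling `report_unloadable(Path(".venv/lib/python3.12/site-packages/_rocm_sdk_devel/lib"))` from the workspace root would cover the directory named in the log, assuming the same venv layout.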
