Add multi-architecture packaging support #3561
ScottTodd
left a comment
The overall approach here looks reasonable. Are there any tests (unit tests or manual validation) for this work, beyond the successful builds on those workflow runs? PRs that don't have automated test coverage are significantly more difficult to review, as I've stated before.
Enable building arch-specific package variants by iterating over
multiple gfx architectures per base package. Non-versioned packages
are now only created for generic architecture in multi-arch mode.
Key changes:
- Add --enable-multi-arch and multi-value --target args to build_package.py
- Add resolve_versioned_dependencies() and get_dependency_list_for_multiarch()
to packaging_utils.py to handle dependency resolution for generic, arch-specific,
and single-arch metapackages
- Add expand_metapackage_to_all_archs() for generic metapackage dependencies
- Update is_gfxarch_package() and filter_components_fromartifactory() to support
multi-arch mode
- Move clean_package_build_dir() from build_package.py to packaging_utils.py
- Fix build summary counts by tracking failed packages explicitly instead of
inferring from a hardcoded 2x multiplier, which caused negative failure counts
- Fix typo in artifact directory error message
Extend multi-arch packaging to properly handle dependencies for
architecture-specific packages. Generic packages now exclude gfxarch
dependencies (delegated to gfx-specific variants), while gfx-specific
packages depend on their generic counterpart plus architecture-specific
dependencies.
- In resolve_versioned_dependencies: add branch for gfx-specific
non-meta packages to split generic and gfxarch dependencies
- In get_dependency_list_for_multiarch: add branch for generic packages
to filter out gfxarch deps, and update gfx-specific branch to include
both generic self-dependency and gfxarch deps with arch suffix
This ensures proper dependency chains in multi-arch mode where each
gfx-specific package pulls in the generic base plus its arch-specific
requirements.
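The dependency split described above can be sketched roughly as follows. This is an illustration, not the code in packaging_utils.py: the function name `split_dependencies`, its parameters, and the `-<arch>` suffix format are all assumptions made for the example.

```python
def split_dependencies(pkg_name, deps, gfxarch_pkgs, gfx_arch=None):
    """Return the dependency list for one package variant.

    Hypothetical helper: `gfxarch_pkgs` is the set of package names that
    have architecture-specific variants, and `gfx_arch` is None when the
    generic variant is being built.
    """
    generic_deps = [d for d in deps if d not in gfxarch_pkgs]
    arch_deps = [d for d in deps if d in gfxarch_pkgs]

    if gfx_arch is None:
        # Generic package: gfxarch deps are left to the gfx-specific variants.
        return generic_deps

    # gfx-specific package: depend on the generic counterpart of this package,
    # plus each gfxarch dependency with the arch suffix (assumed format).
    return [pkg_name] + [f"{d}-{gfx_arch}" for d in arch_deps]
```

With this shape, installing a gfx-specific variant pulls in the generic base, and the generic base never drags in binaries for an architecture the user did not ask for.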
Changes to build_package.py:
- Handle empty sourcedir_list gracefully in multi-arch mode for DEB packages
* Return empty list instead of sys.exit() to allow build continuation
* Log ERROR message for visibility
- Add warning for RPM packages with empty sourcedir_list in multi-arch mode
* RPM can create empty packages, so continue with warning
- Track failed architecture variants in failed_pkglist
* When a package fails for specific architecture (e.g., gfx1151),
add variant name to failed list
* Provides visibility into which architecture variants failed vs succeeded
- Preserve backward compatibility for single-arch mode
* Still exits on error when not in multi-arch mode
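A minimal sketch of the error-handling policy listed above, assuming a single decision point per package variant. The function name, its signature, the `<pkg>-<arch>` variant naming, and the choice to record the skipped DEB variant in `failed_pkglist` are assumptions for illustration; build_package.py may structure this differently.

```python
import sys


def handle_empty_sourcedir(pkg_name, pkg_type, multi_arch, failed_pkglist, gfx_arch=None):
    """Decide what to do when sourcedir_list is empty (hypothetical helper).

    Returns True if the build of this variant should continue.
    """
    variant = f"{pkg_name}-{gfx_arch}" if gfx_arch else pkg_name

    if not multi_arch:
        # Single-arch mode keeps the previous behavior: abort the build.
        sys.exit(f"ERROR: empty sourcedir_list for {variant}")

    if pkg_type == "rpm":
        # RPM can create empty packages, so only warn and continue.
        print(f"WARNING: empty sourcedir_list for {variant}")
        return True

    # DEB in multi-arch mode: log, record the failed variant, keep going
    # with the remaining packages instead of exiting.
    print(f"ERROR: empty sourcedir_list for {variant}")
    failed_pkglist.append(variant)
    return False
```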
This commit introduces a new workflow for building multi-arch native Linux
packages (DEB/RPM) that consolidates binaries for all GPU families into
unified packages, along with supporting infrastructure improvements.
Workflow changes:
- Add multi_arch_build_native_linux_packages.yml reusable workflow
- Fetches artifacts for all GPU families (gfx94X-dcgpu, gfx120X-all, etc.)
- Builds unified DEB/RPM packages containing all architectures
- Implements comprehensive S3 bucket selection logic with decision tree
- Supports multiple bucket types: CI artifacts, release packages, internal
- Determines appropriate IAM roles based on build context
build_package.py enhancements:
- Add normalize_target_list() function for flexible input parsing
- Accepts semicolon, comma, or space-separated GPU family lists
- Works seamlessly with existing --enable-multi-arch flag
- Example: "gfx94X-dcgpu;gfx120X-all" or "gfx94X-dcgpu,gfx120X-all"
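For reference, the input parsing described for `normalize_target_list()` could look something like the sketch below. Only the function name and the accepted separators come from the commit message; the body and the list-or-string argument handling are assumptions.

```python
import re


def normalize_target_list(targets):
    """Flatten target arguments split on semicolons, commas, or whitespace.

    Accepts either one string or a list of strings (as argparse produces
    for a multi-value option). Sketch only; not the actual implementation.
    """
    if isinstance(targets, str):
        targets = [targets]
    result = []
    for item in targets:
        # Any run of ';', ',' or whitespace counts as one separator.
        for name in re.split(r"[;,\s]+", item):
            if name:  # drop empty strings from leading/trailing separators
                result.append(name)
    return result
```

Called with `"gfx94X-dcgpu;gfx120X-all"` or with `"gfx94X-dcgpu,gfx120X-all"`, the sketch yields the same two-element list.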
upload_package_repo.py improvements:
- Add --s3-prefix parameter for explicit S3 prefix override
- Add "ci" job type to choices (dev/nightly/prerelease/ci)
- Make --amdgpu-family parameter optional (backward compatible)
- Implement S3 prefix logic:
* Explicit --s3-prefix: use provided value
* dev/nightly: <pkg_type>/<YYYYMMDD>-<artifact_id>
* prerelease: v3/packages/<pkg_type>
* ci: v3/packages/<pkg_type>/<YYYYMMDD>-<artifact_id>
- Maintain backward compatibility with existing callers
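The prefix rules above map directly onto a small function. The sketch below mirrors the listed cases; the function name and parameter names are hypothetical, and upload_package_repo.py may compute the date and artifact ID differently.

```python
def compute_s3_prefix(job_type, pkg_type, date, artifact_id, s3_prefix=None):
    """Return the S3 prefix for a package upload (illustrative sketch).

    `date` is expected as YYYYMMDD. An explicit `s3_prefix` always wins.
    """
    if s3_prefix:
        return s3_prefix
    if job_type in ("dev", "nightly"):
        return f"{pkg_type}/{date}-{artifact_id}"
    if job_type == "prerelease":
        return f"v3/packages/{pkg_type}"
    if job_type == "ci":
        return f"v3/packages/{pkg_type}/{date}-{artifact_id}"
    raise ValueError(f"unsupported job type: {job_type}")
```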
S3 bucket strategy:
- therock-ci-artifacts: Default CI builds (ROCm/TheRock non-fork)
- therock-ci-artifacts-external: Fork PRs and external repositories
- therock-artifacts-internal: ROCm/therock-releases-internal builds
- therock-dev-packages: Dev release packages (release_type=dev)
- therock-nightly-packages: Nightly release packages (release_type=nightly)
- therock-release-packages: Official releases (release_type=release/prerelease)
IAM role mapping:
- CI builds (job_type=ci): arn:aws:iam::692859939525:role/therock-ci
- Release builds: arn:aws:iam::692859939525:role/therock-{release_type}
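Taken together, the bucket strategy and role mapping amount to a small decision tree. The sketch below is one possible reading of it: the bucket names, role ARNs, and repository names come from the lists above, while the function name, its parameters, and the order in which the checks are evaluated are assumptions.

```python
AWS_ACCOUNT_ID = "692859939525"


def select_bucket_and_role(repository, is_fork, release_type=None):
    """Pick an S3 bucket and IAM role ARN (illustrative sketch only).

    `release_type` is None for ordinary CI builds.
    """
    if release_type in ("dev", "nightly"):
        bucket = f"therock-{release_type}-packages"
    elif release_type in ("release", "prerelease"):
        bucket = "therock-release-packages"
    elif repository == "ROCm/therock-releases-internal":
        bucket = "therock-artifacts-internal"
    elif repository == "ROCm/TheRock" and not is_fork:
        bucket = "therock-ci-artifacts"
    else:
        # Fork PRs and external repositories.
        bucket = "therock-ci-artifacts-external"

    # CI builds share one role; release builds use a role per release type.
    role_name = f"therock-{release_type}" if release_type else "therock-ci"
    return bucket, f"arn:aws:iam::{AWS_ACCOUNT_ID}:role/{role_name}"
```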
Introduce helper functions to streamline dependency and name field
processing across DEB and RPM package generation.
Changes:
- Add process_dependency_field() for Get -> Filter -> Transform pattern
* Handles DEBDepends, DEBRecommends, DEBSuggests
* Handles RPMRequires, RPMRecommends, RPMSuggests
* Returns empty string for empty dependency lists (fixes IndexError)
* Supports use_multiarch flag for main dependencies
- Add process_name_field() for name fields (Provides, Conflicts, etc.)
* Simplifies get -> transform -> join operations
* Supports optional transform functions (e.g., debian_replace_devel_name)
- Refactor generate_control_file() to use new helpers
- Refactor generate_spec_file() to use new helpers
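The Get -> Filter -> Transform pattern can be illustrated with a stripped-down helper. The signature below is an assumption (the real `process_dependency_field()` also takes a `use_multiarch` flag, which is omitted here); the point is that joining an empty list returns an empty string instead of indexing into it.

```python
def process_dependency_field(pkg_info, field, keep=None, transform=None):
    """Build the comma-separated value for one dependency field.

    Sketch with a hypothetical signature: `pkg_info` is a dict of package
    metadata, `keep` an optional predicate, `transform` an optional
    per-name rewrite function.
    """
    deps = pkg_info.get(field) or []              # Get
    if keep is not None:
        deps = [d for d in deps if keep(d)]       # Filter
    if transform is not None:
        deps = [transform(d) for d in deps]       # Transform
    # An empty list joins to "", so callers never hit an IndexError.
    return ", ".join(deps)
```

A usage example, with a lambda standing in for a helper such as `debian_replace_devel_name`:

```python
pkg = {"DEBDepends": ["rocm-core", "hipblas-devel"], "DEBSuggests": []}
process_dependency_field(pkg, "DEBDepends", transform=lambda d: d.replace("-devel", "-dev"))
process_dependency_field(pkg, "DEBSuggests")
```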
Bug fixes:
- Use boolean True instead of string "True" for disable_dh_strip
- Fix typo: "buillds" -> "builds"
- Remove duplicate failed package tracking that caused pkg_name to be
added twice when default architecture variant failed
- Revert multi_arch_ci_linux.yml to remove premature native package integration
- Add build_tools/packaging/linux/get_s3_config.py to replace inline bash logic
  - Determines S3 bucket, prefix, and job type based on release type and repository
  - Extracts date from ROCm package version for consistency between version and S3 path
  - Supports wheel, deb, and rpm package types
- Update multi_arch_build_native_linux_packages.yml:
  - Use get_s3_config.py script for S3 configuration
  - Optimize "Fetch Artifacts" step to eliminate bash loop
  - Convert semicolon-separated GPU families to comma-separated format
  - Pass full family list via --amdgpu-targets parameter
- Add comprehensive unit tests in build_tools/packaging/linux/test/get_s3_config_test.py
  - 23 tests covering all decision tree branches and date extraction logic
…check
- Add extract_date_from_version() function to parse dates from ROCm package versions
  - Supports Debian (8.1.0~dev20251203), RPM (8.1.0~20251203gf689a8e), and wheel (7.10.0a20251021) formats
  - Falls back to current date if no date pattern found
- Update determine_s3_config() to accept rocm_version parameter
  - Use extracted date for S3 path consistency with package version
  - Ensures rebuilding same version produces same S3 location
- Add --rocm-version CLI argument (optional, defaults to None)
- Remove obsolete ROCm/therock-releases-internal repository check
  - Logic is redundant as prerelease/release types are handled by release_type parameter
  - Simplifies decision tree from 4 branches to 3
- Update tests to match implementation (22 tests, all passing)

This fixes the workflow, which was already passing --rocm-version even though
the script did not accept it, causing failures.
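A rough sketch of how date extraction along the lines of `extract_date_from_version()` might work for the three version formats mentioned. The regex, the `today` parameter (added here to make the fallback testable), and the use of UTC for the fallback date are assumptions, not taken from get_s3_config.py.

```python
import re
from datetime import datetime, timezone


def extract_date_from_version(version, today=None):
    """Return the YYYYMMDD date embedded in a ROCm package version.

    Falls back to `today` (or the current UTC date) when no valid date
    is found. Illustrative sketch only.
    """
    fallback = today or datetime.now(timezone.utc).strftime("%Y%m%d")
    if not version:
        return fallback
    # Look for standalone runs of exactly eight digits.
    for candidate in re.findall(r"(?<!\d)\d{8}(?!\d)", version):
        try:
            datetime.strptime(candidate, "%Y%m%d")  # reject e.g. 99999999
        except ValueError:
            continue
        return candidate
    return fallback
```

Because the date comes from the version string rather than the wall clock, rebuilding the same version resolves to the same S3 location.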
debugedit truncation issue affecting multi-arch builds.
## Motivation

As part of #3323, we're switching ROCm packaging from being "single-arch [family]" to being "multi-arch". This switches our default CI workflow from "CI" to "multi-arch CI" on presubmit (`pull_request`), fixing #3337.

> [!NOTE]
> The "ci.yml" workflow will still run on `push` events and opt-in on `pull_request` to help migrate remaining features.

The "CI" pipelines run fully independent single-stage builds for each GPU family (e.g. gfx110X-all, gfx94X-dcgpu, etc.) while the "Multi-Arch CI" pipelines run a single build (per Windows/Linux platform) that has a graph of generic and target-specific build stages.

We've been running these workflows in parallel while getting multi-arch functional. Now, enough features have been implemented (building artifacts, testing artifacts, building python packages, etc.) that switching is possible. Keeping multi-arch CI as opt-in / postsubmit only has been leading to increased feature drift and slower progress on multi-arch support work, so we want to switch as soon as possible.

> [!NOTE]
> The multi-arch workflows are still building with `-DTHEROCK_FLAG_KPACK_SPLIT_ARTIFACTS=OFF`, see #3338.

## Technical Details

### Triggering changes to workflows

| workflow | event | behavior before | behavior after |
| -- | -- | -- | -- |
| multi_arch_ci.yml | `pull_request` | opt-in | ⚠️ always runs |
| multi_arch_ci.yml | `push` | always runs | always runs |
| ci.yml | `pull_request` | always runs | ⚠️ opt-in |
| ci.yml | `push` | always runs | always runs |
| ci_nightly.yml | `schedule` | runs | runs |
| ci_asan.yml | `schedule` | runs | runs |
| ci_tsan.yml | `schedule` | runs | runs |

### Feature parity between workflows

We've roughly achieved feature parity between Multi-arch CI (new) and CI (old). These features are missing:

| Feature | Supported in ci.yml? | Supported in multi_arch_ci.yml? | Notes |
| -- | -- | -- | -- |
| gfx950-dcgpu testing | Yes | ⚠️ No | #3288 |
| gfx950-dcgpu pytorch wheel build | Yes | ⚠️ No | #3288 |
| `run_functional_tests` input for test_artifacts.yml | Yes | ⚠️ No | |
| `test_linux_benchmarks` job | Yes | ⚠️ No | |
| `test_python_packages_per_family` on Windows | Yes | ⚠️ No | |
| "Build summary" with links to logs/artifacts | Yes | ⚠️ No | |
| `build_native_linux_packages` job | Yes | ⚠️ No | #3561 |
| `resource_info.py` and `analyze_build_times.py` | Yes | ⚠️ No | |

## Test Plan and Results

- [x] Check that CI _does not_ run on this PR by default
- [x] Check that multi-arch CI _does_ run on this PR by default
- [x] Create and add new CI opt-in label, check that adding it runs CI
- [ ] Continue to monitor workflow status on `push` and on PRs created after this for issues

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
amd-aakash
left a comment
Please cross check if the failing CI checks are known issues.
The two failing CI checks are unrelated to this PR.
- name: Determine IAM role
  id: iam_role
  run: |
    # ================================================================
    # IAM Role Selection Logic
    # ================================================================
    # Determines which AWS IAM role to assume based on job_type from s3_config step.
    #
    # Role Mapping:
    # ├─ IF job_type == "ci"
    # │  └─ Use: therock-ci role
    # │     (For all CI buckets: therock-ci-artifacts, therock-ci-artifacts-external, therock-artifacts-internal)
    # │
    # └─ ELSE (job_type == dev/nightly/prerelease/release)
    #    └─ Use: therock-${job_type} role
    #       (For package buckets: therock-dev-packages, therock-nightly-packages, etc.)
    #
    # ================================================================

    JOB_TYPE="${{ steps.s3_config.outputs.job_type }}"

    if [[ "${JOB_TYPE}" == "ci" ]]; then
      # CI builds use the shared CI role (for all artifact buckets)
      IAM_ROLE="arn:aws:iam::692859939525:role/therock-ci"
      echo "✓ Using CI role: ${IAM_ROLE}"
    else
      # Release builds use release-type-specific roles (for package buckets)
      IAM_ROLE="arn:aws:iam::692859939525:role/therock-${JOB_TYPE}"
      echo "✓ Using release-type role: ${IAM_ROLE}"
    fi

    echo "iam_role=${IAM_ROLE}" >> $GITHUB_OUTPUT
FYI. I'm expecting to refactor this as part of standing up multi-arch releases (#3334). I think we can move this into the setup job and then plumb it through to this and other jobs via workflow inputs, rather than recompute in each job that needs to know a bucket and IAM role for that bucket
@ScottTodd I have made some more changes to the S3 config as part of #4310. Let me know if you have any comments.
I sent #4386. My focus is on the core ROCm release pipelines first; after that I can look more closely at what the native packages are doing.
## Multi-arch native package workflow improvements

### `multi_arch_build_native_linux_packages.yml`

- **Switch artifact fetching to `artifact_manager.py`**: Replaces `fetch_artifacts.py` with `artifact_manager.py fetch` for consistent multi-arch artifact fetching across all GPU families
- **Fix system requirements**: Use `llvm-20` instead of `llvm`; add `pyzstd` for `.tar.zst` artifact decompression
- **Add `package_repository_url` output**: Workflow now exposes the public S3 install URL as an output for downstream consumption (e.g., install test jobs)
- **Add AWS credential guard**: `Configure AWS Credentials` step now only runs for authorized repositories (`ROCm/TheRock`, `ROCm/rockrel`), skipped for forks

### `docs/packaging/native_packaging.md`

- Add S3 bucket/prefix/URL reference tables for all release types (dev, nightly, prerelease, release, CI)
- Separate tables for GFX Specific Packages and Multi-Arch Packages

## Motivation

Update as per the review comments on #3561

---------

Co-authored-by: ArvindCheru <Aravindan.Cheruvally@amd.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Test Output:
https://github.com/ROCm/TheRock/actions/runs/22646265113/job/65634920221
https://github.com/ROCm/TheRock/actions/runs/22646265113/job/65634920219