
Add multi-architecture packaging support#3561

Merged
raramakr merged 17 commits into
mainfrom
users/raramakr/multi-arch
Mar 31, 2026

Conversation

@raramakr
Contributor

@raramakr raramakr commented Feb 23, 2026

Enable building arch-specific package variants by iterating over
multiple gfx architectures per base package. Non-versioned packages
are now only created for generic architecture in multi-arch mode.

Workflow changes:

  • Add multi_arch_build_native_linux_packages.yml reusable workflow
  • Fetches artifacts for all GPU families (gfx94X-dcgpu, gfx120X-all, etc.)
  • Builds unified DEB/RPM packages containing all architectures
  • Implements comprehensive S3 bucket selection logic with decision tree
  • Supports multiple bucket types: CI artifacts, release packages, internal
  • Determines appropriate IAM roles based on build context

build_package.py enhancements:

  • Add --enable-multi-arch and multi-value --target args to build_package.py
  • Add resolve_versioned_dependencies() and get_dependency_list_for_multiarch() to packaging_utils.py to handle dependency resolution for generic, arch-specific, and single-arch metapackages
  • Add expand_metapackage_to_all_archs() for generic metapackage dependencies
  • Update is_gfxarch_package() and filter_components_fromartifactory() to support multi-arch mode
  • Move clean_package_build_dir() from build_package.py to packaging_utils.py
  • Fix build summary counts by tracking failed packages explicitly instead of inferring from a hardcoded 2x multiplier, which caused negative failure counts
  • Add normalize_target_list() function for flexible input parsing
  • Fix typo in artifact directory error message

upload_package_repo.py improvements:

  • Add --s3-prefix parameter for explicit S3 prefix override
  • Add "ci" job type to the release_type choices (dev/nightly/prerelease/ci)

Test Output:
https://github.com/ROCm/TheRock/actions/runs/22646265113/job/65634920221
https://github.com/ROCm/TheRock/actions/runs/22646265113/job/65634920219

Member

@marbre marbre left a comment

Drive-by comment

Comment thread build_tools/packaging/linux/package.json
Member

@ScottTodd ScottTodd left a comment

The overall approach here looks reasonable. Are there any tests (unit tests or manual validation) for this work, beyond the successful builds on those workflow runs? PRs that don't have automated test coverage are significantly more difficult to review, as I've stated before.

Comment thread .github/workflows/multi_arch_build_native_linux_packages.yml Outdated
Comment thread .github/workflows/multi_arch_build_native_linux_packages.yml Outdated
Comment thread .github/workflows/multi_arch_build_native_linux_packages.yml Outdated
Comment thread .github/workflows/multi_arch_build_native_linux_packages.yml Outdated
Comment thread .github/workflows/multi_arch_ci_linux.yml Outdated
Comment thread build_tools/packaging/linux/build_package.py
Comment thread build_tools/packaging/linux/build_package.py
Comment thread build_tools/packaging/linux/build_package.py Outdated
Comment thread build_tools/packaging/linux/build_package.py Outdated
Comment thread build_tools/packaging/linux/build_package.py Outdated
raramakr and others added 16 commits March 26, 2026 18:39
  Enable building arch-specific package variants by iterating over
  multiple gfx architectures per base package. Non-versioned packages
  are now only created for generic architecture in multi-arch mode.

  Key changes:
  - Add --enable-multi-arch and multi-value --target args to build_package.py
  - Add resolve_versioned_dependencies() and get_dependency_list_for_multiarch()
    to packaging_utils.py to handle dependency resolution for generic, arch-specific,
    and single-arch metapackages
  - Add expand_metapackage_to_all_archs() for generic metapackage dependencies
  - Update is_gfxarch_package() and filter_components_fromartifactory() to support
    multi-arch mode
  - Move clean_package_build_dir() from build_package.py to packaging_utils.py
  - Fix build summary counts by tracking failed packages explicitly instead of
    inferring from a hardcoded 2x multiplier, which caused negative failure counts
  - Fix typo in artifact directory error message
  Extend multi-arch packaging to properly handle dependencies for
  architecture-specific packages. Generic packages now exclude gfxarch
  dependencies (delegated to gfx-specific variants), while gfx-specific
  packages depend on their generic counterpart plus architecture-specific
  dependencies.

  - In resolve_versioned_dependencies: add branch for gfx-specific
    non-meta packages to split generic and gfxarch dependencies
  - In get_dependency_list_for_multiarch: add branch for generic packages
    to filter out gfxarch deps, and update gfx-specific branch to include
    both generic self-dependency and gfxarch deps with arch suffix

  This ensures proper dependency chains in multi-arch mode where each
  gfx-specific package pulls in the generic base plus its arch-specific
  requirements.
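
  The dependency split described above can be sketched roughly as follows. This is an illustration only, not the code in packaging_utils.py: the helper name `multiarch_dependencies`, its parameters, the `-<gfx_arch>` suffix format, and the package names in the usage example are all assumptions.

```python
def multiarch_dependencies(pkg_name, deps, gfxarch_pkgs, gfx_arch=None):
    """Hypothetical helper: compute dependencies for one package variant.

    pkg_name     -- base (generic) package name
    deps         -- dependency names declared for the base package
    gfxarch_pkgs -- set of package names that have per-architecture variants
    gfx_arch     -- e.g. "gfx1151" for an arch-specific variant, None for generic
    """
    if gfx_arch is None:
        # Generic package: drop gfxarch deps; the gfx-specific variants carry them.
        return [d for d in deps if d not in gfxarch_pkgs]
    # gfx-specific package: generic counterpart plus arch-suffixed gfxarch deps.
    # The "-<arch>" suffix format is an assumption made for this sketch.
    arch_deps = [f"{d}-{gfx_arch}" for d in deps if d in gfxarch_pkgs]
    return [pkg_name] + arch_deps


# Usage with made-up package names:
deps = ["base-runtime", "math-lib", "tensor-lib"]
gfxarch = {"math-lib", "tensor-lib"}
generic_deps = multiarch_dependencies("solver-lib", deps, gfxarch)
gfx_deps = multiarch_dependencies("solver-lib", deps, gfxarch, "gfx1151")
```

  With these inputs the generic variant keeps only `base-runtime`, while the gfx1151 variant depends on `solver-lib` plus the two suffixed arch packages, so installing the arch variant pulls in the generic base transitively.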
  Changes to build_package.py:
  - Handle empty sourcedir_list gracefully in multi-arch mode for DEB packages
    * Return empty list instead of sys.exit() to allow build continuation
    * Log ERROR message for visibility
  - Add warning for RPM packages with empty sourcedir_list in multi-arch mode
    * RPM can create empty packages, so continue with warning
  - Track failed architecture variants in failed_pkglist
    * When a package fails for specific architecture (e.g., gfx1151),
      add variant name to failed list
    * Provides visibility into which architecture variants failed vs succeeded
  - Preserve backward compatibility for single-arch mode
    * Still exits on error when not in multi-arch mode
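
  A minimal sketch of the empty-`sourcedir_list` policy listed above, assuming a hypothetical helper `should_build_variant` and a module-level `failed_pkglist`; the real control flow in build_package.py may be structured differently.

```python
import logging
import sys

failed_pkglist = []  # names of package variants that failed (assumed structure)


def should_build_variant(sourcedir_list, variant_name, pkg_type, multi_arch):
    """Hypothetical helper: decide whether to continue building one variant."""
    if sourcedir_list:
        return True
    if not multi_arch:
        # Single-arch mode keeps the old behavior: abort the whole build.
        sys.exit(f"ERROR: no source directories found for {variant_name}")
    if pkg_type == "deb":
        # DEB: log an error, record the failed variant, and skip it.
        logging.error("Empty sourcedir_list for %s; skipping", variant_name)
        failed_pkglist.append(variant_name)
        return False
    # RPM can create empty packages, so continue with a warning.
    logging.warning("Empty sourcedir_list for %s; building empty RPM", variant_name)
    return True
```

  Recording the variant name (for example a name that includes `gfx1151`) rather than only the base package is what makes the build summary show which architectures failed.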
This commit introduces a new workflow for building multi-arch native Linux
packages (DEB/RPM) that consolidates binaries for all GPU families into
unified packages, along with supporting infrastructure improvements.

Workflow changes:
- Add multi_arch_build_native_linux_packages.yml reusable workflow
  - Fetches artifacts for all GPU families (gfx94X-dcgpu, gfx120X-all, etc.)
  - Builds unified DEB/RPM packages containing all architectures
  - Implements comprehensive S3 bucket selection logic with decision tree
  - Supports multiple bucket types: CI artifacts, release packages, internal
  - Determines appropriate IAM roles based on build context

build_package.py enhancements:
- Add normalize_target_list() function for flexible input parsing
  - Accepts semicolon, comma, or space-separated GPU family lists
  - Works seamlessly with existing --enable-multi-arch flag
  - Example: "gfx94X-dcgpu;gfx120X-all" or "gfx94X-dcgpu,gfx120X-all"
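
  For illustration, input parsing of this kind could look like the sketch below. It assumes the function receives either a single string or the list that a multi-value `--target` argument would produce; the actual signature of `normalize_target_list()` is not shown in this PR description.

```python
import re


def normalize_target_list(raw):
    """Sketch: split GPU family lists on semicolons, commas, or whitespace."""
    items = [raw] if isinstance(raw, str) else list(raw)
    targets = []
    for item in items:
        # Drop empty fragments produced by repeated or trailing separators.
        targets.extend(t for t in re.split(r"[;,\s]+", item) if t)
    return targets


families = normalize_target_list("gfx94X-dcgpu;gfx120X-all")
```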

upload_package_repo.py improvements:
- Add --s3-prefix parameter for explicit S3 prefix override
- Add "ci" job type to choices (dev/nightly/prerelease/ci)
- Make --amdgpu-family parameter optional (backward compatible)
- Implement S3 prefix logic:
  * Explicit --s3-prefix: use provided value
  * dev/nightly: <pkg_type>/<YYYYMMDD>-<artifact_id>
  * prerelease: v3/packages/<pkg_type>
  * ci: v3/packages/<pkg_type>/<YYYYMMDD>-<artifact_id>
- Maintain backward compatibility with existing callers
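
The prefix rules listed above can be sketched as a small function. The name `compute_s3_prefix` and its parameters are assumptions for illustration; only the resulting path shapes come from the commit message.

```python
def compute_s3_prefix(job_type, pkg_type, date, artifact_id, s3_prefix=None):
    """Sketch of the S3 prefix rules; names here are hypothetical."""
    if s3_prefix:
        # An explicit --s3-prefix always wins.
        return s3_prefix
    if job_type in ("dev", "nightly"):
        return f"{pkg_type}/{date}-{artifact_id}"
    if job_type == "prerelease":
        return f"v3/packages/{pkg_type}"
    if job_type == "ci":
        return f"v3/packages/{pkg_type}/{date}-{artifact_id}"
    raise ValueError(f"unsupported job type: {job_type}")


ci_prefix = compute_s3_prefix("ci", "deb", "20251203", "12345")
```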

S3 bucket strategy:
- therock-ci-artifacts: Default CI builds (ROCm/TheRock non-fork)
- therock-ci-artifacts-external: Fork PRs and external repositories
- therock-artifacts-internal: ROCm/therock-releases-internal builds
- therock-dev-packages: Dev release packages (release_type=dev)
- therock-nightly-packages: Nightly release packages (release_type=nightly)
- therock-release-packages: Official releases (release_type=release/prerelease)

IAM role mapping:
- CI builds (job_type=ci): arn:aws:iam::692859939525:role/therock-ci
- Release builds: arn:aws:iam::692859939525:role/therock-{release_type}
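
As a rough sketch, the bucket strategy and IAM role mapping above amount to a short decision tree. The function names, the `is_fork` parameter, and the order in which the conditions are checked are assumptions; a later commit in this PR also removes the ROCm/therock-releases-internal branch.

```python
PACKAGE_BUCKETS = {
    "dev": "therock-dev-packages",
    "nightly": "therock-nightly-packages",
    "prerelease": "therock-release-packages",
    "release": "therock-release-packages",
}


def select_bucket(release_type, repository, is_fork):
    """Sketch: pick a bucket; release types are checked first (assumed order)."""
    if release_type in PACKAGE_BUCKETS:
        return PACKAGE_BUCKETS[release_type]
    if repository == "ROCm/therock-releases-internal":
        return "therock-artifacts-internal"
    if repository == "ROCm/TheRock" and not is_fork:
        return "therock-ci-artifacts"
    # Fork PRs and external repositories.
    return "therock-ci-artifacts-external"


def iam_role(job_type):
    """Sketch: CI uses the shared role, releases use a per-type role."""
    if job_type == "ci":
        return "arn:aws:iam::692859939525:role/therock-ci"
    return f"arn:aws:iam::692859939525:role/therock-{job_type}"
```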
  Introduce helper functions to streamline dependency and name field
  processing across DEB and RPM package generation.

  Changes:
  - Add process_dependency_field() for Get -> Filter -> Transform pattern
    * Handles DEBDepends, DEBRecommends, DEBSuggests
    * Handles RPMRequires, RPMRecommends, RPMSuggests
    * Returns empty string for empty dependency lists (fixes IndexError)
    * Supports use_multiarch flag for main dependencies

  - Add process_name_field() for name fields (Provides, Conflicts, etc.)
    * Simplifies get -> transform -> join operations
    * Supports optional transform functions (e.g., debian_replace_devel_name)

  - Refactor generate_control_file() to use new helpers
  - Refactor generate_spec_file() to use new helpers

  Bug fixes:
  - Use boolean True instead of string "True" for disable_dh_strip
  - Fix typo: "buillds" -> "builds"
  - Remove duplicate failed package tracking that caused pkg_name to be
    added twice when default architecture variant failed
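
  The Get -> Filter -> Transform pattern mentioned above might look like the following sketch. The signature is hypothetical (the real `process_dependency_field()` also takes a `use_multiarch` flag that is not modeled here), and the field names are the ones listed in the commit message.

```python
def process_dependency_field(pkg_info, field, keep=None, transform=None):
    """Sketch of Get -> Filter -> Transform for one dependency field."""
    deps = pkg_info.get(field) or []                 # Get
    if keep is not None:
        deps = [d for d in deps if keep(d)]          # Filter
    if transform is not None:
        deps = [transform(d) for d in deps]          # Transform
    # Joining an empty list yields "", so no indexing into an empty list occurs.
    return ", ".join(deps)


pkg = {"DEBDepends": ["pkg-a", "pkg-b-devel"], "DEBSuggests": []}
depends = process_dependency_field(
    pkg, "DEBDepends", transform=lambda d: d.replace("-devel", "-dev")
)
suggests = process_dependency_field(pkg, "DEBSuggests")
```

  Returning an empty string for an empty or missing field lets the caller skip emitting that control or spec line entirely.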
- Revert multi_arch_ci_linux.yml to remove premature native package integration
- Add build_tools/packaging/linux/get_s3_config.py to replace inline bash logic
  - Determines S3 bucket, prefix, and job type based on release type and repository
  - Extracts date from ROCm package version for consistency between version and S3 path
  - Supports wheel, deb, and rpm package types
- Update multi_arch_build_native_linux_packages.yml:
  - Use get_s3_config.py script for S3 configuration
  - Optimize "Fetch Artifacts" step to eliminate bash loop
  - Convert semicolon-separated GPU families to comma-separated format
  - Pass full family list via --amdgpu-targets parameter
- Add comprehensive unit tests in build_tools/packaging/linux/test/get_s3_config_test.py
  - 23 tests covering all decision tree branches and date extraction logic
…check

- Add extract_date_from_version() function to parse dates from ROCm package versions
  - Supports Debian (8.1.0~dev20251203), RPM (8.1.0~20251203gf689a8e), and wheel (7.10.0a20251021) formats
  - Falls back to current date if no date pattern found
- Update determine_s3_config() to accept rocm_version parameter
  - Use extracted date for S3 path consistency with package version
  - Ensures rebuilding same version produces same S3 location
- Add --rocm-version CLI argument (optional, defaults to None)
- Remove obsolete ROCm/therock-releases-internal repository check
  - Logic is redundant as prerelease/release types are handled by release_type parameter
  - Simplifies decision tree from 4 branches to 3
- Update tests to match implementation (22 tests, all passing)

This fixes the workflow which was already passing --rocm-version but the script
did not accept it, causing failures.
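
A minimal sketch of date extraction for the three version formats listed above, assuming a single regular expression is enough and that the fallback uses the current UTC date. The `today` parameter is an addition for testability and is not part of the described function.

```python
import re
from datetime import datetime, timezone

# Eight digits that form a plausible YYYYMMDD date in the 2000s.
_DATE_RE = re.compile(r"20\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])")


def extract_date_from_version(version, today=None):
    """Sketch: return YYYYMMDD from a ROCm package version, else today's date."""
    match = _DATE_RE.search(version or "")
    if match:
        return match.group(0)
    # Fallback when no date pattern is found (UTC is an assumption).
    return (today or datetime.now(timezone.utc)).strftime("%Y%m%d")


deb_date = extract_date_from_version("8.1.0~dev20251203")
rpm_date = extract_date_from_version("8.1.0~20251203gf689a8e")
whl_date = extract_date_from_version("7.10.0a20251021")
```

Deriving the date from the version rather than the wall clock is what makes a rebuild of the same version land in the same S3 location.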
debugedit truncation issue affecting multi-arch builds.
@raramakr raramakr force-pushed the users/raramakr/multi-arch branch from e97a0d6 to 393c331 Compare March 27, 2026 00:10
ScottTodd added a commit that referenced this pull request Mar 27, 2026
## Motivation

As part of #3323, we're switching
ROCm packaging from being "single-arch [family]" to being "multi-arch".
This switches our default CI workflow from "CI" to "multi-arch CI" on
presubmit (`pull_request`), fixing
#3337.

> [!NOTE]
> The "ci.yml" workflow will still run on `push` events and opt-in on
`pull_request` to help migrate remaining features.

The "CI" pipelines run fully independent single-stage builds for each
GPU family (e.g. gfx110X-all, gfx94X-dcgpu, etc.) while the "Multi-Arch
CI" pipelines run a single build (per Windows/Linux platform) that has a
graph of generic and target-specific build stages. We've been running
these workflows in parallel while getting multi-arch functional. Now,
enough features have been implemented (building artifacts, testing
artifacts, building python packages, etc.) that switching is possible.
Keeping multi-arch CI as opt-in / postsubmit only has been leading to
increased feature drift and slower progress on multi-arch support work,
so we want to switch as soon as possible.

> [!NOTE]
> The multi-arch workflows are still building with
`-DTHEROCK_FLAG_KPACK_SPLIT_ARTIFACTS=OFF`, see
#3338.

## Technical Details

### Triggering changes to workflows

workflow | event | behavior before | behavior after
-- | -- | -- | --
multi_arch_ci.yml | `pull_request` | opt-in | ⚠️ always runs
multi_arch_ci.yml | `push` | always runs | always runs
ci.yml | `pull_request` | always runs | ⚠️ opt-in
ci.yml | `push` | always runs | always runs
ci_nightly.yml | `schedule` | runs | runs
ci_asan.yml | `schedule` | runs | runs
ci_tsan.yml | `schedule` | runs | runs

### Feature parity between workflows

We've roughly achieved feature parity between Multi-arch CI (new) and CI
(old). These features are missing:

Feature | Supported in ci.yml? | Supported in multi_arch_ci.yml? | Notes
-- | -- | -- | --
gfx950-dcgpu testing | Yes | ⚠️ No | #3288
gfx950-dcgpu pytorch wheel build | Yes | ⚠️ No | #3288
`run_functional_tests` input for test_artifacts.yml | Yes | ⚠️ No |
`test_linux_benchmarks` job | Yes | ⚠️ No |
`test_python_packages_per_family` on Windows | Yes | ⚠️ No |
"Build summary" with links to logs/artifacts | Yes | ⚠️ No |
`build_native_linux_packages` job | Yes | ⚠️ No | #3561
`resource_info.py` and `analyze_build_times.py` | Yes | ⚠️ No |

## Test Plan and Results

- [x] Check that CI _does not_ run on this PR by default
- [x] Check that multi-arch CI _does_ run on this PR by default
- [x] Create and add new CI opt-in label, check that adding it runs CI
- [ ] Continue to monitor workflow status on `push` and on PRs created
after this for issues

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
@amd-aakash amd-aakash self-requested a review March 31, 2026 03:01
Contributor

@amd-aakash amd-aakash left a comment

Please cross check if the failing CI checks are known issues.

chiranjeevipattigidi pushed a commit that referenced this pull request Mar 31, 2026
@raramakr raramakr merged commit c590e1d into main Mar 31, 2026
98 of 101 checks passed
@raramakr raramakr deleted the users/raramakr/multi-arch branch March 31, 2026 15:58
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Mar 31, 2026
@raramakr
Contributor Author

> Please cross check if the failing CI checks are known issues.

The two failing CI checks are unrelated to this PR.

Comment on lines +100 to +131
      - name: Determine IAM role
        id: iam_role
        run: |
          # ================================================================
          # IAM Role Selection Logic
          # ================================================================
          # Determines which AWS IAM role to assume based on job_type from s3_config step.
          #
          # Role Mapping:
          #   ├─ IF job_type == "ci"
          #   │    └─ Use: therock-ci role
          #   │       (For all CI buckets: therock-ci-artifacts, therock-ci-artifacts-external, therock-artifacts-internal)
          #   │
          #   └─ ELSE (job_type == dev/nightly/prerelease/release)
          #        └─ Use: therock-${job_type} role
          #           (For package buckets: therock-dev-packages, therock-nightly-packages, etc.)
          #
          # ================================================================

          JOB_TYPE="${{ steps.s3_config.outputs.job_type }}"

          if [[ "${JOB_TYPE}" == "ci" ]]; then
            # CI builds use the shared CI role (for all artifact buckets)
            IAM_ROLE="arn:aws:iam::692859939525:role/therock-ci"
            echo "✓ Using CI role: ${IAM_ROLE}"
          else
            # Release builds use release-type-specific roles (for package buckets)
            IAM_ROLE="arn:aws:iam::692859939525:role/therock-${JOB_TYPE}"
            echo "✓ Using release-type role: ${IAM_ROLE}"
          fi

          echo "iam_role=${IAM_ROLE}" >> $GITHUB_OUTPUT
Member

FYI. I'm expecting to refactor this as part of standing up multi-arch releases (#3334). I think we can move this into the setup job and then plumb it through to this and other jobs via workflow inputs, rather than recompute in each job that needs to know a bucket and IAM role for that bucket

Contributor

@ScottTodd I have made some more changes to the S3 config as part of #4310. Let me know if you have any comments.

Member

I sent #4386. My focus is on the core ROCm release pipelines first; after that I can look more closely at what the native packages are doing.

nunnikri added a commit that referenced this pull request Apr 25, 2026
## Multi-arch native package workflow improvements

### `multi_arch_build_native_linux_packages.yml`

- **Switch artifact fetching to `artifact_manager.py`**: Replaces
`fetch_artifacts.py` with `artifact_manager.py fetch` for consistent
multi-arch artifact fetching across all GPU families
- **Fix system requirements**: Use `llvm-20` instead of `llvm`; add
`pyzstd` for `.tar.zst` artifact decompression
- **Add `package_repository_url` output**: Workflow now exposes the
public S3 install URL as an output for downstream consumption (e.g.,
install test jobs)
- **Add AWS credential guard**: `Configure AWS Credentials` step now
only runs for authorized repositories (`ROCm/TheRock`, `ROCm/rockrel`),
skipped for forks

### `docs/packaging/native_packaging.md`

- Add S3 bucket/prefix/URL reference tables for all release types (dev,
nightly, prerelease, release, CI)
- Separate tables for GFX Specific Packages and Multi-Arch Packages


## Motivation

Update as per the review comments on
#3561

---------

Co-authored-by: ArvindCheru <Aravindan.Cheruvally@amd.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
