add jax core test#2035

Closed
kiran-thumma wants to merge 22 commits into main from users/kithumma/jax-whl-coretest

Conversation

@kiran-thumma (Contributor) commented Nov 6, 2025

Motivation

  • Add an automated Core GPU validation test that verifies built JAX wheel (.whl) files are functional.
  • We currently lack automated wheel-level GPU validation. This test will catch regressions in packaging, compatibility issues, missing binaries, or other wheel-level failures before wheels are promoted.
  • This gives faster feedback and reduces the risk of shipping broken GPU wheels.

Technical Details

  • New workflow: test_linux_jax_wheels.yml
  • Test type: Core tests
  • Purpose: Install the built JAX wheels in a Docker container and run tests.
  • Diagram: (image: jax-test-workflow-11-06 024529)

Test Plan

  • Local Verification
  • CI Verification

Test Result

  • Workflow run:
  • ENV Tested:
  • Output Summary:

Submission Checklist

@kiran-thumma marked this pull request as ready for review November 10, 2025 16:46
Comment thread .github/workflows/test_linux_jax_wheels.yml Outdated
Comment on lines +115 to +123
- name: Prepare wheelhouse and download JAX wheels
  shell: bash
  working-directory: jax
  run: |
    python3 ../build_tools/fetch_wheels.py \
      --cloudfront_url ${{ inputs.package_index_url }} \
      --amdgpu_family ${{ inputs.amdgpu_family }} \
      --dir "wheelhouse" \
      --list_whls '${{ inputs.jax_whl_list }}'

Member

What is this actually doing with the downloaded wheels? Is the python3 build/ci_build script using this "wheelhouse" directory?

Can this use setup_venv.py as we do already in the pytorch test workflow?

- name: Set up virtual environment
  run: |
    python build_tools/setup_venv.py ${VENV_DIR} \
      --packages torch==${TORCH_VERSION} \
      --index-url ${{ inputs.package_index_url }} \
      --index-subdir ${{ inputs.amdgpu_family }} \
      --activate-in-future-github-actions-steps

Here are some logs of that script running in the workflow: https://github.com/ROCm/TheRock/actions/runs/19228713043/job/54971284725#step:7:22

Contributor Author

No, we cannot use setup_venv.py, because using a venv would require changing all of the Docker build steps. This works the same way, just with a wheelhouse directory.

Member

Why is Docker a requirement? I still haven't seen a clear answer to that. What are the user install instructions like? Are they more than pip install jax --index-url=...? Tests should match what users will do, and we can't expect users to download files manually then build a dockerfile.

Contributor Author

Yes, it does have other requirements that need to be configured, and not all of the code is in TheRock, since this workflow is a wrapper around rocm-jax and the upstream code. I have updated the script as suggested to use requests instead of wget.

Contributor

@ScottTodd It's not a requirement for running tests. This is just the way that we currently do it in our CI because some Ubuntu docker images are part of our deliverables. We do everything through the build/ci_build script, and the script assumes that you want to do the build in three stages:

  1. Build the wheels
  2. Build the docker images
  3. Test using the docker images

There isn't a step in our CI that installs the wheels (and requirements like ROCm or pytest) with pip and then runs the tests. You can do this, of course, but we just don't have an easy command for it in build/ci_build because we're taking care of that in step 3 of our CI. We could probably add a command that will do that to build/ci_build though, or you could try and do that right in the workflow.
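
For discussion, the pip-based path described above (install the wheels and test requirements with pip, then run the tests) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a command that exists in build/ci_build: the helper names, the package pins, and the test path are hypothetical, and the "<index_url>/<amdgpu_family>/" index layout is borrowed from the PyTorch workflow quoted earlier in this thread.

```python
import sys


def pip_install_command(packages, index_url, index_subdir, python=sys.executable):
    """Build a pip command that installs packages from a per-GPU-family index."""
    # Assumption: the index is laid out as "<index_url>/<amdgpu_family>/",
    # mirroring the PyTorch workflow quoted in this thread.
    full_index = f"{index_url.rstrip('/')}/{index_subdir.strip('/')}/"
    return [python, "-m", "pip", "install", "--index-url", full_index, *packages]


def pytest_command(test_paths, python=sys.executable):
    """Build a pytest command for the given test files or directories."""
    return [python, "-m", "pytest", "-v", *test_paths]


if __name__ == "__main__":
    # Hypothetical values: the index URL, GPU family, package pin, and test
    # path are placeholders, not taken from the PR.
    install = pip_install_command(
        ["jax==0.0.0"], "https://example.invalid/v2", "gfx-family"
    )
    tests = pytest_command(["tests/core_test.py"])
    # In a workflow these would be run with subprocess.run(cmd, check=True).
    print(" ".join(install))
    print(" ".join(tests))
```

Returning the commands as argument lists keeps the sketch easy to inspect and avoids shell quoting issues if they are later passed to subprocess.run.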

Member

I don't think this script should be needed, see my other comment.

If we do keep this script, some high level comments:

Contributor Author

We will need this script; I will update it as per your comments.

Comment on lines +151 to +156
- name: jax wheel list
  id: jax_wheel_list
  run: |
    export jax_whl_list="$(python3 ./build_tools/mapping_built_jax_wheels.py --dir "${{ env.PACKAGE_DIST_DIR }}")"
    echo "jax_whl_list=${jax_whl_list}" >> "$GITHUB_OUTPUT"
    echo "${jax_whl_list}" >> "$GITHUB_STEP_SUMMARY"

Member

Please follow the style of https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/write_torch_versions.py here with a build_tools/github_actions/write_jax_versions.py

Usage:

  • - name: Build PyTorch Wheels
      id: build-pytorch-wheels
      run: |
        echo "Building PyTorch wheels for ${{ inputs.amdgpu_family }}"
        ./external-builds/pytorch/build_prod_wheels.py \
          build \
          --install-rocm \
          --pip-cache-dir /tmp/pipcache \
          --index-url "${{ inputs.cloudfront_url }}/${{ inputs.amdgpu_family }}/" \
          --clean \
          --output-dir ${{ env.PACKAGE_DIST_DIR }} ${{ env.optional_build_prod_arguments }}
        python ./build_tools/github_actions/write_torch_versions.py --dist-dir ${{ env.PACKAGE_DIST_DIR }}
  • outputs:
      cp_version: ${{ env.cp_version }}
      torch_version: ${{ steps.build-pytorch-wheels.outputs.torch_version }}
      torchaudio_version: ${{ steps.build-pytorch-wheels.outputs.torchaudio_version }}
      torchvision_version: ${{ steps.build-pytorch-wheels.outputs.torchvision_version }}
      triton_version: ${{ steps.build-pytorch-wheels.outputs.triton_version }}
  • (and other parts of that file)
  • env:
      VENV_DIR: ${{ github.workspace }}/.venv
      TORCH_VERSION: ${{ inputs.torch_version }}
  • - name: Set up virtual environment
      run: |
        python build_tools/setup_venv.py ${VENV_DIR} \
          --packages torch==${TORCH_VERSION} \
          --index-url ${{ inputs.package_index_url }} \
          --index-subdir ${{ inputs.amdgpu_family }} \
          --activate-in-future-github-actions-steps

Note that what is computed and passed through the workflows is the versions of the files, not their filenames. The pip commands then use those versions to download or install the matching wheels (as users would do outside of GitHub Actions workflows).
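
As an illustration of passing versions rather than filenames, a script in the spirit of the suggested write_jax_versions.py might look roughly like this. It is a sketch only: the function names and the `<name>_version` output keys are assumptions, the wheel names in the demo are made up, and the real write_torch_versions.py may be structured differently. It relies only on the standard wheel filename convention, `{distribution}-{version}-...whl`, and on the `GITHUB_OUTPUT` file that GitHub Actions provides to steps.

```python
import os
import tempfile
from pathlib import Path


def parse_wheel_versions(dist_dir):
    """Map each wheel's distribution name to its version string."""
    versions = {}
    for wheel in sorted(Path(dist_dir).glob("*.whl")):
        # Wheel filenames start with "{distribution}-{version}-"; the
        # distribution part never contains "-" (it is normalized to "_").
        name, version = wheel.name.split("-")[:2]
        versions[name] = version
    return versions


def write_github_output(versions, output_path=None):
    """Append "<name>_version=<version>" lines to the GITHUB_OUTPUT file."""
    output_path = output_path or os.environ.get("GITHUB_OUTPUT")
    lines = [f"{name}_version={version}" for name, version in versions.items()]
    if output_path:  # Outside GitHub Actions the variable is usually unset.
        with open(output_path, "a", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Demo with made-up wheel names in a temporary directory.
    with tempfile.TemporaryDirectory() as dist_dir:
        Path(dist_dir, "jax-0.0.1-py3-none-any.whl").touch()
        for line in write_github_output(parse_wheel_versions(dist_dir)):
            print(line)
```

A later test job could then install with a pin such as `jax==<jax_version>` from the package index instead of downloading specific files.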

Contributor Author

I will consider adding logging or a summary later as an enhancement, since the logs are already available.

Comment on lines +125 to +134
- name: Compute ROCM_VERSION_SHORT
  working-directory: jax
  env:
    ROCM_VERSION: ${{ inputs.rocm_version }}
  run: |
    # Extract major.minor.patch from ROCM_VERSION
    ROCM_VERSION_SHORT="$(echo "${ROCM_VERSION}" | sed -E 's/^([0-9]+)\.([0-9]+)\.([0-9]+).*/\1.\2.\3/')"
    echo "ROCM_VERSION_SHORT=${ROCM_VERSION_SHORT}" >> "$GITHUB_ENV"
    echo "Full ROCm version: ${ROCM_VERSION}"
    echo "Using semantic ROCm version: ${ROCM_VERSION_SHORT}"

Member

This code is suspicious and I don't think it should be needed. At a minimum there should be a comment here explaining what problem this code is solving.

Contributor Author

If I pass the full ROCm version, Docker fails to add it as a tag. I have added more comments.
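
To make the problem concrete, here is a small Python sketch of the same major.minor.patch extraction together with a check against Docker's tag grammar (letters, digits, underscores, periods, and hyphens; no leading period or hyphen; at most 128 characters). The function names and the example version strings are hypothetical; the PR does not show which full version string caused the failure. Unlike the sed expression, which passes unmatched input through unchanged, this sketch raises an error.

```python
import re

# Docker tag grammar: one word character, then up to 127 characters drawn
# from letters, digits, "_", "." and "-".
_DOCKER_TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def short_rocm_version(full_version):
    """Return the leading major.minor.patch part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", full_version)
    if match is None:
        raise ValueError(f"no major.minor.patch prefix in {full_version!r}")
    return ".".join(match.groups())


def is_valid_docker_tag(tag):
    """Check a candidate tag against Docker's tag grammar."""
    return _DOCKER_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    # Hypothetical version string; "+" is one character Docker tags reject.
    full = "7.0.0+build123"
    print(is_valid_docker_tag(full))
    print(short_rocm_version(full), is_valid_docker_tag(short_rocm_version(full)))
```

A check like this in the workflow would document why the shortened version is needed, which is the explanation the review asks for.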

Member

There is no need to compute a tag. Rather, spin up a Docker container, pip install the JAX wheels, run the tests, and that's it. No need to build a Docker container.

Contributor Author

Either way it's the same: the goal is to test the wheels. These are the tests that the JAX team currently supports.

Member

Looks like JAX has these docs, which do suggest running in a container: https://github.com/ROCm/rocm-jax/blob/master/BUILDING.md#3-running-tests, but I'm not seeing a technical explanation there for why tests are difficult to run outside of a container. We are strongly discouraging Docker and container usage in new packaging in TheRock and this is the time to investigate those details.

Again, and I'm getting frustrated repeating myself - this type of work needs to start with a design discussion, before we get into details during code review. We've wasted multiple weeks now going back and forth without seeing a plan written down. At a minimum provide links to pages like these so we can discuss the design points - that is your job as the PR author and not something reviewers should need to research on their own:

Contributor

They're not terribly difficult, it just requires the right dependencies. Building the image and then running with that is just what we do in our current CI setup. See my other comment. I wholeheartedly agree that doing some design work and getting clear on how TheRock expects frameworks to build and test is much needed, and I'd be happy to discuss it somewhere that's not a PR comment thread.

Comment on lines +136 to +149
- name: Build JAX Docker image
  working-directory: jax
  env:
    ROCM_VERSION: ${{ inputs.rocm_version }}
  run: |
    python3 build/ci_build \
      --rocm-version="${ROCM_VERSION_SHORT}" \
      --therock-path="${{ inputs.tar_url }}" \
      build_dockers \
      -f ubu24
    # Assign JAX_IMAGE to the expected image name produced by the build script
    JAX_IMAGE="jax-ubu24.rocm$(echo "${ROCM_VERSION_SHORT}" | tr -d '.'):latest"
    echo "JAX_IMAGE=${JAX_IMAGE}" >> "$GITHUB_ENV"
    echo "Built JAX Docker image: ${JAX_IMAGE}"

Member

Why is a Docker image needed here?

If keeping the docker image, a tag other than :latest is probably called for (I assume this doesn't get uploaded anywhere but still safer to not have a test workflow directly set the tag to "latest")
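
One way to avoid ":latest", sketched here under assumptions, is to derive a per-run tag from the `GITHUB_RUN_ID` variable that GitHub Actions sets. The `ci-` prefix and the helper name are hypothetical, and whether build/ci_build can emit a custom tag is not shown in this PR, so the image it produces might need to be retagged with `docker tag` afterwards.

```python
import os


def ci_image_ref(rocm_version_short, run_id=None):
    """Build an image reference with a per-run tag instead of ":latest"."""
    # Fall back to "local" when not running under GitHub Actions.
    run_id = run_id or os.environ.get("GITHUB_RUN_ID", "local")
    # Same repository naming as the workflow step: dots are stripped
    # from the short ROCm version.
    repository = "jax-ubu24.rocm" + rocm_version_short.replace(".", "")
    return f"{repository}:ci-{run_id}"


if __name__ == "__main__":
    print(ci_image_ref("7.0.0"))
```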

Contributor Author

The core_tests and other GPU tests are run using the Docker image. Whether or not the image is tagged latest should not make any difference, as it runs inside a pod that is decommissioned after the workflow finishes.

Member

Why does this need to use a custom Docker image? I strongly suggest following the style used for testing torch wheels in TheRock. There, a Docker image is only used to guarantee a ROCm-free environment. Packages to test should then be installed via pip. See https://github.com/ROCm/TheRock/blob/main/.github/workflows/test_pytorch_wheels.yml

Contributor Author

This is not a custom Docker image. It's the same Dockerfile that upstream JAX supports, with the same tests that we are running. The tests currently run on a Docker image. I can open a new issue to enhance this to work the same way as torch. Most of the tests only support running from a Docker image, even the performance and accuracy tests for JAX-MAX training. Even for PyTorch, we have to build a Docker image to support PyTorch performance training, which is PRIMUS.

@marbre (Member) left a comment

I think we're hitting the same issue as with the previous PR. This has a diagram but doesn't argue why it needs to use Docker the way it does. I really think this needs to follow the workflow style used for testing torch wheels.

@kiran-thumma (Contributor Author) commented

> I think we're hitting the same issue as with the previous PR. This has a diagram but doesn't argue why it needs to use Docker the way it does. I really think this needs to follow the workflow style used for testing torch wheels.

I have explained the same thing as in the previous PR. The tests that the JAX team currently supports are based on a Docker image, and we have to use the same Dockerfile that comes from upstream JAX. That's the reason I added a flow diagram. Please check the flow diagram. That's the flow; if you need changes, we have to check with upstream JAX and the JAX team.

@kiran-thumma (Contributor Author) commented

Latest run: https://github.com/ROCm/TheRock/actions/runs/19287934450

As per our discussion, I created a new issue for this enhancement, which is planned for the next release: #2124

@kiran-thumma (Contributor Author) commented

Closing this PR as it has been reworked. As suggested, I started a new PR with the changes: #2247

@github-project-automation (Bot) moved this from TODO to Done in TheRock Triage on Nov 26, 2025
