add jax core test#2035
Conversation
| - name: Prepare wheelhouse and download JAX wheels | ||
| shell: bash | ||
| working-directory: jax | ||
| run: | | ||
| python3 ../build_tools/fetch_wheels.py \ | ||
| --cloudfront_url ${{ inputs.package_index_url }} \ | ||
| --amdgpu_family ${{ inputs.amdgpu_family }} \ | ||
| --dir "wheelhouse" \ | ||
| --list_whls '${{ inputs.jax_whl_list }}' |
There was a problem hiding this comment.
What is this actually doing with the downloaded wheels? Is the python3 build/ci_build script using this "wheelhouse" directory?
Can this use setup_venv.py as we do already in the pytorch test workflow?
TheRock/.github/workflows/test_pytorch_wheels.yml
Lines 112 to 118 in 2e9e168
Here are some logs of that script running in the workflow: https://github.com/ROCm/TheRock/actions/runs/19228713043/job/54971284725#step:7:22
There was a problem hiding this comment.
No, we cannot use a setup_venv.py. As we have to change the docker build steps all that if we have to use a venv. This is the same way just a wheelhouse directory.
There was a problem hiding this comment.
Why is Docker a requirement? I still haven't seen a clear answer to that. What are the user install instructions like? Are they more than pip install jax --index-url=...? Tests should match what users will do, and we can't expect users to download files manually then build a dockerfile.
There was a problem hiding this comment.
Yes, it does has other requriements to be configured and not the whole code is in theROCk as we are using this workflow as a wrapper to the rocm-jax and upstream code. I have update the script as per suggestion to use requests instead of wget.
There was a problem hiding this comment.
@ScottTodd It's not a requirement for running tests. This is just the way that we currently do it in our CI because some Ubuntu docker images are part of our deliverables. We do everything through the build/ci_build script, and the script assumes that you want to do the build in three stages:
- Build the wheels
- Build the docker images
- Test using the docker images
There isn't a step in our CI that installs the wheels (and requirements like ROCm or pytest) with pip and then runs the tests. You can do this, of course, but we just don't have an easy command for it in build/ci_build because we're taking care of that in step 3 of our CI. We could probably add a command that will do that to build/ci_build though, or you could try and do that right in the workflow.
There was a problem hiding this comment.
I don't think this script should be needed, see my other comment.
If we do keep this script, some high level comments:
- For downloading files, I'd suggest using either
pip download(https://pip.pypa.io/en/stable/cli/pip_download/) orrequests(https://pypi.org/project/requests/), instead ofsubprocess.runwithwget(we also have helperexecfunctions that set better defaults forsubprocess) - Use
pathlibinstead ofosand raw strings, consistently throughout
There was a problem hiding this comment.
We will need this script and will update as per your comments.
| - name: jax wheel list | ||
| id: jax_wheel_list | ||
| run: | | ||
| export jax_whl_list="$(python3 ./build_tools/mapping_built_jax_wheels.py --dir "${{ env.PACKAGE_DIST_DIR }}")" | ||
| echo "jax_whl_list=${jax_whl_list}" >> "$GITHUB_OUTPUT" | ||
| echo "${jax_whl_list}" >> "$GITHUB_STEP_SUMMARY" |
There was a problem hiding this comment.
Please follow the style of https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/write_torch_versions.py here with a build_tools/github_actions/write_jax_versions.py
Usage:
TheRock/.github/workflows/build_portable_linux_pytorch_wheels.yml
Lines 177 to 188 in 2e9e168
TheRock/.github/workflows/build_portable_linux_pytorch_wheels.yml
Lines 117 to 122 in 2e9e168
- (and other parts of that file)
TheRock/.github/workflows/test_pytorch_wheels.yml
Lines 80 to 82 in 2e9e168
TheRock/.github/workflows/test_pytorch_wheels.yml
Lines 112 to 118 in 2e9e168
Note that the versions of the files are what is computed and passed through the workflows, not their filenames. Then pip commands use the versions to download or install wheels with those versions (as users would do outside of github actions workflows)
There was a problem hiding this comment.
I will think of adding a logging or summary later as part of enhancements as it already has the logs.
| - name: Compute ROCM_VERSION_SHORT | ||
| working-directory: jax | ||
| env: | ||
| ROCM_VERSION: ${{ inputs.rocm_version }} | ||
| run: | | ||
| # Extract major.minor.patch from ROCM_VERSION | ||
| ROCM_VERSION_SHORT="$(echo "${ROCM_VERSION}" | sed -E 's/^([0-9]+)\.([0-9]+)\.([0-9]+).*/\1.\2.\3/')" | ||
| echo "ROCM_VERSION_SHORT=${ROCM_VERSION_SHORT}" >> "$GITHUB_ENV" | ||
| echo "Full ROCm version: ${ROCM_VERSION}" | ||
| echo "Using semantic ROCm version: ${ROCM_VERSION_SHORT}" |
There was a problem hiding this comment.
This code is suspicious and I don't think it should be needed. At a minimum there should be a comment here explaining what problem this code is solving.
There was a problem hiding this comment.
If I'm passing the full rocmversion docker fails to add that as a tag. Added more comments.
There was a problem hiding this comment.
There is no need to compute a tag. Rather spin up a docker, pip install the jax wheels, run tests and that's it. No need to build a docker container.
There was a problem hiding this comment.
Either way its the same. Goal is to test the wheels. The tests currently which jax team is supporting are these tests.
There was a problem hiding this comment.
Looks like JAX has these docs, which do suggest running in a container: https://github.com/ROCm/rocm-jax/blob/master/BUILDING.md#3-running-tests, but I'm not seeing a technical explanation there for why tests are difficult to run outside of a container. We are strongly discouraging Docker and container usage in new packaging in TheRock and this is the time to investigate those details.
Again, and I'm getting frustrated repeating myself - this type of work needs to start with a design discussion, before we get into details during code review. We've wasted multiple weeks now going back and forth without seeing a plan written down. At a minimum provide links to pages like these so we can discuss the design points - that is your job as the PR author and not something reviewers should need to research on their own:
- https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/jax-install.html (in contrast to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#test-the-pytorch-installation)
- https://github.com/ROCm/rocm-jax/blob/master/BUILDING.md#3-running-tests
There was a problem hiding this comment.
They're not terribly difficult, it just requires the right dependencies. Building the image and then running with that is just what we do in our current CI setup. See my other comment. I wholeheartedly agree that doing some design work and getting clear on how TheRock expects frameworks to build and test is much needed, and I'd be happy to discuss it somewhere that's not a PR comment thread.
| - name: Build JAX Docker image | ||
| working-directory: jax | ||
| env: | ||
| ROCM_VERSION: ${{ inputs.rocm_version }} | ||
| run: | | ||
| python3 build/ci_build \ | ||
| --rocm-version="${ROCM_VERSION_SHORT}" \ | ||
| --therock-path="${{ inputs.tar_url }}" \ | ||
| build_dockers \ | ||
| -f ubu24 | ||
| # Assign JAX_IMAGE to the expected image name produced by the build script | ||
| JAX_IMAGE="jax-ubu24.rocm$(echo "${ROCM_VERSION_SHORT}" | tr -d '.'):latest" | ||
| echo "JAX_IMAGE=${JAX_IMAGE}" >> "$GITHUB_ENV" | ||
| echo "Built JAX Docker image: ${JAX_IMAGE}" |
There was a problem hiding this comment.
Why is a Docker image needed here?
If keeping the docker image, a tag other than :latest is probably called for (I assume this doesn't get uploaded anywhere but still safer to not have a test workflow directly set the tag to "latest")
There was a problem hiding this comment.
The core_tests and other GPU tests are ran using the docker image. Even if we test the docker image to latest or not the latest, it should be any difference as its running inside a pod and will be decommissioned after this workflow is completed running.
There was a problem hiding this comment.
Why does this need to use a custom Docker image? I strongly suggest to follow the style used in testing torch wheels in TheRock. There a Docker image is only used to guarantee having a ROCm-free environment. Packages to test should than be installed via pip. See https://github.com/ROCm/TheRock/blob/main/.github/workflows/test_pytorch_wheels.yml
There was a problem hiding this comment.
This is not a custom docker image. Its the same docker file which jax upstream supports and the tests which we are running. The tests currently are running on a docker image based. I can have a new issue which is to enhance it the same way as torch. Most of the tests support only docker image based even the performance and accuracy tests. for JAX-MAX training. Even for the pytorch we have to build the docker image to support performance training pytorch which is PRIMUS.
marbre
left a comment
There was a problem hiding this comment.
I think we're hitting the same issue as with the previous PR. This has a diagram but doesn't argue why it needs to use Docker the way it does. I really think this needs to follow the workflow style used for testing torch wheels.
| - name: Build JAX Docker image | ||
| working-directory: jax | ||
| env: | ||
| ROCM_VERSION: ${{ inputs.rocm_version }} | ||
| run: | | ||
| python3 build/ci_build \ | ||
| --rocm-version="${ROCM_VERSION_SHORT}" \ | ||
| --therock-path="${{ inputs.tar_url }}" \ | ||
| build_dockers \ | ||
| -f ubu24 | ||
| # Assign JAX_IMAGE to the expected image name produced by the build script | ||
| JAX_IMAGE="jax-ubu24.rocm$(echo "${ROCM_VERSION_SHORT}" | tr -d '.'):latest" | ||
| echo "JAX_IMAGE=${JAX_IMAGE}" >> "$GITHUB_ENV" | ||
| echo "Built JAX Docker image: ${JAX_IMAGE}" |
There was a problem hiding this comment.
Why does this need to use a custom Docker image? I strongly suggest to follow the style used in testing torch wheels in TheRock. There a Docker image is only used to guarantee having a ROCm-free environment. Packages to test should than be installed via pip. See https://github.com/ROCm/TheRock/blob/main/.github/workflows/test_pytorch_wheels.yml
| - name: Compute ROCM_VERSION_SHORT | ||
| working-directory: jax | ||
| env: | ||
| ROCM_VERSION: ${{ inputs.rocm_version }} | ||
| run: | | ||
| # Extract major.minor.patch from ROCM_VERSION | ||
| ROCM_VERSION_SHORT="$(echo "${ROCM_VERSION}" | sed -E 's/^([0-9]+)\.([0-9]+)\.([0-9]+).*/\1.\2.\3/')" | ||
| echo "ROCM_VERSION_SHORT=${ROCM_VERSION_SHORT}" >> "$GITHUB_ENV" | ||
| echo "Full ROCm version: ${ROCM_VERSION}" | ||
| echo "Using semantic ROCm version: ${ROCM_VERSION_SHORT}" |
There was a problem hiding this comment.
There is no need to compute a tag. Rather spin up a docker, pip install the jax wheels, run tests and that's it. No need to build a docker container.
I have explained the same thing as the previous PR. To run the tests which currently JAX team supports are based on a docker image. And the we have to use the same dockerfile which is from upstream of JAX. Thats the reason I have added a flow diagram. Please check the flow diagram. Thats the flow if you need changes we have to check with the upstream jax and jax team. |
|
Latest Run : https://github.com/ROCm/TheRock/actions/runs/19287934450 As per our Discussion created a new issue for enhancement which will be by next release. #2124 |
|
Closing this PR as its been reworked and as suggested started a new PR with changes #2247 |
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist
Core tests pass on at least one GPU configuration in CI
Workflow has a manual dispatch input for testing arbitrary wheel URLs
Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.