Add workflow for JAX wheel builds#1033
Conversation
|
Need to merge #1033 before this one |
|
Sorry. Need to merge ROCm/rocm-jax#58 first. |
|
@ScottTodd - any feedback on this? |
| @@ -0,0 +1,117 @@ | |||
| name: Build Linux PyTorch Wheels | |||
There was a problem hiding this comment.
Update workflow name to match the file name
| name: Build Linux PyTorch Wheels | |
| name: Build Linux JAX Wheels |
Also, are these builds "portable", e.g. by building under a manylinux docker container? I recently renamed some workflows to show that more clearly: #1023. It doesn't look like it from a read of https://github.com/ROCm/rocm-jax/blob/master/build/ci_build, so are these Linux packages only runnable on Ubuntu (or whichever distro is used for the build)?
There was a problem hiding this comment.
Yes, these are built under a manylinux wheel. For a number of reasons, our build scripts are a little oniony. But we create a manylinux image with ROCm here, and then do the build in that: https://github.com/ROCm/rocm-jax/blob/master/jax_rocm_plugin/build/rocm/ci_build#L57
| - name: Checkout TheRock | ||
| uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 | ||
| with: | ||
| path: TheRock |
There was a problem hiding this comment.
I'd prefer to omit path for the workflow's own repository and instead add path as needed for other repositories like rocm/rocm-jax above.
There was a problem hiding this comment.
I flipped this around. I just want to point out that the only thing we use from checked-out TheRock is the ./build_tools/third_party/s3_management/manage.py. We use the tarball to actually do the build.
| - name: Build JAX Wheels | ||
| run: | | ||
| cd rocm-jax | ||
| python3 build/ci_build \ |
There was a problem hiding this comment.
It might make sense to move this script over to TheRock, similar to what we have for PyTorch. Wdyt?
There was a problem hiding this comment.
No. This is the script that all CI teams use to build JAX. TheRock aren't the only users. DevOps team uses this when they build stuff against regular ROCm, and we use this in our CI on the JAX team.
There was a problem hiding this comment.
It might make more sense then to checkout TheRock in rocm-jax, similar to what is done in rocm-libraries?
There was a problem hiding this comment.
You guys wanted the opposite above, so I changed it. I can change it back if you want?
When this workflow checks out TheRock, all it uses is ./build_tools/third_party/s3_management/manage.py. The actual build uses the tarball that you give it.
There was a problem hiding this comment.
You guys wanted the opposite above, so I changed it. I can change it back if you want?
I have not been involved in all prior discussions and might miss context here.
If it was agreed on earlier that this might be okay as is.
|
How close are we to getting this landed? Any details we can defer until we get it in and some runs under our belt? |
I've addressed everyone's comments on this, I think. Just waiting on an approval. |
|
Ah, please click "re-request review" (if you can, otherwise just ping with |
|
Yeah, I'm not seeing the "re-request review" option where it normally is. But it's ready for another round @marbre. |
There was a problem hiding this comment.
@marbre any concerns with using the same S3 bucket, subdir, etc. for JAX wheels?
- We may need to add more packages to our index pages in the https://github.com/ROCm/TheRock/tree/main/build_tools/third_party/s3_management scripts.
- If the build process outputs (or depends on) wheels other than JAX, is there a chance that they could conflict with what the PyTorch build produces?
There was a problem hiding this comment.
@marbre any concerns with using the same S3 bucket, subdir, etc. for JAX wheels?
I am still undecided. We will move creating regular releases (and writing to the nightly releases bucket) to another repo but this does does not necessarily conflict with the proposed changes. What is a bit unfortunate is the fact that the build script is not in TheRock and that it gets called in a release workflow. Furthermore, changes to the script upstream it will not trigger a CI run in TheRock.
Furthermore, with #1110, we plan to land a gating mechanism soon. Untested wheels (and I consider the JAX wheels untested for now) will go to a staging subdir in the bucket to allow testing. If tests pass, they get copied to the final location. This is something we might want to have for JAX wheels as well or at least want it reworked after #1110 was landed.
- We may need to add more packages to our index pages in the https://github.com/ROCm/TheRock/tree/main/build_tools/third_party/s3_management scripts.
Yes, without adding the package itself to manage.py, the index will not include the JAX wheel. Furthermore, it should be checked if additional packages must be added to update_dependencies.py.
- If the build process outputs (or depends on) wheels other than JAX, is there a chance that they could conflict with what the PyTorch build produces?
Yes, running the workflow to see what it produces and what it pushes to the dev is indeed something we should check.
There was a problem hiding this comment.
What is a bit unfortunate is the fact that the build script is not in TheRock and that it gets called in a release workflow. Furthermore, changes to the script upstream it will not trigger a CI run in TheRock.
Well it's the same for PyTorch. The scripts we have in TheRock are wrapping https://github.com/pytorch/pytorch/blob/main/setup.py. We're bootstrapping here, the bulk of support belongs upstream.
There was a problem hiding this comment.
So, do I need to do anything here to switch the bucket or anything? Or is this good as-is for now?
There was a problem hiding this comment.
I did add the packages that JAX needs to the list though. I think I've got everything in the right place.
There was a problem hiding this comment.
Have you run this workflow to test it? Can you share the logs?
If you wanted to test in this repository with the self-hosted runners, you could add a placeholder workflow first like with #828. This looks far enough along to land as-is, test, and then send new PRs for improvements.
There was a problem hiding this comment.
@charleshofer can you try with the self-hosted runner here and post the log. Would really like to land this..
There was a problem hiding this comment.
I've just run things through act to make sure the syntax was okay. I think it'd be easier to land this one rather than create another PR.
There was a problem hiding this comment.
It's hard to review a change a workflow file without being able to see any logs of the workflow running, especially for a new workflow. I can make an attempt to parse the code and walk through it by eye, but there are details that I can't foresee from just the code, like how long each step takes, if there are permissions issues, if there are syntax errors, etc.
There was a problem hiding this comment.
Will remember for future 🫡
There was a problem hiding this comment.
This input doesn't appear to be used yet. Safe to remove?
There was a problem hiding this comment.
Yeah, I'll remove this
There was a problem hiding this comment.
It's hard to review a change a workflow file without being able to see any logs of the workflow running, especially for a new workflow. I can make an attempt to parse the code and walk through it by eye, but there are details that I can't foresee from just the code, like how long each step takes, if there are permissions issues, if there are syntax errors, etc.
There was a problem hiding this comment.
Oh interesting, these package names encode the ROCm version? That isn't part of a version identifier or part of the index URL that the packages come from? As we rework packaging for PyTorch too, let's keep an eye on this.
Co-authored-by: Marius Brehler <marius.brehler@amd.com>
f2bd6dd to
3dcfa73
Compare
## Motivation Fixes problems with the JAX CI workflow that was created in #1033 ## Technical Details Fixes minor problems that make the workflow crash, and ensures some command-line tools that we need are installed. ## Test Plan Make sure the workflow works when run with Actions: https://github.com/ROCm/TheRock/actions/workflows/build_linux_jax_wheels.yml --------- Co-authored-by: Scott Todd <scott.todd0@gmail.com>



Adds a workflow for building JAX with release tarballs