[torch] Log environment reproduction steps in test workflows #2238
HereThereBeDragons
left a comment
Overall a good idea. Here are some thoughts for the moment.
- I am not sure we want this running all the time in the pipeline. Wouldn't it be less noise to have a markdown doc ("here are the steps to follow, and export the variables from the CI run") and then have the CI just print
export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"
etc.?
- In the future, once I have finally finished #1732, we can also easily propagate the container image hash.
- Related to the different markdown versions: do workflow steps show proper markdown? I always thought it was just plain text.
- The Python venv should be an extra step and not part of Docker. Maybe put the apt install and the venv setup together.
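The "just print the exports" idea from the first bullet could be sketched roughly as follows. This is a minimal illustration, not code from the PR: the helper name `format_exports` is hypothetical, and the variable names are simply the three listed in the comment above.

```python
import os
import shlex

# Variable names taken from the review comment; the real workflow may use others.
REPRO_VARS = ("AMDGPU_FAMILY", "TORCH_VERSION", "PYTORCH_GIT_REF")


def format_exports(env, names=REPRO_VARS):
    """Return one `export NAME=value` line for each variable that is set in env."""
    # shlex.quote only adds quotes when the value needs them (e.g. it has spaces).
    return [f"export {name}={shlex.quote(env[name])}" for name in names if name in env]


if __name__ == "__main__":
    # In CI this would print the lines a developer can paste into a shell.
    print("\n".join(format_exports(os.environ)))
```

Variables that are not set are skipped rather than printed as empty, so the pasted lines never overwrite a value with an empty string.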
summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "
summary += "For example:\n\n"
summary += "```bash\n"
summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"
We are not using PYTORCH_TEST_WITH_ROCM, so I am not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock; without fetch_sources it should be fast.
Alternatively, you should definitely also put an example with -k here, as I would say that is the preferred choice for debugging: not running an entire test file but just selected tests.
> We are not using PYTORCH_TEST_WITH_ROCM, so I am not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock; without fetch_sources it should be fast.
That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.
Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.
> Alternatively, you should definitely also put an example with -k here, as I would say that is the preferred choice for debugging: not running an entire test file but just selected tests.
What I want for that is to
- generate a test report during testing
- parse the test report to get a list of failures
- list the failures and show commands to reproduce those specific failures
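The three steps above could be sketched like this. It is only an illustration under stated assumptions: it assumes the test run writes a JUnit-style XML report (for example via pytest's `--junit-xml` option), the helper names `failed_tests` and `repro_commands` are hypothetical, and the test file path is passed in by the caller because mapping a report entry back to a file is not covered here. The default cap of 10 failures follows the limit floated in the PR description.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> list[tuple[str, str]]:
    """Return (class name, test name) for every failed or errored testcase."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            # Keep only the last dotted component, e.g. "test_torch.TestTorch" -> "TestTorch".
            class_name = case.get("classname", "").rsplit(".", 1)[-1]
            failures.append((class_name, case.get("name", "")))
    return failures


def repro_commands(junit_xml: str, test_file: str, limit: int = 10) -> list[str]:
    """Build one pytest command per failure, selecting the test with a -k expression."""
    commands = []
    for class_name, test_name in failed_tests(junit_xml)[:limit]:
        commands.append(
            f'PYTORCH_TEST_WITH_ROCM=1 python -m pytest {test_file} -k "{class_name} and {test_name}"'
        )
    return commands
```

Parametrized test names containing brackets may not be valid `-k` expressions, so a real implementation would likely need to escape or simplify them.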
In the case where you give a specific list of failures, I can see it would be quite cool to list the commands to rerun them.
But maybe we can reduce the rest to "look at this URL to set up the container" and then "here is the pytorch test command to run".
"""
This summarizes the environment setup steps for the
.github/workflows/test_pytorch_wheels.yml workflow.
> - I am not sure we want this running all the time in the pipeline. Wouldn't it be less noise to have a markdown doc ("here are the steps to follow, and export the variables from the CI run") and then have the CI just print
> export AMDGPU_FAMILY="gfx1151"
> export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
> export PYTORCH_GIT_REF="release/2.7"
I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.
See my comments in the discussions above: maybe have a generic file in the doc/ on how to set up Docker for pytorch tests, and then have the CI reference that doc by URL plus the pytorch test command.
Here's a new attempt: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4#file-v4-md, which I'll write the code for now.
The output will be in this format:
PyTorch Test Report
- torch version: 2.7.1+rocm7.10.0a20251120
- GPU family: gfx110X-dgpu
- Package index: https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu
- Source code: https://github.com/ROCm/pytorch/tree/release/2.7
To reproduce, see Running/testing PyTorch and set up with:
# Fetch pytorch source files, including tests:
git clone --branch release/2.7 --origin rocm https://github.com/ROCm/pytorch.git
# Install torch and test requirements
pip install ^
--index-url=https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu ^
torch==2.7.1+rocm7.10.0a20251120
pip install -r pytorch/.ci/docker/requirements-ci.txt
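For reference, a report in this shape could be produced along the following lines. This is a hedged sketch rather than the code in the PR: `build_summary` and `write_summary` are hypothetical names, and the only GitHub-specific piece relied on is the `GITHUB_STEP_SUMMARY` environment variable, which points at a file whose contents GitHub renders as Markdown on the run's summary page.

```python
import os

# Built at runtime so this example does not contain a nested code fence.
FENCE = "`" * 3


def build_summary(torch_version, gpu_family, index_url, git_ref):
    """Return the report as Markdown text (inputs come from the workflow)."""
    lines = [
        "## PyTorch Test Report",
        "",
        f"- torch version: `{torch_version}`",
        f"- GPU family: `{gpu_family}`",
        f"- Package index: {index_url}",
        f"- Source code: https://github.com/ROCm/pytorch/tree/{git_ref}",
        "",
        "To reproduce, set up with:",
        "",
        FENCE + "bash",
        "# Fetch pytorch source files, including tests:",
        f"git clone --branch {git_ref} --origin rocm https://github.com/ROCm/pytorch.git",
        "# Install torch and test requirements",
        f"pip install --index-url={index_url} torch=={torch_version}",
        "pip install -r pytorch/.ci/docker/requirements-ci.txt",
        FENCE,
    ]
    return "\n".join(lines) + "\n"


def write_summary(text, env=None):
    """Append to the step summary file if available, else print. Returns True if a file was written."""
    env = os.environ if env is None else env
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        # Outside GitHub Actions: fall back to stdout so the script still works locally.
        print(text)
        return False
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)
    return True
```

The pip command is kept on a single line here so that the output does not depend on a shell-specific line-continuation character.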
Approved, though I personally found the optional docker step very helpful.
HereThereBeDragons
left a comment
i like the brevity of v4
Fixes #2173
This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.
Important
Example workflow run: https://github.com/ROCm/TheRock/actions/runs/19944247475
I tried a few variations on the output format: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4. Here's an older test run with v1: https://github.com/ROCm/TheRock/actions/runs/19553166331.
I worked through a few design points: