
[torch] Log environment reproduction steps in test workflows#2238

Merged
ScottTodd merged 10 commits into
mainfrom
users/scotttodd/torch-repro-logging
Dec 5, 2025
Conversation

@ScottTodd
Member

@ScottTodd ScottTodd commented Nov 20, 2025

Fixes #2173

This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.

I worked through a few design points:

  • I want the instructions to be easy to copy/paste into issue reports and terminals, but there are two key branches:
    1. Starting a Docker container (which is optional). If you copy/paste a full set of instructions and one of them starts an interactive container, the remaining instructions will not run.
    2. If you already have pytorch checked out, you don't need to clone the repository.
  • We probably want a few quick bullet points at the start listing the inputs (pytorch version), linking to the github branch, linking to the release index page, etc.
  • If we have the pytorch tests generate a test report, we could also include that in the summary and even show how to run specific failed test cases individually (maybe limit to the first 10 failures?)
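For context, GitHub Actions renders Markdown that a step appends to the file named by the `GITHUB_STEP_SUMMARY` environment variable. Below is a minimal sketch of writing such a summary from Python. The helper name `write_step_summary` and the fallback to stdout are illustrative assumptions, not necessarily what this PR's script does.

```python
import os
import sys


def write_step_summary(markdown: str) -> None:
    """Append Markdown to the GitHub Actions step summary, if available."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        # Assumed fallback for local runs outside GitHub Actions: print instead.
        sys.stdout.write(markdown)
        return
    # Append so that anything written earlier in the same step is preserved.
    with open(summary_path, "a", encoding="utf-8") as f:
        f.write(markdown)
```

Calling the helper several times builds the summary up incrementally, which would let the inputs, setup commands, and any failure list be written as separate sections.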

Comment thread .github/workflows/test_pytorch_wheels.yml Outdated
Contributor

@HereThereBeDragons HereThereBeDragons left a comment

Overall a good idea.

Here are some thoughts for the moment.

  1. I am not sure we want this running in the pipeline all the time.

Wouldn't it be less noisy to have a Markdown document saying "here are the steps to follow; export the variables from the CI run", and then have the CI just print

export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

etc.?

  2. In the future, when I have finally finished #1732, we can also easily propagate the container image hash.

  3. Related to the different Markdown versions:
    do workflow steps show proper Markdown? I always thought it was just plain text.

  4. The Python venv should be an extra step and not part of Docker. Maybe keep the apt install and the venv setup together.
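The "just print the exports" alternative from point 1 could look roughly like the sketch below. The function name `format_exports` is hypothetical, and the escaping covers only the characters that remain special inside double quotes in POSIX shells.

```python
def format_exports(variables: dict[str, str]) -> str:
    """Render variables as shell `export NAME="value"` lines."""
    lines = []
    for name, value in variables.items():
        escaped = value
        # Escape the backslash first so later escapes are not doubled.
        for ch in ("\\", '"', "$", "`"):
            escaped = escaped.replace(ch, "\\" + ch)
        lines.append(f'export {name}="{escaped}"')
    return "\n".join(lines)


# Values taken from the example in this comment.
print(format_exports({
    "AMDGPU_FAMILY": "gfx1151",
    "TORCH_VERSION": "2.7.1+rocm7.10.0a20251120",
    "PYTORCH_GIT_REF": "release/2.7",
}))
```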

summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "
summary += "For example:\n\n"
summary += "```bash\n"
summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"
Contributor

We are not using this PYTORCH_TEST_WITH_ROCM, so I am not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock. Without fetch_sources it should be fast.

Alternatively, you should definitely also put an example with -k here, as I would say this is the preferred choice for debugging: running just selected tests rather than an entire test file.

Member Author

> We are not using this PYTORCH_TEST_WITH_ROCM, so I am not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock. Without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

> Alternatively, you should definitely also put an example with -k here, as I would say this is the preferred choice for debugging: running just selected tests rather than an entire test file.

What I want for that is to

  1. generate a test report during testing
  2. parse the test report to get a list of failures
  3. list the failures and show commands to reproduce those specific failures
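The three steps above can be sketched as follows. This assumes the test run emits a JUnit-style XML report (for example via pytest's `--junit-xml`) and that each `classname` looks like `test_torch.TestTorch`. The helper names, the cap of 10, and the exact `-k` spelling for `run_test.py` are assumptions to verify against `run_test.py --help`.

```python
import xml.etree.ElementTree as ET

MAX_FAILURES = 10  # Assumed cap, following the "first 10 failures" idea.


def failed_tests(junit_xml: str) -> list[tuple[str, str]]:
    """Return (classname, test name) for each failed or errored test case."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failures.append((case.get("classname", ""), case.get("name", "")))
    return failures


def repro_commands(failures: list[tuple[str, str]]) -> list[str]:
    """Build one rerun command per failure, capped at MAX_FAILURES."""
    commands = []
    for classname, name in failures[:MAX_FAILURES]:
        # Assumption: the first dotted component is the test module name
        # accepted by --include.
        module = classname.split(".")[0]
        commands.append(
            "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py "
            f"--include {module} -k {name}"
        )
    return commands
```

The resulting commands could then be appended to the step summary as a short list below the setup instructions.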

Contributor

In that case, where you give a specific list of failures, I can understand it would be quite cool to list the commands to rerun them.

But maybe we can reduce it to "look at this URL to set up the container" and then "here is the pytorch test command to run".

Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py Outdated
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py Outdated
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py Outdated
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py
Comment on lines +3 to +5
"""
This summarizes the environment setup steps for the
.github/workflows/test_pytorch_wheels.yml workflow.
Member Author

> 1. I am not sure we want this running in the pipeline all the time.
>
> Wouldn't it be less noisy to have a Markdown document saying "here are the steps to follow; export the variables from the CI run", and then have the CI just print
>
> export AMDGPU_FAMILY="gfx1151"
> export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
> export PYTORCH_GIT_REF="release/2.7"

I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.

Contributor

See my comments in the discussions above: maybe have a generic file in the doc/ on how to set up Docker for pytorch tests, and then have the CI reference that doc by URL plus the pytorch test command.

Member Author

Here's a new attempt: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4#file-v4-md, which I'll write the code for now.

The output will be in this format:


PyTorch Test Report

To reproduce, see Running/testing PyTorch and setup with:

# Fetch pytorch source files, including tests:
git clone --branch release/2.7 --origin rocm https://github.com/ROCm/pytorch.git

# Install torch and test requirements
pip install ^
  --index-url=https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu ^
  torch==2.7.1+rocm7.10.0a20251120
pip install -r pytorch/.ci/docker/requirements-ci.txt
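A sketch of how the setup commands in this format could be rendered from the workflow inputs. The function name, parameter names, and the `continuation` switch (`\` for POSIX shells, `^` for Windows `cmd`) are illustrative assumptions rather than the merged implementation.

```python
FENCE = "`" * 3  # Built programmatically to avoid nesting literal fences.


def render_setup_steps(
    git_ref: str,
    index_url: str,
    torch_version: str,
    continuation: str = "\\",
) -> str:
    """Render the clone and install commands as a fenced Markdown block."""
    lines = [
        f"{FENCE}bash",
        "# Fetch pytorch source files, including tests:",
        f"git clone --branch {git_ref} --origin rocm "
        "https://github.com/ROCm/pytorch.git",
        "",
        "# Install torch and test requirements",
        f"pip install {continuation}",
        f"  --index-url={index_url} {continuation}",
        f"  torch=={torch_version}",
        "pip install -r pytorch/.ci/docker/requirements-ci.txt",
        FENCE,
    ]
    return "\n".join(lines) + "\n"


# Values taken from the example output above.
print(render_setup_steps(
    git_ref="release/2.7",
    index_url="https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu",
    torch_version="2.7.1+rocm7.10.0a20251120",
    continuation="^",
))
```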

@ScottTodd ScottTodd marked this pull request as ready for review December 4, 2025 21:36
@jeffdaily

Approved, though I personally found the optional docker step very helpful.

# (Optional) Run under Docker
sudo docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --ipc=host --group-add=video --group-add=render --group-add=110 \
  ghcr.io/rocm/no_rocm_image_ubuntu24_04:latest
sudo apt install python3.12-venv -y
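If the optional Docker step were generated rather than hard-coded, it might be assembled as in the sketch below. The function and parameter names are hypothetical. Emitting it as its own block, separate from the setup commands, follows the design point in the PR description: an interactive container would prevent any commands pasted after it from running.

```python
import shlex


def docker_run_command(image: str, devices: list[str], groups: list[str]) -> str:
    """Build a single-line `docker run` command for an interactive container."""
    argv = ["sudo", "docker", "run", "-it"]
    argv += [f"--device={device}" for device in devices]
    argv.append("--ipc=host")
    argv += [f"--group-add={group}" for group in groups]
    argv.append(image)
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(argv)


# Values taken from the command quoted in this comment.
print(docker_run_command(
    image="ghcr.io/rocm/no_rocm_image_ubuntu24_04:latest",
    devices=["/dev/kfd", "/dev/dri"],
    groups=["video", "render", "110"],
))
```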

Comment thread .github/workflows/test_pytorch_wheels.yml
Contributor

@HereThereBeDragons HereThereBeDragons left a comment

I like the brevity of v4.

Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py
Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py Outdated
@ScottTodd ScottTodd merged commit a859b6f into main Dec 5, 2025
11 checks passed
@ScottTodd ScottTodd deleted the users/scotttodd/torch-repro-logging branch December 5, 2025 19:49
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Dec 5, 2025
rponnuru5 pushed a commit that referenced this pull request Dec 9, 2025
Fixes #2173

This writes a step summary for each of our "Test PyTorch Wheels"
workflow runs showing how to reproduce the test environment without
needing to sift through the logs, find workflow file sources, or even
clone TheRock to use our scripts.

> [!IMPORTANT]
> Example workflow run:
https://github.com/ROCm/TheRock/actions/runs/19944247475
>
> I tried a few variations on the output format:
https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4.
Here's an older test run with v1:
https://github.com/ROCm/TheRock/actions/runs/19553166331.


Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature] Log simple reproduction steps in PyTorch test workflows

4 participants