[torch] Log environment reproduction steps in test workflows by ScottTodd · Pull Request #2238 · ROCm/TheRock

ScottTodd · 2025-11-20T23:26:38Z

This writes a step summary for each of our "Test PyTorch Wheels" workflow runs showing how to reproduce the test environment without needing to sift through the logs, find workflow file sources, or even clone TheRock to use our scripts.

Important

Example workflow run: https://github.com/ROCm/TheRock/actions/runs/19944247475

I tried a few variations on the output format: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4. Here's an older test run with v1: https://github.com/ROCm/TheRock/actions/runs/19553166331.

I worked through a few design points:

* I want the instructions to be easy to copy/paste into issue reports and terminals, but there are two key branches:
  1. Starting a Docker container (which is optional). If you copy/paste a full set of instructions and one starts an interactive container the remaining instructions will not be run.
  2. If you already have pytorch checked out, you don't need to clone the repository.
* We probably want a few quick bullet points at the start listing the inputs (pytorch version), linking to the github branch, linking to the release index page, etc.
* If we have the pytorch tests generate a test report, we could also include that in the summary and even show how to run specific failed test cases individually (maybe limit to the first 10 failures?)
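For readers unfamiliar with step summaries: GitHub Actions exposes a file path in the `GITHUB_STEP_SUMMARY` environment variable, and markdown appended to that file is rendered on the workflow run page. The following is a minimal sketch of that mechanism only, not the script added in this PR; the helper name `append_step_summary` and the stdout fallback are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Optional


def append_step_summary(markdown: str) -> Optional[Path]:
    """Append markdown to the GitHub Actions step summary file, if one is set.

    Hypothetical helper: returns the summary path when running under GitHub
    Actions, or None after printing to stdout when run locally.
    """
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        # Not running under GitHub Actions, so just show the text.
        print(markdown)
        return None
    path = Path(summary_path)
    # Append rather than overwrite: other steps may write to the same file.
    with path.open("a", encoding="utf-8") as f:
        f.write(markdown + "\n")
    return path
```

Falling back to stdout keeps such a helper usable when run by hand outside CI.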

HereThereBeDragons

Overall a good idea.

Here are some thoughts for the moment.

I am not sure we want this running all the time in the pipeline.

Wouldn't it be less noise if we had a markdown document ("here are the steps to follow; export the variables from the CI run") and then had the CI just print

export AMDGPU_FAMILY="gfx1151"
export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
export PYTORCH_GIT_REF="release/2.7"

etc.?

* In the future, when I have finally finished #1732, we can also easily propagate the container image hash.
* Related to the different markdown versions: do workflow steps show proper markdown? I always thought it was just plain text.
* The Python venv should be an extra step and not part of the Docker step. Maybe have the apt install and the venv setup together.
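The "just print the exports" idea from this comment could be sketched roughly as follows. This is an illustration under assumptions, not code from the PR: the function name `format_exports` is hypothetical, and `shlex.quote` only adds quotes when a value needs them, so the output may differ cosmetically from the double-quoted form shown in the comment.

```python
import shlex


def format_exports(variables: dict[str, str]) -> str:
    """Render environment variables as shell `export` lines.

    Hypothetical helper: values are quoted with shlex.quote so the lines are
    safe to paste into a POSIX shell.
    """
    lines = [
        f"export {name}={shlex.quote(value)}" for name, value in variables.items()
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_exports(
            {
                "AMDGPU_FAMILY": "gfx1151",
                "TORCH_VERSION": "2.7.1+rocm7.10.0a20251120",
                "PYTORCH_GIT_REF": "release/2.7",
            }
        )
    )
```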

HereThereBeDragons · 2025-11-25T08:16:47Z

+    summary += "See [Running/testing PyTorch](https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch). "
+    summary += "For example:\n\n"
+    summary += "```bash\n"
+    summary += "PYTORCH_TEST_WITH_ROCM=1 python pytorch/test/run_test.py --include test_torch\n"


We are not using PYTORCH_TEST_WITH_ROCM, so I'm not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock; without fetch_sources it should be fast.

Alternatively, you should definitely also put an example with -k here, as I would say this is the preferred choice for debugging: running just selected tests rather than an entire test file.

ScottTodd · 2025-11-25T23:53:29Z

> We are not using PYTORCH_TEST_WITH_ROCM, so I'm not sure it is needed. As we are already checking out pytorch, which takes an eternity, we could also check out TheRock; without fetch_sources it should be fast.

That variable is used in https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py. These instructions are written from the perspective of upstream PyTorch (https://github.com/pytorch/pytorch), with no dependencies on any sources from TheRock.

Here I figured we could put the most basic instructions that match https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#test-the-pytorch-installation. If we link to https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#runningtesting-pytorch, we can also give more detailed instructions for using our scripts too.

> Alternatively, you should definitely also put an example with -k here, as I would say this is the preferred choice for debugging: running just selected tests rather than an entire test file.

What I want for that is to

1. generate a test report during testing
2. parse the test report to get a list of failures
3. list the failures and show commands to reproduce those specific failures

HereThereBeDragons

In that case, where you give a specific list of failed commands, I can understand it would be quite cool to list the command to rerun.

But maybe we can reduce it to "look here (URL) to set up the container" and then "here is the pytorch test command to run".
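The report-parsing idea discussed in this thread could look something like the sketch below. It rests on several assumptions that the thread does not confirm: that the tests emit a JUnit-style XML report with `classname` and `name` attributes on each `testcase`, that the caller knows which test file produced the report, and that failures are rerun through `pytest -k`. The function names and the cap of 10 (taken from the "first 10 failures" suggestion in the PR description) are illustrative.

```python
import shlex
import xml.etree.ElementTree as ET

# Cap from the PR description's "maybe limit to the first 10 failures?".
MAX_FAILURES = 10


def failed_tests(junit_xml: str) -> list[tuple[str, str]]:
    """Return (classname, name) for each failed or errored test case."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failures.append((case.get("classname", ""), case.get("name", "")))
    return failures


def repro_commands(junit_xml: str, test_file: str) -> list[str]:
    """Build one pytest command per failure, limited to MAX_FAILURES."""
    commands = []
    for classname, name in failed_tests(junit_xml)[:MAX_FAILURES]:
        # Keep only the class part of a dotted "module.Class" name.
        short_class = classname.rsplit(".", 1)[-1]
        expr = f"{short_class} and {name}" if short_class else name
        commands.append(
            f"python -m pytest {shlex.quote(test_file)} -k {shlex.quote(expr)}"
        )
    return commands
```

Parametrized test names containing brackets may need a different selection mechanism than `-k`, so a real implementation would likely need more care here.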

ScottTodd · 2025-11-26T00:08:03Z

+"""
+This summarizes the environment setup steps for the
+.github/workflows/test_pytorch_wheels.yml workflow.


> I am not sure we want this running all the time in the pipeline.
>
> Wouldn't it be less noise if we had a markdown document ("here are the steps to follow; export the variables from the CI run") and then had the CI just print
>
> export AMDGPU_FAMILY="gfx1151"
> export TORCH_VERSION="2.7.1+rocm7.10.0a20251120"
> export PYTORCH_GIT_REF="release/2.7"

I hear that, yeah. I think for many contributors who aren't as familiar with each CI pipeline, having a nicely formatted summary will make workflow results easier to understand. I know where in the logs to look for reproduction steps across each workflow type, but many developers do not.

HereThereBeDragons

See my comments in the discussion above: maybe have a generic file in the docs on how to set up Docker for pytorch tests, and then have the CI reference that doc by URL plus the pytorch test command.

ScottTodd

Here's a new attempt: https://gist.github.com/ScottTodd/6a465a4958fdaea59ede417434ba64b4#file-v4-md, which I'll write the code for now.

The output will be in this format:

PyTorch Test Report

torch version: 2.7.1+rocm7.10.0a20251120

GPU family: gfx110X-dgpu

Package index: https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu

Source code: https://github.com/ROCm/pytorch/tree/release/2.7

To reproduce, see Running/testing PyTorch and setup with:

# Fetch pytorch source files, including tests:
git clone --branch release/2.7 --origin rocm https://github.com/ROCm/pytorch.git
# Install torch and test requirements
pip install ^
  --index-url=https://rocm.nightlies.amd.com/v2-staging/gfx110X-dgpu ^
  torch==2.7.1+rocm7.10.0a20251120
pip install -r pytorch/.ci/docker/requirements-ci.txt
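As a rough illustration of how a report in this shape might be assembled from workflow inputs, here is a small sketch. It is not the contents of `summarize_test_pytorch_workflow.py`; the function name, parameter names, and the way the package index URL is derived from a base URL plus the GPU family are all assumptions for the example.

```python
# Build the fence marker programmatically so this example nests cleanly
# inside a markdown code block.
FENCE = "`" * 3


def format_report(
    torch_version: str,
    gpu_family: str,
    index_base_url: str,
    pytorch_git_ref: str,
) -> str:
    """Return a markdown test report with reproduction steps (hypothetical)."""
    # Assumption: the per-family index lives at <base>/<gpu_family>.
    index_url = f"{index_base_url.rstrip('/')}/{gpu_family}"
    source_url = f"https://github.com/ROCm/pytorch/tree/{pytorch_git_ref}"
    lines = [
        "## PyTorch Test Report",
        "",
        f"* torch version: `{torch_version}`",
        f"* GPU family: `{gpu_family}`",
        f"* Package index: {index_url}",
        f"* Source code: {source_url}",
        "",
        "To reproduce, set up with:",
        "",
        f"{FENCE}bash",
        "# Fetch pytorch source files, including tests:",
        f"git clone --branch {pytorch_git_ref} --origin rocm https://github.com/ROCm/pytorch.git",
        "# Install torch and test requirements",
        f"pip install --index-url={index_url} torch=={torch_version}",
        "pip install -r pytorch/.ci/docker/requirements-ci.txt",
        FENCE,
    ]
    return "\n".join(lines) + "\n"
```

Keeping the Docker step out of the generated text avoids the copy/paste problem noted in the PR description, where an interactive container would swallow the remaining commands.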

jeffdaily · 2025-12-04T22:13:06Z

Approved, though I personally found the optional docker step very helpful.

# (Optional) Run under Docker
sudo docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --ipc=host --group-add=video --group-add=render --group-add=110 \
  ghcr.io/rocm/no_rocm_image_ubuntu24_04:latest
sudo apt install python3.12-venv -y

HereThereBeDragons

I like the brevity of v4.

Fixes #2173

ScottTodd added 4 commits November 20, 2025 13:12

Rename "pytorch_version" to "pytorch_ref" to disambiguate further

d73dd8e

Introduce summarize_test_pytorch_workflow.py.

d6d8ecb

Call new script from test workflow

75be90f

Iterate on output format.

fade399

ScottTodd requested a review from HereThereBeDragons November 20, 2025 23:26

github-project-automation Bot added this to TheRock Triage Nov 20, 2025

github-project-automation Bot moved this to TODO in TheRock Triage Nov 20, 2025

ScottTodd added 2 commits November 20, 2025 16:09

Revert "Rename "pytorch_version" to "pytorch_ref" to disambiguate further"

7b9d74d

This reverts commit d73dd8e.

Rename arg in script to pytorch_git_ref, revert workflow to old name

a1f7b6e

ScottTodd commented Nov 21, 2025

Comment thread .github/workflows/test_pytorch_wheels.yml Outdated

HereThereBeDragons reviewed Nov 25, 2025

ScottTodd commented Nov 26, 2025

ScottTodd added 3 commits December 4, 2025 09:48

Merge remote-tracking branch 'upstream/main' into users/scotttodd/torch-repro-logging

d89256e

Simplify markdown output, move some information into docs.

4c6c3ae

Add newline before code block

ef81118

ScottTodd marked this pull request as ready for review December 4, 2025 21:36

ScottTodd requested review from HereThereBeDragons and jeffdaily December 4, 2025 21:36

jeffdaily approved these changes Dec 4, 2025

ScottTodd commented Dec 4, 2025

Comment thread .github/workflows/test_pytorch_wheels.yml

HereThereBeDragons approved these changes Dec 5, 2025

Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py

Comment thread build_tools/github_actions/summarize_test_pytorch_workflow.py Outdated

ScottTodd mentioned this pull request Dec 5, 2025

[Issue] Linux pytorch failing TestNN::test_broadcast_no_grad with "RuntimeError: NCCL Error 1: unhandled cuda error" #2165

Closed

Add python version, capitalize Torch version

052f401

ScottTodd merged commit a859b6f into main Dec 5, 2025
11 checks passed

ScottTodd deleted the users/scotttodd/torch-repro-logging branch December 5, 2025 19:49

github-project-automation Bot moved this from TODO to Done in TheRock Triage Dec 5, 2025
