pytorch 2.7.1; switch label on windows; turn on artefact persistence#391

Merged
h-vetinari merged 10 commits into
conda-forge:main from
h-vetinari:channels
Jun 17, 2025

Conversation

@h-vetinari
Member

These are each intended as a work-around for conda/infrastructure#1159. Let's hope at least one works out.

@conda-forge-admin
Contributor

conda-forge-admin commented May 29, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/15680385882. Examine the logs at this URL for more detail.

@mgorny
Contributor

mgorny commented May 29, 2025

I'm all for it if it works. Presumably we'll get the artifact from the pull request CI runs, correct? If so, perhaps we could limit CI to just the problematic Windows build.

@h-vetinari
Member Author

@wolfv @baszalmstra, sorry for the ping, just checking the health of the Windows server again, as the two Windows jobs here didn't start. There's one Windows job running in another job here, but in the past it was possible to have up to 4 concurrent jobs; perhaps there are other long-running jobs elsewhere that I cannot see? 🤔

@h-vetinari
Member Author

Seems that on the 9th try, the CI from the merge of #383 actually ran through 🥳

@h-vetinari h-vetinari changed the title from "Switch label on windows; turn on artefact persistence" to "pytorch 2.7.1; switch label on windows; turn on artefact persistence" Jun 5, 2025
@mgorny
Contributor

mgorny commented Jun 7, 2025

Kinda surprised they didn't bump Triton pin, but I guess the new version didn't have any significant changes.

These 3 generic+CUDA failures don't look important but mkl+CUDA looks significant:

2025-06-05T15:25:29.2504494Z E       torch._inductor.exc.InductorError: SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats

I'm guessing something went really wrong somewhere.

@mgorny
Contributor

mgorny commented Jun 11, 2025

Looks like it was a flake after all.

@h-vetinari
Member Author

Unfortunately we still have a bunch of test failures, including a few that look very related to triton - perhaps pointing at some interaction with conda-forge/triton-feedstock#51

@h-vetinari h-vetinari mentioned this pull request Jun 13, 2025
@h-vetinari
Member Author

We still have a problem when testing the CUDA builds:

=========================== short test summary info ============================
FAILED [1.1922s] test/inductor/test_torchinductor.py::GPUTests::test_isinf_cuda - AssertionError: TypeError not raised

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py GPUTests.test_isinf_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [6.1424s] test/inductor/test_torchinductor.py::GPUTests::test_linear_dynamic_maxautotune_cuda - torch._inductor.exc.InductorError: NoValidChoicesError: 

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py GPUTests.test_linear_dynamic_maxautotune_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [3.7207s] test/inductor/test_torchinductor.py::TritonCodeGenTests::test_donated_buffer_inplace_gpt - IndexError: list index out of range

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py TritonCodeGenTests.test_donated_buffer_inplace_gpt

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 3 failed, 14739 passed, 2841 skipped, 91 xfailed, 86661 warnings in 5886.94s (1:38:06) =

The stacktrace looks very much triton related, but then again, we're pulling in the same version (3.3.0, not 3.3.1) as for the last passing build.
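
For anyone rerunning these locally, the repro commands and debugging variables quoted in the log can be assembled as in the sketch below. The helper names `repro_command` and `debug_env` are hypothetical, and using `sys.executable` instead of a bare `python` is an assumption; only the command shape and the two environment variables come from the log above.

```python
import os
import sys

# Node IDs copied from the "short test summary info" block above.
FAILED_NODE_IDS = [
    "test/inductor/test_torchinductor.py::GPUTests::test_isinf_cuda",
    "test/inductor/test_torchinductor.py::GPUTests::test_linear_dynamic_maxautotune_cuda",
    "test/inductor/test_torchinductor.py::TritonCodeGenTests::test_donated_buffer_inplace_gpt",
]


def repro_command(node_id):
    """Turn a pytest node ID into the 'python <file> <Class>.<test>' form from the log."""
    path, cls, test = node_id.split("::")
    # Assumption: the current interpreter stands in for the bare `python` in the log.
    return [sys.executable, path, f"{cls}.{test}"]


def debug_env(base=None):
    """Copy an environment and add the verbosity settings suggested by the log."""
    env = dict(os.environ if base is None else base)
    env["TORCHDYNAMO_VERBOSE"] = "1"  # internal stack trace, as suggested in the log
    env["TORCH_LOGS"] = "+dynamo"     # extra developer context, as suggested in the log
    return env


if __name__ == "__main__":
    for node_id in FAILED_NODE_IDS:
        # Only print the commands; running them needs a PyTorch checkout and a GPU.
        print(" ".join(repro_command(node_id)))
```

To actually run one of them, something like `subprocess.run(repro_command(node_id), env=debug_env(), cwd=...)` from the base repo dir should work, with `cwd` pointing at a PyTorch source checkout.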

@h-vetinari
Member Author

Looking closer, these may be genuine triton bugs

/tmp/tmp[...].py:17:0: error: Failures have been detected while processing an MLIR pass pipeline
/tmp/tmp[...].py:17:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]:
                       reproducer generated at `std::errs, please share the reproducer above with Triton project.`

@h-vetinari
Member Author

OK, triton 3.3.1 makes the problems worse. We go from 3 test failures to

= 621 failed, 14121 passed, 2841 skipped, 91 xfailed, 86632 warnings in 5879.43s (1:37:59) =
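
As a rough sanity check on how the two runs relate, the summary lines can be parsed and compared. This is only a sketch: `parse_summary` and `diff_counts` are hypothetical helper names, and the regular expression assumes the pytest summary format shown in this thread.

```python
import re

# Summary lines copied verbatim from the two CI runs quoted in this thread.
SUMMARY_TRITON_330 = "= 3 failed, 14739 passed, 2841 skipped, 91 xfailed, 86661 warnings in 5886.94s (1:38:06) ="
SUMMARY_TRITON_331 = "= 621 failed, 14121 passed, 2841 skipped, 91 xfailed, 86632 warnings in 5879.43s (1:37:59) ="


def parse_summary(line):
    """Map each outcome name in a pytest summary line to its count."""
    # Matches "<count> <outcome>" pairs such as "3 failed" or "91 xfailed".
    return {name: int(count) for count, name in re.findall(r"(\d+) ([a-z]+)", line)}


def diff_counts(before, after):
    """Return after - before for every outcome present in either run."""
    keys = sorted(set(before) | set(after))
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


if __name__ == "__main__":
    before = parse_summary(SUMMARY_TRITON_330)
    after = parse_summary(SUMMARY_TRITON_331)
    for outcome, delta in diff_counts(before, after).items():
        print(f"{outcome}: {delta:+d}")
```

On the two lines quoted here, the failure count rises by 618 and the pass count drops by the same 618, while skipped and xfailed stay unchanged, which suggests the new failures are tests that passed with triton 3.3.0 rather than newly collected ones.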

@mgorny
Contributor

mgorny commented Jun 16, 2025

Uh, SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats again. I don't think it's related to Triton version, though — I've seen it before with 3.3.0. It looks like something randomly messing it up.

Perhaps triton-lang/triton#6928 can help? Not sure about the necessity of the .c changes, but the .py change was what I thought of when I looked at the code too.

@h-vetinari
Member Author

Ok, it seems we're finally on the way to green here (not sure what your comments were referring to @mgorny - perhaps you meant #393?).

@h-vetinari
Member Author

@wolfv @baszalmstra quick question: did the Windows server get resized recently? We used to have up to 4 concurrent jobs, though in recent weeks it seems we're down to at most one job at a time.

@wolfv
Member

wolfv commented Jun 17, 2025

We didn't touch anything

@h-vetinari
Member Author

h-vetinari commented Jun 17, 2025

Thanks for the response! In that case, what I imagine might have happened is that there are some dead jobs from some failure somewhere along the line that are clogging up the queue. Is there a server dashboard that I could perhaps get access to? Being able to delete stale jobs would help a lot (if you trust me to be careful with that responsibility... FWIW, I have the same kind of access for https://github.com/Quansight/open-gpu-server, including deleting the occasional stale job).

@mgorny
Contributor

mgorny commented Jun 17, 2025

(not sure what your comments were referring to @mgorny - perhaps you meant #393?)

This was in reply to:

OK, triton 3.3.1 makes the problems worse. We go from 3 test failures to

= 621 failed, 14121 passed, 2841 skipped, 91 xfailed, 86632 warnings in 5879.43s (1:37:59) =

@h-vetinari
Member Author

OK, let's get this in. Any follow-up discussions can be addressed in #393

@h-vetinari h-vetinari merged commit 55fb9b8 into conda-forge:main Jun 17, 2025
34 of 35 checks passed
@h-vetinari h-vetinari deleted the channels branch June 17, 2025 22:06
@h-vetinari
Member Author

Sigh, now the PY_SSIZE_T_CLEAN also appeared in one job after merging here. But at least the windows libtorch builds got uploaded 🥳

4 participants