fix(deps): revert to miniforge3:25.3.1-0 and pin libcurl==8.14.1 #324

Merged
gforsyth merged 18 commits into rapidsai:main from gforsyth:restore_openssl_pin
Nov 14, 2025

@gforsyth
Contributor

@gforsyth gforsyth commented Nov 13, 2025

OK, after a lot of back and forth and trial and error, we've identified that some combination of libcurl and certain versions of either mamba or conda (contained in the upstream miniforge3 image) interacts in a way that causes deadlocks in a small percentage of runs. That percentage is still high enough to cause at least a few deadlocks for every PR in this repo.

For now (2025/11/14), I have:

  • reverted back to the upstream miniforge3:25.3.1-0 image
  • pinned libcurl=8.14.1

The libcurl pin is set to the version of libcurl that is already installed in the base environment of the upstream image. The goal of the pin is to avoid swapping in a new version mid-network-transaction.
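One way to express this kind of pin is conda's `conda-meta/pinned` file, which holds one spec per line. The sketch below is only an illustration under assumptions: the `pin_package` helper and the `/opt/conda` prefix are hypothetical names, and this thread does not show which mechanism the PR actually uses.

```shell
# Sketch: pin a package in a conda environment by appending a spec to
# <prefix>/conda-meta/pinned.
# ASSUMPTION: pin_package and the prefix path are illustrative, not from this PR.
pin_package() {
  local prefix="$1"
  local spec="$2"
  local pinned="${prefix}/conda-meta/pinned"

  mkdir -p "${prefix}/conda-meta"
  touch "${pinned}"

  # Append only if this exact spec is not already listed, so reruns are harmless.
  grep -qxF -- "${spec}" "${pinned}" || printf '%s\n' "${spec}" >> "${pinned}"
}

# Example (assumes the base environment lives in /opt/conda):
#   pin_package /opt/conda "libcurl ==8.14.1"
```

Writing the pin before any `update --all` step means the solver sees the constraint first, which matches the stated goal of not replacing libcurl while it is in use.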

Noting here that this is still a temporary fix. We need to better isolate the root cause (between libcurl and mamba and conda versions) and then report upstream to the appropriate project/projects.

BUT, this does let us update openssl and stops the deadlocks, so that's still a plus.

@gforsyth gforsyth requested a review from a team as a code owner November 13, 2025 21:12
@gforsyth gforsyth requested review from KyleFromNVIDIA and removed request for a team November 13, 2025 21:12
@rockhowse
Contributor

rockhowse commented Nov 13, 2025

Yup we got 4 hangs as part of my PR here:

image

Not as egregious as I remember it... but still not ideal. 😔

@gforsyth
Contributor Author

Well, there are two jobs hanging here -- the ci-conda job and the miniforge-cuda job, both on openssl==3.5.2.

Could it be the issue is in libcurl and the pinning of openssl was giving us an earlier version of libcurl as a dependency resolution artifact?

@gforsyth gforsyth mentioned this pull request Nov 13, 2025
@rockhowse
Contributor

> Well, there are two jobs hanging here -- the ci-conda job and the miniforge-cuda job, both on openssl==3.5.2.
>
> Could it be the issue is in libcurl and the pinning of openssl was giving us an earlier version of libcurl as a dependency resolution artifact?

Quite possibly... however, if you review the way this hangs, it always seems to happen on the 7/10 step, and we are doing some fun things like conda clean -ailp, which I think is meant to remove some things in the conda environment from the docker container or something.

We are also NOT using any sort of local pull-through caching here, although the issue doesn't appear to be related to volume of connections or rate limiting: I was able to reliably trigger it in my PR with only 3 jobs, limiting the matrix to rockylinux, python-3.10(?), and amd64 to find the smallest set that would still reproduce it.

I haven't yet had time to do what @jakirkham recommended, which was to run some curl commands as part of the CI to see if we get similar responses.
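A minimal sketch of such a curl check is below. The `probe_url` helper, the attempt count, and the timeout values are all assumptions chosen for illustration; nothing in this thread specifies which URL or limits to use.

```shell
# Sketch: hit a URL several times with hard timeouts and count failures.
# ASSUMPTION: probe_url, the default of 5 attempts, and the 10s/30s limits
# are illustrative choices, not values taken from this PR.
probe_url() {
  local url="$1"
  local attempts="${2:-5}"
  local failures=0
  local i=1

  while [ "${i}" -le "${attempts}" ]; do
    # --connect-timeout bounds connection setup; --max-time bounds the whole transfer.
    if ! curl --silent --show-error --output /dev/null \
         --connect-timeout 10 --max-time 30 \
         --write-out "attempt ${i}: http=%{http_code} total=%{time_total}s\n" \
         "${url}"; then
      echo "attempt ${i}: curl failed or timed out" >&2
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done

  echo "failures=${failures}/${attempts}"
  [ "${failures}" -eq 0 ]
}

# Example: probe_url "https://conda.anaconda.org/conda-forge/noarch/repodata.json" 5
```

If plain curl never stalls under the same conditions where conda does, that would point away from the network path and toward how conda or mamba drives libcurl.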

If you enable -vvv you will see conda try to start up a thread pool to connect, and then it hangs: no DNS lookup, and no TLS-related handshakes of the kind you see with connections that are actually established and that fit solidly into the realm of OpenSSL vs libcurl. It's almost like a deadlock or race condition of some sort. It might be useful to dump some info using strace or something similar to show lower-level system activity.
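To make a hang fail fast instead of waiting for the job-level timeout, the update step could be wrapped in a deadline. This is a sketch under assumptions: `run_with_deadline` is a hypothetical helper, the 600 second limit in the example is arbitrary, and it relies on the GNU coreutils `timeout` command being available in the image.

```shell
# Sketch: run a command under a deadline and report when it looks like a hang.
# ASSUMPTION: run_with_deadline is an illustrative helper, not part of this repo.
run_with_deadline() {
  local seconds="$1"
  shift
  local status=0

  # timeout sends TERM at the deadline, then KILL 10 seconds later if the
  # process is still alive. It exits with 124 when the deadline was hit.
  timeout -k 10 "${seconds}" "$@" || status=$?

  if [ "${status}" -eq 124 ]; then
    echo "command exceeded ${seconds}s, treating it as a hang: $*" >&2
  fi
  return "${status}"
}

# Example: run_with_deadline 600 rapids-mamba-retry update --all -y -n base
#
# For lower-level detail, running the command as
#   strace -f -o trace.log <command>
# records the system calls of the command and its children to trace.log.
```

The distinct exit code 124 lets a CI step tell a hang apart from an ordinary solver or download failure.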

@gforsyth gforsyth added the DO NOT MERGE Hold off on merging; see PR for details label Nov 13, 2025
# update everything before other environment changes, to ensure mixing
# an older conda with newer packages still works well
- rapids-mamba-retry update --all -y -n base
+ rapids-mamba-retry update --all -y -n -vvvv base
Contributor

Yup, with this you will see where it's hanging... however, the buffer in the web interface won't update until the pipeline completes (is canceled or times out).

I kept thinking it was stuck at the place where the web console stops updating, but after termination, once the buffers are flushed, you can see it's actually terminating when conda is setting up to connect to conda-forge.

Image
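Because the web console can lag behind the real output, one option is to timestamp each log line as it is produced, so the gap before the hang is visible once the buffers are flushed. This sketch does not change how the console buffers output; `timestamp_lines` is a hypothetical helper introduced only for illustration.

```shell
# Sketch: prefix every line read from stdin with a UTC wall-clock timestamp.
# ASSUMPTION: timestamp_lines is an illustrative helper, not part of this repo.
timestamp_lines() {
  local line
  # The "|| [ -n ... ]" clause also emits a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "${line}" ]; do
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "${line}"
  done
}

# Example: some_command 2>&1 | timestamp_lines
# Note: in a pipeline the exit status is that of timestamp_lines unless the
# shell supports and enables "set -o pipefail".
```

The last timestamp before a long gap then marks where the process actually stopped making progress, independent of where the console stopped rendering.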

@gforsyth gforsyth force-pushed the restore_openssl_pin branch from 14d6794 to a854e5c Compare November 13, 2025 22:45
@gforsyth gforsyth force-pushed the restore_openssl_pin branch from f42e142 to 555306f Compare November 14, 2025 16:02
@gforsyth
Contributor Author

gforsyth commented Nov 14, 2025

Wow, OK, so reverting to the upstream miniforge3:25.3.1-0 image and pinning openssl and libcurl so those packages don't get upgraded in-place in the base environment seems to have resolved all of the hangs.

I'm going to try unpinning openssl but leaving the libcurl pin to see if that also works.

But for now, the things in the mix in re: the hanging behavior are:

Some possible culprits in the upstream release:

@gforsyth gforsyth changed the title Revert "refactor(deps): remove openssl pin <3.5.3 (#321)" fix(deps): revert to miniforge3:25.3.1-0 and pin libcurl==8.14.1 Nov 14, 2025
@gforsyth gforsyth removed the DO NOT MERGE Hold off on merging; see PR for details label Nov 14, 2025
@gforsyth gforsyth merged commit 931bc6c into rapidsai:main Nov 14, 2025
314 checks passed
@gforsyth gforsyth deleted the restore_openssl_pin branch November 14, 2025 18:37
Contributor

@rockhowse rockhowse left a comment

This set of changes is closer to the real issue. Ideally it should allow us to have a stable 25.12 release while we work out a way to use newer versions.

I agree with @bdice that we want to report this upstream.

3 participants