Skip to content

Switched to static linking of llvm#100

Merged
beckermr merged 11 commits into
conda-forge:mainfrom
timostrunk:switch_to_static
Mar 24, 2025
Merged

Switched to static linking of llvm#100
beckermr merged 11 commits into
conda-forge:mainfrom
timostrunk:switch_to_static

Conversation

@timostrunk

@timostrunk timostrunk commented Feb 13, 2025

Copy link
Copy Markdown
Contributor

I switched to static linking of libllvm here and added it to the ignore_run_exports on unix. I did not change the build behaviour on Windows as it seems to already build with default settings there and I have no means to test it.

This fixes #99 and #84.

Reasons against this PR:

  • Obviously this will increase binary size, because libllvm is now statically linked. Im explicitly tagging @xhochy here, because he was against static linking this.
  • If the symbol version issue would be identified and resolved it would be cleaner

Reasons to merge this PR:

  • It resolves a hard to debug issue, which occurs using libraries, where debugging skill is required to find out, why a segmentation fault actually happened
  • It can easily be reverted, once the symbol isolation is gone.

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@timostrunk

Copy link
Copy Markdown
Contributor Author

@conda-forge-admin, please rerender

@conda-forge-admin

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

@timostrunk

Copy link
Copy Markdown
Contributor Author

ppc64le still broken, will look into it

@timostrunk

Copy link
Copy Markdown
Contributor Author

ppc64le still broken, will look into it

build works locally on cross-compile and goes further than action runner. Test fail (as expected) due to invalid binary format.

@timostrunk timostrunk marked this pull request as draft February 14, 2025 10:57
@timostrunk

Copy link
Copy Markdown
Contributor Author

I will try to dynamic link on ppc64le still. I think the native build has issues. The local crossbuild on my machine works though, but I don't want to sacrifice the tests, which I would need to do, if I let the github runner build on linux64.

Reverted the PR to draft, because there will be some noise.

@timostrunk

Copy link
Copy Markdown
Contributor Author

@conda-forge-admin, please rerender

@conda-forge-admin

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13329229741. Examine the logs at this URL for more detail.

@timostrunk timostrunk marked this pull request as ready for review February 14, 2025 13:09
@timostrunk

Copy link
Copy Markdown
Contributor Author

@jakirkham : This is now ready for review. ppc64le still builds shared, because I was unable to static link on a native ppc64le platform.

@jakirkham jakirkham left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already discussed at length previously ( #73 ), will not be accepting a change to static linking

Think we need to find a better way to address the issue users are encountering. One avenue may be changing the linker flags (stripping, symbol visibility, etc.)

@timostrunk

Copy link
Copy Markdown
Contributor Author

My analysis showed that the cause of the issue might not be in llvmlite, so changing the flags here won't change much. The issue is that libllvm15 links into libllvm19, so I presume the right place for a fix is there. If my analysis is right libllvm15 should be in conflict with libllvm19, as it is not safe to have both in the same environment.

Until this fix is found it is desastrous for the conda-forge ecosystem not to provide a workaround. Currently not merging this means that people using pytorch and numba in the same conda environment either

  1. face segmentation faults
  2. need to downpin pytorch and keep the pin until llvmlite gets libllvm19 compatibility at which point this starts working again 'by chance' and will most probably break again with the next libllvm release.

numba and pytorch is a very common combination. Therefore: Please reconsider merging this. It can be reverted once the issue is solved.

@timostrunk

Copy link
Copy Markdown
Contributor Author

Also in #73 you write:
"Users have been working with Numba on CUDA for some time without issues and this uses the dynamic linking solution we currently have

Not following the argument for static linking"

The reason this is working is only because numba and pytorch-cuda required the same libllvm version at this precise time. When libllvm14 was used by llvmlite and libllvm15 was used by pytorch-cuda, it broke. Now the same thing happens with libllvm15 and 19.

@h-vetinari

Copy link
Copy Markdown
Member

As already discussed at length previously ( #73 ), will not be accepting a change to static linking

Pity you did not respond on the thread I opened in the core channel. I will bring this to a vote within core.

@jakirkham

Copy link
Copy Markdown
Member

If this is indeed an LLVM issue, let's file on that feedstock. If there was already an issue filed and it was closed prematurely, am happy to reopen (provided a link)

@timostrunk

Copy link
Copy Markdown
Contributor Author

This is the issue I opened on the llvm-dev feedstock: conda-forge/llvmdev-feedstock#312

@beckermr

Copy link
Copy Markdown
Member

I am happy to take on the maintenance burden of pushing updates to the llvm versions for static linking if that helps unblock this.

@beckermr

beckermr commented Feb 25, 2025

Copy link
Copy Markdown
Member

OK friends. I have another idea on how to ease the maintenance burden of static linking so we can unblock this PR.

I have added a new feature to the bot in this PR: conda-forge/conda-forge-bot#3755.

It allows the bot to update static libs according to an abstract spec as follows.

In the extra section you add this bit of yaml

extra:
  static_linking_host_requirements:
    - llvmdev 15.*
    - llvm 15.*

This specifies the abstract requirements for the static library you want in host.

Then in your recipe, you list the exact packages you care about like this

requirements:
  host:
    - python
    - setuptools
    - llvmdev 15.0.7 h2621b3d_4  # [osx and arm64]
    - llvm 15.0.7 h4a7a88c_4     # [osx and arm64]
    - llvmdev 15.0.7 hbedff68_4  # [osx and x86]
    - llvm 15.0.7 hed0f868_4     # [osx and x86]

The bot then does the following computation:

  1. For each abstract static lib host requirement in the extra section
  • Get the latest version for all platforms the feedstock builds
  • If the latest versions+build numbers across all platforms differ, bail on whole update.
  1. Extract the specs of packages in host in the current recipe that
  • match the abstract host requirements in the recipe per platform
  • are exact specs
  1. If any versions and/or build numbers at equal version have increased when comparing 1 and 2
  • update the static libs in the recipe
  • issue a PR

The result of this is an update to the host section like this:

requirements:
  host:
    - python
    - setuptools
    - llvmdev 15.0.7 h4429f82_5  # [osx and arm64]
    - llvm 15.0.7 h0cf516b_5     # [osx and arm64]
    - llvmdev 15.0.7 hc29ff6c_5  # [osx and x86]
    - llvm 15.0.7 hb21d583_5     # [osx and x86]
    - zlib
    - vs2015_runtime  # [win]

The bot stores the new static lib versions it used to update the feedstock as part of the PR info it has. This should prevent it from issuing duplicate PRs.

Further, by bailing if not all of the versions+build numbers of the new static libs match, we should prevent PRs being issued in the middle of a build of the static libs on the backend.

Finally, by restricting the search for updates to the static libs to the abstract specs in extra, the bot will only issue an update for increases in minor+patch versions and/or build numbers in the say 15.* series for llvm, etc.

If this is of interest to you all, let me know and I can finish up the bot PR, make a PR to this feedstock, and we can start trying it out.

cc @isuruf @h-vetinari @jakirkham @conda-forge/llvmlite

@h-vetinari

Copy link
Copy Markdown
Member

Thank you for trying to find a solution on this @beckermr! I'm fine with whatever setup that lets us get rid of the segfaults. To me the static_linking_host_requirements: machinery sounds a bit like overkill, but if it addresses the maintenance concerns here, then I'm happy to go in that direction.

@conda-forge-admin

conda-forge-admin commented Feb 27, 2025

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

  • ❌ In conda-forge.yml: $.bot = {'update_static_libs': True, 'abi_migration_branches': ['rc']}.

    {'update_static_libs': True, 'abi_migration_branches': ['rc']} is not valid under any of the given schemas

    Schema
    {
      "anyOf": [
        {
          "$ref": "#/$defs/BotConfig"
        },
        {
          "type": "null"
        }
      ]
    }

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13569596802. Examine the logs at this URL for more detail.

@beckermr

Copy link
Copy Markdown
Member

@isuruf ppc64le is not statically linked in these builds due to some error @timostrunk had when they tried above.

@beckermr

Copy link
Copy Markdown
Member

PR to fix the linter: conda-forge/conda-smithy#2253

@beckermr

Copy link
Copy Markdown
Member

OK. This one is all green or will be soon. I do not want to merge when another maintainer has requested changes.

@jakirkham Can you look at what we've done here and reconsider your review possibly? We've enabled fully automatic updates via the bot and I've added myself to the feedstock to manage things around that as I expect we'll encounter a few bugs along the way.

@beckermr

beckermr commented Mar 6, 2025

Copy link
Copy Markdown
Member

@conda-forge-admin relint

@conda-forge-admin

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

@timostrunk

Copy link
Copy Markdown
Contributor Author

Pinging @jakirkham again here to get the discussion rolling again. Could you please comment on whether you are ok with this setup?

Comment thread recipe/build.sh Outdated
Comment on lines +13 to +14
elif [[ "${target_platform}" == linux-ppc64le ]]; then
CXXFLAGS="${CXXFLAGS} -fplt"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isuruf Any idea why this was needed? Would be good to leave a comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this is because we normally pass -fno-plt in the CXXFLAGS. This actually done for all Linux architectures

However there have been some cases where this doesn't work on linux_ppc64le (notably with LLVM). Please see this bug:

Think Isuru is adding -fplt as a quick way of overriding the -fno-plt behavior. Adding -fplt is a bit quicker than what we normally do, which is remove the -fno-plt flag

Since we have figured out that this works, think we should adopt the syntax that we have elsewhere and remove -fno-plt from flags. This will also make it easier for future readers to find more context

Suggested change
elif [[ "${target_platform}" == linux-ppc64le ]]; then
CXXFLAGS="${CXXFLAGS} -fplt"
elif [[ "${target_platform}" == linux-ppc64le ]]; then
# Taken from llvmdev's recipe
# https://github.com/conda-forge/llvmdev-feedstock/blob/8c2c0f2db9db1fdf12289381dcee4e2d9a2e5fec/recipe/build.sh#L29-L33
# disable `-fno-plt` due to some GCC bug causing linker errors, see
# https://github.com/llvm/llvm-project/issues/51205
CFLAGS="$(echo $CFLAGS | sed 's/-fno-plt //g')"
CXXFLAGS="$(echo $CXXFLAGS | sed 's/-fno-plt //g')"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add have committed the syntax change above. Though leaving unresolved so the thread remains visible

@kkraus14

Copy link
Copy Markdown

Had a side chat related to this PR as well as consulted @gmarkall who is a numba and llvmlite maintainer:

@gmarkall

Copy link
Copy Markdown
Contributor

I'm still not 100% on why this is happening as the libLLVM-15 and libLLVM-19 symbols are both versioned with version specific identifiers.

I think only one version of the symbol can exist in a process - so, my understanding is that if a symbol has already been resolved with the LLVM 15-specific version, it's not going to be resolved again if an LLVM 19-specific caller has a relocation to a symbol of the same name that subsequently needs to be resolved. So symbol versioning doesn't help in our situation here.

@beckermr

Copy link
Copy Markdown
Member

After some sidebar conversations with various folks, I think we've reached a consensus of sorts.

The plan as I understand it is to:

  1. Merge this PR sometime in the next 1-2 weeks. Waiting on approval from @jakirkham IIUIC.
  2. Various folks working on numba, conda-forge, etc. will continue to investigate the LLVM issue.
  3. @jakirkham suggested to me offline that we can merge this PR and then push a test build number bump to llvm in order to test out the bot integrations. I am happy to baby sit the bot on that and iron our bugs if people have interest. This will need to wait until the week after next week, but that test should not block this PR.

Thanks everyone for working hard on this tricky issue!

@beckermr

Copy link
Copy Markdown
Member

Per further discussion, I will merge this PR on Monday, March 24, 2025. Happy weekend!

@jakirkham jakirkham left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the slow reply here

Met late last week with both Keith and Matt to discuss the changes and maintenance here

Agree that static linking is the least bad option we can come up with atm. So agree we should do that

There were a couple questions that had come up when we looked at some of the changes here. It took a bit longer to dig into these. Have made comments on them below

Also agree it would be great to have Matt on as a maintainer. Would like to add Keith as well (if he agrees). This should help with keeping up on changes here

Comment thread recipe/build.sh Outdated
Comment on lines +13 to +14
elif [[ "${target_platform}" == linux-ppc64le ]]; then
CXXFLAGS="${CXXFLAGS} -fplt"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this is because we normally pass -fno-plt in the CXXFLAGS. This actually done for all Linux architectures

However there have been some cases where this doesn't work on linux_ppc64le (notably with LLVM). Please see this bug:

Think Isuru is adding -fplt as a quick way of overriding the -fno-plt behavior. Adding -fplt is a bit quicker than what we normally do, which is remove the -fno-plt flag

Since we have figured out that this works, think we should adopt the syntax that we have elsewhere and remove -fno-plt from flags. This will also make it easier for future readers to find more context

Suggested change
elif [[ "${target_platform}" == linux-ppc64le ]]; then
CXXFLAGS="${CXXFLAGS} -fplt"
elif [[ "${target_platform}" == linux-ppc64le ]]; then
# Taken from llvmdev's recipe
# https://github.com/conda-forge/llvmdev-feedstock/blob/8c2c0f2db9db1fdf12289381dcee4e2d9a2e5fec/recipe/build.sh#L29-L33
# disable `-fno-plt` due to some GCC bug causing linker errors, see
# https://github.com/llvm/llvm-project/issues/51205
CFLAGS="$(echo $CFLAGS | sed 's/-fno-plt //g')"
CXXFLAGS="$(echo $CXXFLAGS | sed 's/-fno-plt //g')"

Comment thread recipe/meta.yaml
Comment thread recipe/meta.yaml
- marcelotrevisani
- xhochy
- mbargull
- beckermr

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kkraus14 is it ok if we add you to the maintainers here?

Suggested change
- beckermr
- beckermr
- kkraus14

@jakirkham

Copy link
Copy Markdown
Member

@conda-forge-admin , please re-render

Comment thread recipe/meta.yaml
@jakirkham jakirkham dismissed their stale review March 22, 2025 02:34

(submitted an updated review)

@beckermr

Copy link
Copy Markdown
Member

@conda-forge-admin rerender

@conda-forge-admin

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/14034002527. Examine the logs at this URL for more detail.

@beckermr

Copy link
Copy Markdown
Member

I am happy to add @kkraus14 in another PR if he'd like. Going to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Linking libllvm-19 before using libllvmlite (currently requiring libllvm-15) leads to segmentation faults.

8 participants