Skip to content

[BACKEND] Update LLVM version to llvm/llvm-project@9ddfe62#4338

Closed
ThomasRaoux wants to merge 5 commits intomainfrom
llvm-head
Closed

[BACKEND] Update LLVM version to llvm/llvm-project@9ddfe62#4338
ThomasRaoux wants to merge 5 commits intomainfrom
llvm-head

Conversation

@ThomasRaoux
Copy link
Copy Markdown
Collaborator

No description provided.

@gflegar
Copy link
Copy Markdown
Collaborator

gflegar commented Jul 19, 2024

@ThomasRaoux we can continue the discussion from #4212 here.

I'm trying to figure out at which point in our internally-known good LLVM commits the problem started happening. So I rebased the llvm-head branch on top of main and re-applied all the updates we've accumulated in the last few weeks one by one, so I get LLVM builds for all of them. I'll create draft PRs once that's done and we can see when it started failing, and merge the updates up to that point.

I can also try to get the latest LLVM version we have internally in the hopes that the issue got fixed in the meantime (will do that once all the builds are done).

In the meantime, I found where OpenXLA (which uses Triton as a dependecy) updated to this exact LLVM version that we know is causing problems in Triton, trying to see if we applied any patches on our side to make it work: openxla/xla@2b7f566
The patches we applied on top of LLVM do not seem relevant to Triton though. 🤷

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

@ThomasRaoux we can continue the discussion from #4212 here.

I'm trying to figure out at which point in our internally-known good LLVM commits the problem started happening. So I rebased the llvm-head branch on top of main and re-applied all the updates we've accumulated in the last few weeks one by one, so I get LLVM builds for all of them. I'll create draft PRs once that's done and we can see when it started failing, and merge the updates up to that point.

I can also try to get the latest LLVM version we have internally in the hopes that the issue got fixed in the meantime (will do that once all the builds are done).

In the meantime, I found where OpenXLA (which uses Triton as a dependecy) updated to this exact LLVM version that we know is causing problems in Triton, trying to see if we applied any patches on our side to make it work: openxla/xla@2b7f566 The patches we applied on top of LLVM do not seem relevant to Triton though. 🤷

Wow this is awesome, thanks a lot! I tried to debug it a bit locally, I realized it happens when I build locally on linux but on my mac when I build locally all the lit tests pass. So it may be some undefined behavior or something that shows up only on some build config :( Hopefully narrowing down the LLVM hash will help figure it out.

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

@gflegar sorry for jumping between different PRs for discussion :) So if I understand correctly this version fails and the version in between wouldn't fail llvm build on ARM?

@gflegar
Copy link
Copy Markdown
Collaborator

gflegar commented Jul 22, 2024

@gflegar sorry for jumping between different PRs for discussion :) So if I understand correctly this version fails and the version in between wouldn't fail llvm build on ARM?

No worries, it's a bit complicated with all the different PRs :)

This version passes LLVM builds, but fails on tests. Also, I created #4367, which is currently basically a copy of this PR, but from a separate, locked branch, since the contents of this PR would change any time we try to do something on the llvm-head branch (e.g., once Florian CCed below pushes the new LLVM version to the branch).
The version that fails MacoOS ARM LLVM builds is the one from #4366. I'm running tests on it now (since it does build on non-ARM Ubuntu) to see if we can also reproduce the failing tests there.

Also @reichlfl FYI take a look at discussions here, since you're our next Triton integrator at Google. The TL;DR is that we had CI issues for the last few weeks upstream, and now are trying to catch up, but the latest version of LLVM is causing test failures upstream, but not internally at Google. We're trying to root-cause it. You should still do the integration as normal, and create a PR for the latest LLVM we have internally, so we can check if the issue has maybe already been resolved upstream.

@gflegar
Copy link
Copy Markdown
Collaborator

gflegar commented Jul 22, 2024

Alright, it seems that #4366 is even more broken - plenty of tests failing with different errors, one among those is the <<UNKNOWN SSA VALUE>> we've seen in this PR, but that's not the only issue. So maybe the error is coming from somewhere between llvm/llvm-project@4713bd4 and llvm/llvm-project@de88b2c? (It is possible it's a completely unrelated thing, since more stuff is failing on #4366.)

Let's first see if all of this is fixed in the latest version of LLVM we'll roll out this week and save ourselves some time if it is :)

Wow this is awesome, thanks a lot! I tried to debug it a bit locally, I realized it happens when I build locally on linux but on my mac when I build locally all the lit tests pass. So it may be some undefined behavior or something that shows up only on some build config :( Hopefully narrowing down the LLVM hash will help figure it out.

My pleasure! Interesting, maybe it is UB. FWIW we do run Triton's lit tests internally through undefined behavior sanitizer, and that has not detected any issues with this LLVM version (but we do probably run a slightly different build configuration, since we don't use Triton's CMake build system and roll our own Bazel-like thing instead).

@karupayun
Copy link
Copy Markdown
Collaborator

karupayun commented Aug 2, 2024

The build starts to fail with "the << UNKNOWN SSA VALUE >> error" after this commit: llvm/llvm-project@ce80c80. It's quite a simple commit, maybe you have a better idea why using a non-deterministic seed could be causing the issue for you.

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

The build starts to fail with "the << UNKNOWN SSA VALUE >> error" after this commit: llvm/llvm-project@ce80c80. It's quite a simple commit, maybe you have a better idea why using a non-deterministic seed could be causing the issue for you.

my first guess is that this header is included in both triton and LLVM but the preprocessor takes a different path on each of those therefore there would be a mismatch in the seeds based on whether the function is built with Triton or with LLVM.
We could verify that by setting LLVM_ENABLE_ABI_BREAKING_CHECKS to 0 for LLVM build?

@gflegar
Copy link
Copy Markdown
Collaborator

gflegar commented Aug 5, 2024

That would explain why we're not seeing it internally. We build both LLVM and Triton as one big package, rather than separate binaries, so we wouldn't have this mismatch.

@karupayun karupayun force-pushed the llvm-head branch 2 times, most recently from 559090b to 61665fd Compare August 6, 2024 12:25
@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

BTW, looking at the cmake file of llvm it seems like it will override the pre-processor and the right way to set this cmake option:

set(LLVM_ABI_BREAKING_CHECKS "WITH_ASSERTS" CACHE STRING
  "Enable abi-breaking checks.  Can be WITH_ASSERTS, FORCE_ON or FORCE_OFF.")

here is the cmake logic:

string(TOUPPER "${LLVM_ABI_BREAKING_CHECKS}" uppercase_LLVM_ABI_BREAKING_CHECKS)

if( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "WITH_ASSERTS" )
  if( LLVM_ENABLE_ASSERTIONS )
    set( LLVM_ENABLE_ABI_BREAKING_CHECKS 1 )
  endif()
elseif( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "FORCE_ON" )
  set( LLVM_ENABLE_ABI_BREAKING_CHECKS 1 )
elseif( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "FORCE_OFF" )
  # We don't need to do anything special to turn off ABI breaking checks.
elseif( NOT DEFINED LLVM_ABI_BREAKING_CHECKS )
  # Treat LLVM_ABI_BREAKING_CHECKS like "FORCE_OFF" when it has not been
  # defined.
else()
  message(FATAL_ERROR "Unknown value for LLVM_ABI_BREAKING_CHECKS: \"${LLVM_ABI_BREAKING_CHECKS}\"!")
endif()

@karupayun karupayun force-pushed the llvm-head branch 2 times, most recently from 695ffec to 9267087 Compare August 8, 2024 14:12
@karupayun
Copy link
Copy Markdown
Collaborator

BTW, looking at the cmake file of llvm it seems like it will override the pre-processor and the right way to set this cmake option:

set(LLVM_ABI_BREAKING_CHECKS "WITH_ASSERTS" CACHE STRING
  "Enable abi-breaking checks.  Can be WITH_ASSERTS, FORCE_ON or FORCE_OFF.")

here is the cmake logic:

string(TOUPPER "${LLVM_ABI_BREAKING_CHECKS}" uppercase_LLVM_ABI_BREAKING_CHECKS)

if( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "WITH_ASSERTS" )
  if( LLVM_ENABLE_ASSERTIONS )
    set( LLVM_ENABLE_ABI_BREAKING_CHECKS 1 )
  endif()
elseif( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "FORCE_ON" )
  set( LLVM_ENABLE_ABI_BREAKING_CHECKS 1 )
elseif( uppercase_LLVM_ABI_BREAKING_CHECKS STREQUAL "FORCE_OFF" )
  # We don't need to do anything special to turn off ABI breaking checks.
elseif( NOT DEFINED LLVM_ABI_BREAKING_CHECKS )
  # Treat LLVM_ABI_BREAKING_CHECKS like "FORCE_OFF" when it has not been
  # defined.
else()
  message(FATAL_ERROR "Unknown value for LLVM_ABI_BREAKING_CHECKS: \"${LLVM_ABI_BREAKING_CHECKS}\"!")
endif()

Yes, thanks for pointing it out. Your suggestion works and tests passes when setting the flag. I did not have a lot of time this week but I will be looking next week on how to properly fix it.

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

Yes, thanks for pointing it out. Your suggestion works and tests passes when setting the flag. I did not have a lot of time this week but I will be looking next week on how to properly fix it.

Awesome! Thanks

@MaskRay
Copy link
Copy Markdown
Contributor

MaskRay commented Aug 8, 2024

Say my llvm-project build is at /tmp/Rel. /tmp/Rel/include/llvm/Config/abi-breaking.h defines LLVM_ABI_BREAKING_CHECKS.

llvm/include/llvm/ADT/Hashing.h includes abi-breaking.h and gets the LLVM_ABI_BREAKING_CHECKS value from the LLVM build. Whether or not the user build has a NDEBUG mismatch doesn't matter.

If there are Triton failures when the LLVM_ABI_BREAKING_CHECKS==1 code path is used, you might want to check code/tests overly depending on the determinism of llvm/include/llvm/ADT/Hashing.h, which is not guaranteed.

karupayun added a commit to openxla/triton that referenced this pull request Aug 13, 2024
This PR is setting DLLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how
we do it inside Google to be able to update llvm version in Triton.

The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I can work on replacing them (for
MapVector), to make the unit tests to pass. Let me know.
karupayun pushed a commit to openxla/triton that referenced this pull request Aug 13, 2024
This PR is setting DLLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how
we do it inside Google to be able to update llvm version in Triton.

The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I can work on replacing them (for
MapVector), to make the unit tests to pass. Let me know.
karupayun pushed a commit to openxla/triton that referenced this pull request Aug 13, 2024
…version

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how
we do it inside Google. In this way, we are forcing the structures
that relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them
(for MapVector), in order to fix all the failing unit tests, and then we
can set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".
karupayun pushed a commit to openxla/triton that referenced this pull request Aug 13, 2024
…version

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how
we do it inside Google. In this way, we are forcing the structures
that relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them
(for MapVector), in order to fix all the failing unit tests, and then we
can set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".
@karupayun
Copy link
Copy Markdown
Collaborator

Created #4512 to set LLVM_ABI_BREAKING_CHECKS=FORCE_OFF as it was suggested here. It's the same we are doing inside Google and I think that being able to update llvm is a priority.

If @ThomasRaoux is interested, in parallel we can replace some DenseMap (and DenseSet) by VectorMap (and VectorSet) to fix the failing unit tests. These structures come with performance overhead, so let me know if you are interested on doing that refactor. Alternative we can investigate what are the issues in the current uses (of DenseMap and DenseSet).

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

If @ThomasRaoux is interested, in parallel we can replace some DenseMap (and DenseSet) by VectorMap (and VectorSet) to fix the failing unit tests. These structures come with performance overhead, so let me know if you are interested on doing that refactor. Alternative we can investigate what are the issues in the current uses (of DenseMap and DenseSet).

As I mentioned on the other PR I don't understand what cases should be changed and why

@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

@karupayun which PR do you want to land for the LLVM upgrade?

ThomasRaoux pushed a commit that referenced this pull request Aug 16, 2024
…version (#4512)

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how we
do it inside Google. In this way, we are forcing the structures that
relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them (for
MapVector), in order to fix all the failing unit tests, and then we can
set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because it's just setting a variable
for LLVM.
- [x] I have not added any `lit` tests.

Co-authored-by: Tori Baker <vwbaker@google.com>
@ThomasRaoux
Copy link
Copy Markdown
Collaborator Author

Closing this, we can start a fresh LLVM upgrade now that the problem is fixed.

bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…version (triton-lang#4512)

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how we
do it inside Google. In this way, we are forcing the structures that
relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them (for
MapVector), in order to fix all the failing unit tests, and then we can
set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because it's just setting a variable
for LLVM.
- [x] I have not added any `lit` tests.

Co-authored-by: Tori Baker <vwbaker@google.com>
liuyunqi20 pushed a commit to flagos-ai/FlagTree that referenced this pull request Oct 21, 2025
…version (#4512)

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how we
do it inside Google. In this way, we are forcing the structures that
relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang/triton#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them (for
MapVector), in order to fix all the failing unit tests, and then we can
set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because it's just setting a variable
for LLVM.
- [x] I have not added any `lit` tests.

Co-authored-by: Tori Baker <vwbaker@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants