Llama 3.1 8b fp16 takes ~20 minutes to compile #19049

Open
aviator19941 opened this issue Nov 6, 2024 · 13 comments
Labels
bug 🐞 Something isn't working

Comments

@aviator19941
Contributor

aviator19941 commented Nov 6, 2024

What happened?

Llama 3.1 8b fp16 with decomposed (not sdpa) flash attention takes ~20 minutes to compile with iree-compiler==20241105.1069, but it took <1 minute to compile with iree-compiler==20241104.1068, and also before commit f71dd12.

Steps to reproduce your issue

  1. IR: https://gist.github.com/aviator19941/287a572dded1006f7bb82c8e9a5fe34d
  2. ../iree-build-no-trace/tools/iree-compile 8b_f16_decomposed.mlir --iree-hip-target=gfx942 --iree-hal-target-backends=rocm -o=test_decomposed_tom.vmfb
  3. Compilation to a .vmfb succeeds, but it takes ~19 minutes longer than it did previously

What component(s) does this issue relate to?

No response

Version information

41ed8c0

Additional context

No response

@aviator19941 aviator19941 added the bug 🐞 Something isn't working label Nov 6, 2024
@ScottTodd
Member

Thanks for the details in this report. Narrowing down the commit range helps quite a bit.

More tips: https://iree.dev/developers/debugging/compile-time-regressions/

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

I'm having similar problems with the 70b model currently; I'll upload the --mlir-timing output once it's done.

@benvanik
Collaborator

benvanik commented Nov 6, 2024

watch it be one of the compile performance optimizations that made it worse :P

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

Looks like the majority of the time is spent in OptimizeIntArithmetic (full report). It's probably not the most accurate report because I forgot to use --iree-opt-data-tiling=false, so there will be a bunch of extract_slices and dynamic dim stuff.

I'd be interested in trying to resolve this, but because this seems pretty blocking for llama dev, it might be better to have someone with more familiarity work on it.

@benvanik
Collaborator

benvanik commented Nov 6, 2024

heh, yeah, that'll do it

  1037.3731 ( 36.5%)  1033.0835 ( 73.0%)    'util.func' Pipeline
  1036.3686 ( 36.5%)  1032.6555 ( 73.0%)      OptimizeIntArithmetic
   344.6345 ( 12.1%)   344.3302 ( 24.3%)    'util.func' Pipeline
   344.3747 ( 12.1%)   344.1984 ( 24.3%)      OptimizeIntArithmetic

There may be some short-circuiting we can add to the analysis (avoid walking into linalg ops or something) but I'm not sure of the impact on the analysis results.

We may need to internally parallelize that pass given that these funcs are so big. mlir::parallelForEach / mlir::parallelFor & co. are often much easier to use than splitting an entire pass into several analysis steps with shared caches and such - especially when the analysis itself is likely the issue (we don't need multiple analysis passes to run concurrently, we need one pass to run concurrently on the IR it is scoped to).

Unless there are a few obvious standouts in a perf dump/Tracy capture with sampling enabled, we may have to turn it off for this model until the larger changes can be made. I'm not sure if we've already started needing the results for good codegen, though, so that's a risk.
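
As a rough illustration of the mlir::parallelForEach idea above (a minimal sketch only; the worklist collection and the processFunc body are hypothetical placeholders, not IREE's actual OptimizeIntArithmetic pass):

```cpp
// Minimal sketch of parallelizing per-function work inside one pass using
// MLIR's threading helpers (mlir/IR/Threading.h). The worklist and the
// processFunc body are hypothetical; this is not IREE's actual pass.
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Threading.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

static void processFunc(Operation *funcOp) {
  // ... expensive, independent per-function analysis would go here ...
}

static void runInParallel(ModuleOp module) {
  // Collect the roots to analyze (e.g. each function-like op in the module).
  SmallVector<Operation *> roots;
  for (Operation &op : module.getBody()->getOperations())
    roots.push_back(&op);

  // parallelForEach uses the context's thread pool and falls back to serial
  // execution when multithreading is disabled on the context.
  parallelForEach(module.getContext(), roots,
                  [](Operation *op) { processFunc(op); });
}
```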

@ScottTodd
Member

Here's the full commit range between the nightly releases you referenced: candidate-20241104.1068...candidate-20241105.1069

@benvanik
Collaborator

benvanik commented Nov 6, 2024

Odd - nothing stands out besides the integrate - most other changes were in codegen or runtime HAL API - I was expecting a flag flip.

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

This MLIR will repro the perf issue on TOM with iree-opt llama70b_f16.input.mlir --iree-util-optimize-int-arithmetic

@benvanik
Collaborator

benvanik commented Nov 6, 2024

hah, watch it be iree-org/llvm-project@3494ee9 - adding an assert to APInt will make LLVM even worse in debug builds, hooray~~~~ We should make sure asserts are disabled and try timing (and if that's the issue, push for it to be rolled back or put behind an opt-in aggressive flag - APInt can be constructed a bajillion times/sec and that assert path is not cheap).

I'd try going before/after the integrate and also comparing torch IR before/after (as there's both LLVM and torch in there).

@Groverkss
Contributor

> hah, watch it be iree-org/llvm-project@3494ee9 - adding an assert to APInt will make LLVM even worse in debug builds, hooray~~~~ We should make sure asserts are disabled and try timing (and if that's the issue, push for it to be rolled back or put behind an opt-in aggressive flag - APInt can be constructed a bajillion times/sec and that assert path is not cheap).
>
> I'd try going before/after the integrate and also comparing torch IR before/after (as there's both LLVM and torch in there).

That assert should really be behind LLVM_ENABLE_EXPENSIVE_CHECKS
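
For reference, the usual shape of that guard (a sketch of the pattern only, not APInt's actual constructor or assert) is to compile the costly check in only when LLVM is configured with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON, which defines EXPENSIVE_CHECKS:

```cpp
// Sketch of guarding a hot-path invariant check behind expensive checks.
// The SmallInt type and its check are illustrative only, not APInt's real
// constructor or assert.
#include <cassert>
#include <cstdint>
#include "llvm/Support/MathExtras.h"

struct SmallInt {
  unsigned BitWidth;
  uint64_t Val;

  SmallInt(unsigned bits, uint64_t value) : BitWidth(bits), Val(value) {
#ifdef EXPENSIVE_CHECKS
    // Kept out of ordinary asserts builds: value-type ctors like this can run
    // millions of times per second, and the check is not cheap.
    assert(llvm::isUIntN(BitWidth, Val) && "value does not fit in bit width");
#endif
  }
};
```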

@benvanik
Collaborator

benvanik commented Nov 6, 2024

yeah - definitely! it's really not good to have that in that value-type ctor

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

Apparently IntegerRangeAnalysis was looping over extremely large splat tensors. This fix brings total compilation time down from ~1400s to ~100s: llvm/llvm-project#115229

iree-org/llvm-project@3494ee9 could also be causing problems, so there might be more to do.
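
For context, the general shape of that kind of fix (a hedged sketch of the idea, not the actual upstream patch) is to treat a splat constant as a single value instead of walking every element when computing its range:

```cpp
// Sketch: deriving an integer range for a dense integer constant. Without the
// splat short-circuit, a splat tensor with billions of elements gets walked
// element by element; with it, the cost is O(1). Illustrative only, not the
// code changed in llvm/llvm-project#115229.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/Interfaces/InferIntRangeInterface.h"
#include "llvm/ADT/APInt.h"
#include <cassert>

using namespace mlir;

static ConstantIntRanges rangeForDenseConstant(DenseIntElementsAttr attr) {
  assert(attr.getNumElements() != 0 && "expected a non-empty constant");

  // Splat: one unique value, so the range is exact and cheap to compute.
  if (attr.isSplat())
    return ConstantIntRanges::constant(attr.getSplatValue<APInt>());

  // Non-splat: fold over the elements to find unsigned/signed min and max.
  auto values = attr.getValues<APInt>();
  APInt umin = *values.begin(), umax = umin, smin = umin, smax = umin;
  for (const APInt &v : values) {
    if (v.ult(umin)) umin = v;
    if (v.ugt(umax)) umax = v;
    if (v.slt(smin)) smin = v;
    if (v.sgt(smax)) smax = v;
  }
  return ConstantIntRanges(umin, umax, smin, smax);
}
```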

@benvanik
Collaborator

benvanik commented Nov 6, 2024

hah! nice find/fix ian!
