Llama 3.1 8b fp16 takes ~20 minutes to compile #19049

Open
aviator19941 opened this issue Nov 6, 2024 · 13 comments
Labels
bug 🐞 Something isn't working

Comments

@aviator19941
Contributor

aviator19941 commented Nov 6, 2024

What happened?

Llama 3.1 8b fp16 with decomposed (not sdpa) flash attention takes ~20 minutes to compile with iree-compiler==20241105.1069, but it took <1 minute to compile with iree-compiler==20241104.1068, and also before commit f71dd12.

Steps to reproduce your issue

  1. IR: https://gist.github.com/aviator19941/287a572dded1006f7bb82c8e9a5fe34d
  2. ../iree-build-no-trace/tools/iree-compile 8b_f16_decomposed.mlir --iree-hip-target=gfx942 --iree-hal-target-backends=rocm -o=test_decomposed_tom.vmfb
  3. Compilation to a .vmfb succeeds, but it takes ~19 minutes longer than it did previously

What component(s) does this issue relate to?

No response

Version information

41ed8c0

Additional context

No response

@aviator19941 aviator19941 added the bug 🐞 Something isn't working label Nov 6, 2024
@ScottTodd
Member

Thanks for the details in this report. Narrowing down the commit range helps quite a bit.

More tips: https://iree.dev/developers/debugging/compile-time-regressions/

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

I'm having similar problems with the 70b model currently; I'll upload the --mlir-timing output once it's done.

@benvanik
Collaborator

benvanik commented Nov 6, 2024

watch it be one of the compile performance optimizations that made it worse :P

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

Looks like the majority of the time is spent in OptimizeIntArithmetic (full report). It's probably not the most accurate report because I forgot to use --iree-opt-data-tiling=false, so there will be a bunch of extract_slices and dynamic dim stuff.

I'd be interested in trying to resolve this, but because this seems pretty blocking for llama dev, it might be better to have someone with more familiarity work on it.

@benvanik
Collaborator

benvanik commented Nov 6, 2024

heh, yeah, that'll do it

  1037.3731 ( 36.5%)  1033.0835 ( 73.0%)    'util.func' Pipeline
  1036.3686 ( 36.5%)  1032.6555 ( 73.0%)      OptimizeIntArithmetic
   344.6345 ( 12.1%)   344.3302 ( 24.3%)    'util.func' Pipeline
   344.3747 ( 12.1%)   344.1984 ( 24.3%)      OptimizeIntArithmetic

There may be some short-circuiting we can add to the analysis (avoid walking into linalg ops or something) but I'm not sure of the impact on the analysis results.

We may need to internally parallelize that pass given that these funcs are so big. mlir::parallelForEach / mlir::parallelFor & co. are often much easier to use than splitting an entire pass into several analysis steps with shared caches and such - especially when the analysis itself is likely the issue (we don't need multiple analysis passes to run concurrently, we need one pass to run concurrently on the IR it is scoped to).

Unless there are a few obvious standouts in a perf dump/Tracy capture with sampling enabled, we may have to turn it off for this model until the larger changes can be made. I'm not sure if we've already started needing the results for good codegen, though, so that's a risk.
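
As a rough illustration of the mlir::parallelForEach idea above (a minimal sketch only; the worklist collection and the processFunc body are hypothetical placeholders, not IREE's actual OptimizeIntArithmetic pass):

```cpp
// Minimal sketch of parallelizing per-function work inside one pass using
// MLIR's threading helpers (mlir/IR/Threading.h). The worklist and the
// processFunc body are hypothetical; this is not IREE's actual pass.
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Threading.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

static void processFunc(Operation *funcOp) {
  // ... expensive, independent per-function analysis would go here ...
}

static void runInParallel(ModuleOp module) {
  // Collect the roots to analyze (e.g. each function-like op in the module).
  SmallVector<Operation *> roots;
  for (Operation &op : module.getBody()->getOperations())
    roots.push_back(&op);

  // parallelForEach uses the context's thread pool and falls back to serial
  // execution when multithreading is disabled on the context.
  parallelForEach(module.getContext(), roots,
                  [](Operation *op) { processFunc(op); });
}
```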

@ScottTodd
Member

Here's the full commit range between the nightly releases you referenced: candidate-20241104.1068...candidate-20241105.1069

@benvanik
Collaborator

benvanik commented Nov 6, 2024

Odd - nothing stands out besides the integrate - most other changes were in codegen or runtime HAL API - I was expecting a flag flip.

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

This MLIR will repro the perf issue on TOM with iree-opt llama70b_f16.input.mlir --iree-util-optimize-int-arithmetic

@benvanik
Collaborator

benvanik commented Nov 6, 2024

hah, watch it be iree-org/llvm-project@3494ee9 - adding an assert to APInt will make LLVM even worse in debug builds, hooray~~~~ We should make sure asserts are disabled and try timing (and if that's the issue, push for it to be rolled back or put behind an opt-in aggressive flag - APInt can be constructed a bajillion times/sec and that assert path is not cheap).

I'd try going before/after the integrate and also comparing torch IR before/after (as there's both LLVM and torch in there).

@Groverkss
Contributor

> hah, watch it be iree-org/llvm-project@3494ee9 - adding an assert to APInt will make LLVM even worse in debug builds, hooray~~~~ We should make sure asserts are disabled and try timing (and if that's the issue, push for it to be rolled back or put behind an opt-in aggressive flag - APInt can be constructed a bajillion times/sec and that assert path is not cheap).
>
> I'd try going before/after the integrate and also comparing torch IR before/after (as there's both LLVM and torch in there).

That assert should really be behind LLVM_ENABLE_EXPENSIVE_CHECKS
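
For reference, the usual shape of that guard (a sketch of the pattern only, not APInt's actual constructor or assert) is to compile the costly check in only when LLVM is configured with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON, which defines EXPENSIVE_CHECKS:

```cpp
// Sketch of guarding a hot-path invariant check behind expensive checks.
// The SmallInt type and its check are illustrative only, not APInt's real
// constructor or assert.
#include <cassert>
#include <cstdint>
#include "llvm/Support/MathExtras.h"

struct SmallInt {
  unsigned BitWidth;
  uint64_t Val;

  SmallInt(unsigned bits, uint64_t value) : BitWidth(bits), Val(value) {
#ifdef EXPENSIVE_CHECKS
    // Kept out of ordinary asserts builds: value-type ctors like this can run
    // millions of times per second, and the check is not cheap.
    assert(llvm::isUIntN(BitWidth, Val) && "value does not fit in bit width");
#endif
  }
};
```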

@benvanik
Collaborator

benvanik commented Nov 6, 2024

yeah - definitely! it's really not good to have that in that value-type ctor

@IanWood1
Contributor

IanWood1 commented Nov 6, 2024

Apparently IntegerRangeAnalysis was looping over extremely large splat tensors. This fix brings total compilation time down from ~1400s to ~100s: llvm/llvm-project#115229

iree-org/llvm-project@3494ee9 could also be causing problems, so there might be more to do.
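
For context, the general shape of that kind of fix (a hedged sketch of the idea, not the actual upstream patch) is to treat a splat constant as a single value instead of walking every element when computing its range:

```cpp
// Sketch: deriving an integer range for a dense integer constant. Without the
// splat short-circuit, a splat tensor with billions of elements gets walked
// element by element; with it, the cost is O(1). Illustrative only, not the
// code changed in llvm/llvm-project#115229.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/Interfaces/InferIntRangeInterface.h"
#include "llvm/ADT/APInt.h"
#include <cassert>

using namespace mlir;

static ConstantIntRanges rangeForDenseConstant(DenseIntElementsAttr attr) {
  assert(attr.getNumElements() != 0 && "expected a non-empty constant");

  // Splat: one unique value, so the range is exact and cheap to compute.
  if (attr.isSplat())
    return ConstantIntRanges::constant(attr.getSplatValue<APInt>());

  // Non-splat: fold over the elements to find unsigned/signed min and max.
  auto values = attr.getValues<APInt>();
  APInt umin = *values.begin(), umax = umin, smin = umin, smax = umin;
  for (const APInt &v : values) {
    if (v.ult(umin)) umin = v;
    if (v.ugt(umax)) umax = v;
    if (v.slt(smin)) smin = v;
    if (v.sgt(smax)) smax = v;
  }
  return ConstantIntRanges(umin, umax, smin, smax);
}
```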

@benvanik
Collaborator

benvanik commented Nov 6, 2024

hah! nice find/fix ian!
