Conversation

@aengelke (Contributor) commented Dec 29, 2025

Building LLVM is slow and dominated by the C++ front-end; most time is spent (repeatedly) parsing headers. A C++ modules build didn't really help, is fragile, and kills parallelism. Therefore, I propose to use PCH for frequently used headers (i.e., the C++ stdlib, Support, IR, CodeGen). There shouldn't be much difference for incremental compilation, as many of the headers I put into the PCHs transitively get included into most source files anyway.
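For readers less familiar with CMake's PCH machinery, a rough sketch of the underlying mechanism (not the exact wiring in this PR; the umbrella header path matches the llvm/Support/pch.h mentioned later in the thread, the consuming target is just an illustrative example):

# Build one PCH for LLVMSupport from an umbrella header that pulls in the
# C++ stdlib plus frequently used Support headers.
target_precompile_headers(LLVMSupport PRIVATE
  "${LLVM_MAIN_INCLUDE_DIR}/llvm/Support/pch.h")

# Other libraries with compatible compile flags reuse that PCH instead of
# compiling their own (illustrative target name).
target_precompile_headers(LLVMBinaryFormat REUSE_FROM LLVMSupport)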

Time breakdown in seconds for building libLLVM.so (collected with -ftime-trace, -O1, -DLLVM_TARGETS_TO_BUILD="X86;AArch64"):

Phase                           main main+TPDE+PCH
------------------------------- ---- -------------
ExecuteCompiler                 9913          4084
Frontend                        7842          2773 (PCH difference)
  Source                        5471          1213
  PerformPendingInstantiations  1837           916
  CodeGen Function               248           278
Backend                         2014          1256
  Optimizer                     1332          1243 (probably just measurement noise)
  CodeGenPasses                  675             3 (TPDE fallbacks to LLVM for 10 CUs)
  TPDE                             0             7

wall-time [s] (48c/96t)       126.48         66.04
(AMD EPYC 9454P, 2.75-3.80 GHz, 372 GiB memory)

(Edit: new data with idle machine + assertions disabled, individual results are still a bit noisy due to hyper-threading; no substantial change in relative factors.) (NB: on this machine, much larger wall-time reductions are rather unlikely: SLPVectorizer.cpp takes >40s alone.)

The back-end is not really relevant here, but enabling PCH (this PR) reduces the front-end time by 2x. On compile-time-tracker (c-t-t), this shows up as a 30% wall-time/instructions reduction in stage2-clang. Further improvements for Clang should also be possible (clang/AST/Decl.h is expensive at 1219s), but this is not trivial due to the use of object libraries.

I'm opening this PR for early feedback/direction before spending more time on this:

  • Does this extensive PCH use have a reasonable chance of getting merged?
    • How to deal with newly occurring name collisions? (e.g., llvm::Reloc from llvm/Support/CodeGen.h vs. lld::macho::Reloc from lld/MachO/Relocations.h, where e.g. lld/MachO/InputSections.cpp uses both namespaces)
  • Enable by default vs. not? E.g., CI should probably not use this to catch missing includes, but CI should also keep it working as PCHs can lead to new errors.
  • PCH reuse is currently hard-coded in llvm_add_library; how can this be made more elegant? (One possible shape is sketched below.)
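One possible shape for that last point, purely as a sketch; the REUSE_PCH_FROM argument and LLVM_ENABLE_PCH option are hypothetical names, not what the PR currently does:

# Hypothetical: let each library declare which PCH it wants to reuse
# instead of hard-coding the mapping inside llvm_add_library().
function(llvm_add_library name)
  cmake_parse_arguments(ARG "" "REUSE_PCH_FROM" "" ${ARGN})
  # ... existing llvm_add_library() logic ...
  if(LLVM_ENABLE_PCH AND ARG_REUSE_PCH_FROM)
    target_precompile_headers(${name} REUSE_FROM ${ARG_REUSE_PCH_FROM})
  endif()
endfunction()

# Usage (hypothetical):
#   llvm_add_library(LLVMBinaryFormat ... REUSE_PCH_FROM LLVMSupport)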

cc @nikic @rnk @boomanaiden154 (not sure who else is interested in LLVM build times + looking at CMake)

@aengelke added the cmake (Build system in general and CMake in particular) and llvm (Umbrella label for LLVM issues) labels on Dec 29, 2025
github-actions bot commented Dec 29, 2025

🐧 Linux x64 Test Results

  • 196313 tests passed
  • 5947 tests skipped

✅ The build succeeded and all tests passed.

github-actions bot commented Dec 29, 2025

🪟 Windows x64 Test Results

  • 134818 tests passed
  • 4877 tests skipped

✅ The build succeeded and all tests passed.

@aengelke (Contributor, Author) commented Dec 29, 2025

FWIW, statistics for optimization passes taking >5s total (no adaptors, wrappers, etc.) on libLLVM compiled with -O1 (NB: pass times can include analyses):

   Total   NumRuns       PerRun  Pass
   5.01s    160476      31.19us  LoopUnrollPass
   5.35s   4404446       1.21us  TargetIRAnalysis
   6.94s   4403527       1.58us  AssumptionAnalysis
   7.10s    433061      16.41us  LoopRotatePass
   7.91s      3909    2023.59us  CallGraphAnalysis
   9.66s      1980    4877.52us  CalledValuePropagationPass
  10.19s      1980    5145.05us  AlwaysInlinerPass
  12.22s   5587932       2.19us  PostDominatorTreeAnalysis
  12.26s   2070891       5.92us  ReassociatePass
  13.61s   2070891       6.57us  ADCEPass
  13.87s    737972      18.80us  LICMPass
  13.90s   2070891       6.71us  BDCEPass
  15.25s    432540      35.25us  LoopDeletionPass
  15.41s   4462734       3.45us  LoopSimplifyPass
  17.77s   5938233       2.99us  LoopAnalysis
  18.65s      3960    4710.34us  GlobalOptPass
  19.69s  11044561       1.78us  DominatorTreeAnalysis
  20.63s    302139      68.27us  IndVarSimplifyPass
  23.03s   2070891      11.12us  MemCpyOptPass
  25.45s    302139      84.23us  LoopIdiomRecognizePass
  25.85s   4139632       6.24us  PostOrderFunctionAttrsPass
  28.08s   2070891      13.56us  SCCPPass
  30.01s      1980   15158.71us  IPSCCPPass
  34.19s   3495375       9.78us  BranchProbabilityAnalysis
  42.01s   2843306      14.77us  MemorySSAAnalysis
  49.53s   3495375      14.17us  BlockFrequencyAnalysis
  64.37s   6536108       9.85us  SROAPass
  82.56s   4304741      19.18us  EarlyCSEPass
  91.52s  13070009       7.00us  SimplifyCFGPass
 248.60s   2069816     120.11us  InlinerPass
 279.41s  10996635      25.41us  InstCombinePass
1274.61s      1980  643743.37us  Optimizer

aengelke added a commit that referenced this pull request Dec 30, 2025
…3869)

This avoids looking at the individual sources for mixed C/C++ libraries.

The previous code was written ~2014. Generator expressions were added in
CMake 3.3 (2015). We currently require CMake 3.20 and therefore can rely
on more modern features.

Apart from simplifying the code, this is preliminary work to make more
use of pre-compiled headers (#173868).
@aengelke (Contributor, Author) commented:
Another data point: I can now build LLVM (incl. tools) in 6-7 minutes on my laptop (M2 MacBook Air, 8 cores); building the unit tests takes another 2.5 minutes. Interestingly, with PCH, optimizations are comparably expensive here:

ExecuteCompiler                 3272
Frontend                        1533
  Source                         476
  PerformPendingInstantiations   570
  CodeGen Function               300
Backend                         1699
  Optimizer                     1056
    InlinerPass                  264 (90.27us per run; 2934128 runs)
    InstCombinePass              163 (10.69us per run; 15292003 runs)
    SimplifyCFGPass               67 (3.77us per run; 18032531 runs)
    EarlyCSEPass                  58 (9.87us per run; 5880387 runs)
    SROAPass                      51 (5.69us per run; 9020246 runs)
    BlockFrequencyAnalysis        47 (9.41us per run; 5048403 runs)
    BranchProbabilityAnalysis     33 (6.72us per run; 5048403 runs)
    AAManager                     30 (2.57us per run; 11970407 runs)
    MemorySSAAnalysis             29 (7.63us per run; 3921385 runs)
    IPSCCPPass                    28 (12019.39us per run; 2410 runs)
    PostOrderFunctionAttrsPass    23 (3.98us per run; 5868256 runs)
  CodeGenPasses                  639

wall-time (8c/8t) ~400s

Most expensive CUs >15s:

15.769s unittests/SandboxIR/SandboxIRTest.cpp
16.174s lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16.238s lib/Transforms/IPO/MemProfContextDisambiguation.cpp
16.457s unittests/Frontend/OpenMPIRBuilderTest.cpp
16.702s unittests/ADT/DenseMapTest.cpp
16.779s lib/Transforms/IPO/AttributorAttributes.cpp
16.920s lib/Target/AArch64/AArch64ISelLowering.cpp
28.694s tools/llvm-readobj/dir/ELFDumper.cpp
28.865s lib/Target/X86/X86ISelLowering.cpp
30.295s lib/Passes/PassBuilder.cpp
33.410s lib/Transforms/Vectorize/SLPVectorizer.cpp
54.741s unittests/Frontend/OpenMPDecompositionTest.cpp

@boomanaiden154 (Contributor) left a comment:

Not exactly sure how you need to invoke the clang driver to use precompiled headers, but it seems like sccache might not have support currently (mozilla/sccache#615). Might still be useful for local builds, but we would have to turn it off in premerge CI until it works with caching.

Either way, I would probably like to see a proper RFC on discourse for this.

@aengelke (Contributor, Author) commented:
RFC on Discourse

mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jan 6, 2026
aengelke added a commit that referenced this pull request Jan 7, 2026
Don't duplicate the EnumEntry type in llvm-objdump.

Spliced off from #173868, where this is required to avoid the name
collision.
github-actions bot commented Jan 7, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@mstorsjo (Member) commented Jan 8, 2026

FWIW, some test feedback on this patchset:

On Ubuntu 20.04, with the stock system libstdc++ 9, I get this:

In file included from <built-in>:1:
In file included from llvm-project/llvm/build-clang/third-party/unittest/CMakeFiles/llvm_gtest.dir/cmake_pch.hxx:6:
In file included from llvm-project/llvm/include/llvm/Support/pch.h:74:
In file included from /usr/lib/gcc/aarch64-linux-gnu/9/../../../../include/c++/9/execution:32:
In file included from /usr/lib/gcc/aarch64-linux-gnu/9/../../../../include/c++/9/pstl/glue_execution_defs.h:52:
In file included from /usr/lib/gcc/aarch64-linux-gnu/9/../../../../include/c++/9/pstl/algorithm_impl.h:25:
In file included from /usr/lib/gcc/aarch64-linux-gnu/9/../../../../include/c++/9/pstl/parallel_backend.h:14:
/usr/lib/gcc/aarch64-linux-gnu/9/../../../../include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: 'tbb/blocked_range.h' file not found
   19 | #include <tbb/blocked_range.h>
      |          ^~~~~~~~~~~~~~~~~~~~~

On Ubuntu 24.04 with GCC, I get lots of these warnings, and no measurable speedup:

[581/5825] Building CXX object utils/T...leGenCommon.dir/CodeGenRegisters.cpp.o
cc1plus: warning: llvm-project/llvm/build/lib/Support/CMakeFiles/LLVMSupport.dir/cmake_pch.hxx.gch: not used because `LLVM_BUILD_STATIC' is defined [-Winvalid-pch]

On Ubuntu 24.04 with Clang, I do see something of a speedup though.

@sharkautarch commented:
I think the simplest workaround for the 'tbb/blocked_range.h' error above with gcc + libstdc++ 9 would be to remove the #include <execution> line from llvm-project/llvm/include/llvm/Support/pch.h. The only place in llvm-project where the <execution> system header is included seems to be libcxx, so precompiling that header doesn't even help unless you're building libcxx anyway.

@aengelke (Contributor, Author) commented Jan 8, 2026

Thanks for checking GCC (14.2); I rarely use it for building LLVM. Reducing the number of headers seems to be important for GCC, and there's certainly room for improvement. Some observations from quick testing and profiling of GCC:

  • PCH are much larger than with Clang. Even after reducing the size of Support/pch.h, the Support PCH is still 132M (Clang: 24M).
  • As a consequence, max-rss also increases strongly (e.g., from 397M to 751M).
  • cc1plus system time increases strongly; on a random file, perf indicates >20% time spent in handling page faults (2x more than before), probably due to the large PCH.
  • Parsing time goes down substantially, but, according to perf on cc1plus, most of the improvement is negated by instantiate_pending_templates -> instantiate_decl, which consumes much more time in the PCH build (on a random file, this increases from 5% to 25% with almost identical total compile time).
  • -ftime-report is not useful for profiling, it strongly affects execution times. I don't know how to get more accurate information on what exactly causes GCC to slow down that much.

The warnings seem to only come from executables; most files in libraries are not affected.

@aengelke (Contributor, Author) commented Jan 9, 2026

I dumped all template instantiations GCC additionally performs when using PCH. These are around 50k additional instantiations performed just when compiling LLVMSupport.a: extra-templ-LLVMSupport.txt

Clang doesn't have this problem due to -fpch-instantiate-templates (a45f713). Maybe GCC could be changed to do something similar, I don't know.
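For completeness, CMake exposes this Clang flag through a target property (PCH_INSTANTIATE_TEMPLATES, available in reasonably recent CMake); a minimal sketch of making it explicit, assuming the LLVMSupport target carries the PCH:

# Instantiate templates while building the PCH rather than in every
# consuming TU (maps to Clang's -fpch-instantiate-templates; GCC has no
# equivalent, which matches the behaviour described above).
set_target_properties(LLVMSupport PROPERTIES PCH_INSTANTIATE_TEMPLATES ON)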

I don't think the GCC compile-time is fixable on our side. We might want to consider not using PCH with GCC. Here is data for a standard single-target LLVM release build:

  • Clang 21, PCH disabled: 133.24s (CPU time: 5497.8 usr, 297.6 sys)
  • Clang 21, PCH enabled: 83.49s (CPU time: 3034.2 usr, 194.4 sys)
  • GCC 15, PCH disabled: 141.94s (CPU time: 5232.6 usr, 559.8 sys)
  • GCC 15, PCH enabled: 137.58s (CPU time: 5094.6 usr, 710.4 sys)

PCH sizes are also much larger with GCC:

 91M Clang /lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/cmake_pch.hxx.pch
427M GCC   /lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/cmake_pch.hxx.gch
 55M Clang /lib/IR/CMakeFiles/LLVMCore.dir/cmake_pch.hxx.pch
268M GCC   /lib/IR/CMakeFiles/LLVMCore.dir/cmake_pch.hxx.gch
 25M Clang /lib/Support/CMakeFiles/LLVMSupport.dir/cmake_pch.hxx.pch
131M GCC   /lib/Support/CMakeFiles/LLVMSupport.dir/cmake_pch.hxx.gch
 30M Clang /third-party/unittest/CMakeFiles/llvm_gtest.dir/cmake_pch.hxx.pch
149M GCC   /third-party/unittest/CMakeFiles/llvm_gtest.dir/cmake_pch.hxx.gch

Build directory sizes:

1570M Clang
1772M Clang PCH
1846M GCC
2818M GCC PCH

cmake -DLLVM_ENABLE_PROJECTS="llvm" -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_BUILD_TYPE=Release -G Ninja -DLLVM_ENABLE_ASSERTIONS=OFF -DLLVM_USE_LINKER=lld
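Given the numbers above, one way to restrict PCH to Clang by default, as a sketch (LLVM_ENABLE_PCH is a hypothetical option name, not necessarily what this PR uses):

# Hypothetical top-level switch for the PCH support added in this PR.
option(LLVM_ENABLE_PCH "Use precompiled headers for core LLVM libraries" ON)

# GCC's PCHs are several times larger and show no wall-time win in the
# measurements above, so default them off for non-Clang compilers.
if(LLVM_ENABLE_PCH AND NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  message(STATUS "Precompiled headers disabled for ${CMAKE_CXX_COMPILER_ID}")
  set(LLVM_ENABLE_PCH OFF)
endif()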

Due to heavy use of using namespace llvm, Reloc is often ambiguous with
llvm::Reloc, the relocation model. Previously, this was sometimes
disambiguated with macho::Reloc. This ambiguity is even more problematic
when using pre-compiled headers, where it's no longer "obvious" whether
it should be Reloc or macho::Reloc.

Therefore, rename Reloc to Relocation. This is also consistent with
lld/ELF, where the type is also named Relocation.
@sharkautarch commented:
@aengelke one issue I've noticed with this PR: when building LLVM with RTTI enabled and also building Polly, Polly fails to compile. This is because Polly's CMakeLists.txt adds "-fno-rtti" to the C++ flags, apparently so that it's compatible with LLVM shared libs built without RTTI; see this 14-year-old commit from 2011: 0803071.

You should probably just exclude the use of PCH for Polly in this PR.
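One way to do that on the CMake side, as a sketch using CMake's DISABLE_PRECOMPILE_HEADERS target property (the target name here is illustrative; Polly defines several targets that would need the same treatment):

# Polly builds with -fno-rtti, so a PCH produced with RTTI enabled has
# mismatched flags; opt the Polly targets out of PCH entirely.
set_target_properties(Polly PROPERTIES DISABLE_PRECOMPILE_HEADERS ON)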

@aengelke (Contributor, Author) commented:
Thanks for catching this, I added a fix to #176420.

@pinskia commented Jan 16, 2026

Clang/LLVM CPU time: 5497.8 usr, 297.6 sys
vs.
GCC CPU time:        5232.6 usr, 559.8 sys

Hmm. So from the looks of it, Clang/LLVM could be optimized more. GCC is most likely spending more time in the system because it writes out the .s file, which gas then reads back in and assembles.
So GCC spends less time in user space to do the work, ~5% less if my math is correct, but 2x as much time in the kernel, which does correspond to writing out a .s file.
