
trans/LLVM: Don't keep all LLVM modules in memory at the same time #39280

Closed
michaelwoerister opened this issue Jan 24, 2017 · 32 comments
Labels
  • A-codegen (Area: Code generation)
  • A-concurrency (Area: Concurrency)
  • A-incr-comp (Area: Incremental compilation)
  • C-cleanup (Category: PRs that clean code up or issues documenting cleanup.)
  • E-mentor (Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.)
  • I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.)
  • I-compiletime (Issue: Problems and improvements with respect to compile times.)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@michaelwoerister
Member

michaelwoerister commented Jan 24, 2017

UPDATE: There are some mentoring instructions below if you're interested in looking into this issue.


NB: Part of the roadmap issue on incremental compilation.

Currently the compiler translates all codegen units into LLVM modules and then runs LLVM on all of them in parallel. This can mean a lot of memory pressure, both in the context of incremental compilation, where there can be hundreds of codegen units for a single crate, and in non-incremental builds with a high number of codegen units -- e.g. for Servo's script crate that's more than 10 GB of peak memory usage (while otherwise it's around 4.5 GB).

There's no real need to keep more LLVM modules in memory than are currently being worked on. Two possible solutions:

  1. If the compiler uses N LLVM threads, start persisting LLVM modules to disk once the (N+1)st has been translated, and reload them later for optimization and codegen. That's probably relatively easy to implement but involves more disk I/O than necessary (a minimal sketch of the scheduling follows after this list).
  2. Translate only N codegen units to LLVM modules at a time, run each of them all the way to an object file, then go back and translate the next codegen unit. This has the advantage that we would not need to temporarily store anything on disk, but we would need to keep the tcx in memory as long as there are still untranslated codegen units. It would also require a much bigger refactoring than the first approach.
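
For illustration, here is a minimal sketch of the first option's scheduling, modeling a serialized module as plain bytes; the names, the /tmp spill path, and optimize_and_codegen are placeholders, the N worker threads are elided, and real bitcode I/O would go through LLVM's own APIs:

use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

// Stand-in for a translated LLVM module: serialized bitcode is just bytes.
type Bitcode = Vec<u8>;

// Keep at most `n` modules resident; spill the rest to disk as bitcode
// and reload each one only when a slot frees up.
fn run_llvm_with_spilling(modules: Vec<(String, Bitcode)>, n: usize) {
    let mut resident: VecDeque<(String, Bitcode)> = VecDeque::new();
    let mut spilled: VecDeque<(String, PathBuf)> = VecDeque::new();

    for (name, bitcode) in modules {
        if resident.len() < n {
            resident.push_back((name, bitcode));
        } else {
            // From the (N+1)st module on, persist to disk instead.
            let path = PathBuf::from(format!("/tmp/{}.bc", name));
            fs::write(&path, &bitcode).expect("failed to spill module");
            spilled.push_back((name, path));
        }
    }

    while let Some((name, bitcode)) = resident.pop_front() {
        optimize_and_codegen(&name, &bitcode);
        // A slot is free again: reload the next spilled module, if any.
        if let Some((name, path)) = spilled.pop_front() {
            let bitcode = fs::read(&path).expect("failed to reload module");
            resident.push_back((name, bitcode));
        }
    }
}

fn optimize_and_codegen(name: &str, _bitcode: &[u8]) {
    println!("optimizing and emitting object code for {}", name);
}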

Any other ideas?

cc @rust-lang/compiler @rust-lang/tools @nagisa

@michaelwoerister michaelwoerister added A-codegen Area: Code generation A-compiler A-concurrency Area: Concurrency A-incr-comp Area: Incremental compilation I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 24, 2017
@nrc
Member

nrc commented Jan 24, 2017

Will this actually cause problems? Unless there are too many threads (in which case that should be addressed), it feels like the OS will handle this better by paging to disk than we will by (effectively) doing so manually. Also, for many devs, 10GB is not too much memory usage and anything we do here (unless we're really smart) will slow down compilation for those users.

@alexcrichton
Member

My first thought would be the latter: consider translation part of a "work unit" and don't actually start it until it's ready to go all the way through LLVM. That way we cap the number of active LLVM modules at any one point in time.

I think @nrc's got a good point, though, in that using the disk as a cache may be better left to the OS rather than the compiler itself (until proven otherwise, of course).

@michaelwoerister
Member Author

michaelwoerister commented Jan 25, 2017

Some data points:

Will this actually cause problems?

It led to my not being able to compile the script crate on Linux with 8 GB. Maybe if I had more swap space it would have worked. (EDIT: The VM had 4 GB of swap space.)

Unless there are too many threads (in which case that should be addressed)

If everything fits into memory at once, having hundreds of parallel LLVM threads does not seem to have a detrimental effect (running 140 threads was a few percent faster than having 4 threads on a 4-core processor). OTOH, I would suspect that paging would completely destroy performance if many threads had to fight for RAM.

Also, for many devs, 10GB is not too much memory usage and anything we do here (unless we're really smart) will slow down compilation for those users.

That's certainly a valid concern!

@retep998
Member

If you really want to allow the OS to do a good job of paging out memory, you need to ensure that modules don't share pages with each other. If a given module is spread out across your memory with lots of fragmentation, then you'll have to have a lot more memory paged in than necessary, reducing the effectiveness of the OS paging stuff out. Also, there are ways to hint to the OS that you don't need some memory at the moment, so that it is more willing to page it out sooner in favor of keeping other pages resident.
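
For illustration, a minimal sketch of such a hint on a Unix target, assuming the libc crate; the 16 MiB arena size is arbitrary, and MADV_DONTNEED discards the contents of anonymous memory, so this particular flag only fits once a module's object file has already been emitted:

use std::ptr;

fn main() {
    // One module's arena: mmap returns page-aligned memory, so the module
    // does not share pages with unrelated allocations.
    let len = 16 * 1024 * 1024;
    unsafe {
        let arena = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(arena, libc::MAP_FAILED);

        // ... build and codegen the module inside `arena` ...

        // Once the object file is written the contents are dead weight:
        // this lets the kernel reclaim the pages immediately instead of
        // having to page them out.
        libc::madvise(arena, len, libc::MADV_DONTNEED);

        libc::munmap(arena, len);
    }
}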

@michaelwoerister
Member Author

It seems to me like letting the OS take care of this via paging is really the most efficient and most easily implemented solution. But I'm also a bit concerned about rustc getting a reputation for being a memory hog, even if unfairly. High memory usage also means fewer crates can be compiled by a 32-bit compiler (although the script crate, for example, is already above 2 GB before trans, I think).

@hanna-kruppe
Contributor

hanna-kruppe commented Jan 25, 2017

Would option 2 actually be any slower or otherwise less efficient? It doesn't seem to do any additional work. The tcx would remain in memory longer, but since (in the case of huge crates with many modules) the proposed change saves a lot of memory on LLVM modules, that shouldn't hurt much, if at all. Besides, in the easy case (enough threads to trans all CGUs in parallel), the tcx can still be dropped as soon as the IR has been generated and before the LLVM passes run.

As @retep998 outlines, it is not guaranteed that the OS will actually do a perfect job of paging modules in and out. So ignoring option 1, which does a lot of unnecessary disk I/O, the choice seems to be between doing the same work and either:

  • allocating all the memory up-front and praying that the OS handles it gracefully, or
  • allocating and freeing memory as needed, hopefully avoiding the need for paging in the first place.

@nikomatsakis
Contributor

I'm a bit confused. It seems to me that there are three things that need to be done:

  • Translation from HIR to LLVM IR.
  • Optimization and transformation of LLVM IR.
  • Generate .o file.

Currently, we do the first one for everything. Then we start some threads to do the second one -- at what point do we do the third one? (I can't remember.) In particular, do we wait for all the threads to finish?

@michaelwoerister
Member Author

at what point do we do the third one?

That's done by the last LLVM pass, IIRC.

@nikomatsakis
Contributor

@michaelwoerister so, in that case, I disagree with the assessment of "Let the O/S do it". In particular, it seems likely that "optimization and transformation of the LLVM IR" takes a lot of (temporary) memory. If we start N threads (one per CGU) and those are all going simultaneously, that's going to lead to a larger peak than is necessary. But if we throttle starting at phase 2, this will reduce the peak, right?

In particular, the O/S is good at paging memory that is no longer used, but if we start N threads, all that memory is in active use ("working set" is large). If we throttle, working set is smaller, and we can free as we finish each item (since .o file is generated), no?

It might reduce the peak further to do the full pipeline one CGU at a time, but maybe not, because as it is we can free the HIR and tcx entirely once trans is done (and I believe we do so).

@michaelwoerister
Member Author

I would limit the number of concurrent threads to something around the number of CPU cores, since there is no real reason to have one thread per CGU. That would limit the working set. If we keep spawning as many threads as there are CGUs, then yes, paging won't be efficient.

but maybe not, because as it is we can free the HIR and tcx entirely once trans is done (and I believe we do so).

Yes, that is the main disadvantage of that approach. However, as @rkruppe noted, we can free the tcx once the last LLVM module has been generated. If you have a low number of codegen units (as is common in the non-incremental case), then we can free the tcx pretty early. If you have a high number of CGUs, then chances are good that the amount of memory needed per LLVM module is low, so keeping N small LLVM modules in addition to the tcx might not be a big deal.

One advantage of approach (2) is that translation and LLVM can run in parallel for much of the time.

trans  |--CGU1---|--CGU2----|--CGU3--|--CGU4---|--CGU5----|
LLVM 1           |--CGU1----------|            |--CGU4-------|
LLVM 2                      |--CGU2----------|            |--CGU5-----|
LLVM 3                               |--CGU3----------|
-------------------------------------------------------------------------------->

vs

trans  |--CGU1---|--CGU2----|--CGU3--|--CGU4---|--CGU5----|
LLVM 1                                                     |--CGU1----------||--CGU4-------|                      
LLVM 2                                                     |--CGU2----------||--CGU5-----|
LLVM 3                                                     |--CGU3----------|                                 
---------------------------------------------------------------------------------------------->

Trans would have to be throttled, though, so that the queue does not fill up and consume all the memory again.

@nikomatsakis
Contributor

@michaelwoerister that seems reasonable. Though I think that restructuring in this way is a mild amount of work, whereas limiting the number of worker threads is very easy, right?

In any case, I imagine we would basically start N threads all pulling from a single fixed-length channel. Each item they pull out is an LLVM module, so they would process it. Meanwhile, trans would be doing a loop like this:

// std's bounded channel: `send` blocks while the buffer is full
let (tx, rx) = std::sync::mpsc::sync_channel(1);
let workers = start_worker_threads(rx); // N threads pulling LLVM modules
for item in work_items {
    let llvm_ir = translate(item);
    tx.send(llvm_ir).unwrap(); // will block if the channel is full
}
drop(tx); // close the channel so the workers drain it and exit
free_tcx_etc();
join_threads(workers);

Right? Seems nice.
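
For completeness, a self-contained sketch of both halves using only std; the worker count, CGU names, and printouts are placeholders, and since std's Receiver is not Sync the workers share it behind an Arc<Mutex<..>> (a real implementation might well use a different queue):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    // Bounded channel of length 1: `send` blocks while the buffer is full.
    let (tx, rx) = mpsc::sync_channel::<String>(1);
    let rx = Arc::new(Mutex::new(rx));

    // N worker threads, all pulling LLVM "modules" off the same queue.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only long enough to receive one item.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok(module) => println!("worker {} optimizing {}", id, module),
                    Err(_) => break, // sender dropped: queue drained, exit
                }
            })
        })
        .collect();

    for cgu in ["cgu1", "cgu2", "cgu3", "cgu4", "cgu5"] {
        // Stand-in for translate(item); blocks while a module is queued.
        tx.send(format!("LLVM IR for {}", cgu)).unwrap();
    }
    drop(tx); // free_tcx_etc() would go here; dropping tx closes the channel

    for w in workers {
        w.join().unwrap();
    }
}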

@nikomatsakis
Contributor

(Of course we could make the fixed-length channel longer, too; any bounded length will serve to throttle.)

@michaelwoerister
Member Author

Yes, that's pretty much how I imagined it.

@michaelwoerister
Member Author

Though I think that restructuring in this way is a mild amount of work, whereas limiting number of worker threads is very easy, right?

Yes.

frewsxcv added a commit to frewsxcv/rust that referenced this issue Feb 7, 2017
…, r=nikomatsakis

back: Limit the number of LLVM worker threads.

This should fix issue rust-lang#39568.
Also see rust-lang#39280.

r? @nikomatsakis
@SimonSapin
Contributor

I hit this independently when experimenting with incremental compilation on Servo’s CI: servo/servo#15565 (comment). On a builder with 8 GB of RAM and no swap, rustc crashed with:

fatal runtime error: out of memory
error: Could not compile `script`.

@nikomatsakis
Contributor

This feels like it would be a pretty easy thing to fix. I'll try to either take a crack at it or write up some mentoring instructions tomorrow. I am behind on incremental PRs -- I had hoped to get to it today, but spent the whole day reading notifications instead. Yay for 3-day weekends! :)

@SimonSapin
Contributor

@nikomatsakis Done.

servo/servo#15565 (comment)
http://build.servo.org/builders/arm32/builds/5827/steps/compile/logs/stdio

time: 0.392; rss: 97MB	parsing
time: 0.000; rss: 97MB	recursion limit
time: 0.000; rss: 97MB	crate injection
time: 0.001; rss: 98MB	plugin loading
time: 0.000; rss: 98MB	plugin registration
time: 4.876; rss: 650MB	expansion
time: 0.000; rss: 650MB	maybe building test harness
time: 0.038; rss: 650MB	maybe creating a macro crate
time: 0.000; rss: 650MB	checking for inline asm in case the target doesn't support it
time: 0.188; rss: 650MB	early lint checks
time: 0.043; rss: 650MB	AST validation
time: 0.947; rss: 868MB	name resolution
time: 0.111; rss: 868MB	complete gated feature checking
time: 0.684; rss: 1103MB	lowering ast -> hir
time: 0.107; rss: 1017MB	indexing hir
time: 0.039; rss: 1017MB	attribute checking
time: 0.026; rss: 806MB	language item collection
time: 0.129; rss: 815MB	lifetime resolution
time: 0.000; rss: 815MB	looking for entry point
time: 0.003; rss: 815MB	looking for plugin registrar
time: 0.452; rss: 898MB	region resolution
time: 0.075; rss: 898MB	loop checking
time: 0.053; rss: 898MB	static item recursion checking
time: 1.298; rss: 948MB	compute_incremental_hashes_map
time: 0.000; rss: 948MB	load_dep_graph
time: 0.138; rss: 963MB	stability index
time: 0.331; rss: 990MB	stability checking
time: 0.548; rss: 1063MB	type collecting
time: 0.208; rss: 1067MB	variance inference
time: 0.220; rss: 1124MB	impl wf inference
time: 0.572; rss: 1188MB	coherence checking
time: 0.726; rss: 1193MB	wf checking
time: 2.519; rss: 1262MB	item-types checking
time: 15.323; rss: 1806MB	item-bodies checking
time: 0.000; rss: 1806MB	drop-impl checking
time: 1.276; rss: 1906MB	const checking
time: 1.114; rss: 1906MB	privacy checking
time: 0.283; rss: 1911MB	intrinsic checking
time: 0.114; rss: 1911MB	effect checking
time: 0.417; rss: 1917MB	match checking
time: 0.196; rss: 1919MB	liveness checking
time: 0.928; rss: 1924MB	rvalue checking
time: 2.293; rss: 2628MB	MIR dump
  time: 0.297; rss: 2649MB	SimplifyCfg
  time: 0.474; rss: 2650MB	QualifyAndPromoteConstants
  time: 0.601; rss: 2650MB	TypeckMir
  time: 0.039; rss: 2650MB	SimplifyBranches
  time: 0.136; rss: 2650MB	SimplifyCfg
time: 1.547; rss: 2650MB	MIR cleanup and validation
time: 1.984; rss: 2658MB	borrow checking
time: 0.023; rss: 2659MB	reachability checking
time: 0.328; rss: 2662MB	death checking
time: 0.000; rss: 2662MB	unused lib feature checking
time: 2.008; rss: 2662MB	lint checking
time: 0.000; rss: 2662MB	resolving dependency formats
  time: 0.033; rss: 2662MB	NoLandingPads
  time: 0.133; rss: 2662MB	SimplifyCfg
  time: 0.373; rss: 2678MB	EraseRegions
  time: 0.069; rss: 2678MB	AddCallGuards
  time: 0.991; rss: 2713MB	ElaborateDrops
  time: 0.033; rss: 2713MB	NoLandingPads
  time: 0.282; rss: 2714MB	SimplifyCfg
  time: 0.185; rss: 2714MB	InstCombine
  time: 0.082; rss: 2714MB	Deaggregator
  time: 0.038; rss: 2714MB	CopyPropagation
  time: 0.241; rss: 2715MB	SimplifyLocals
  time: 0.092; rss: 2717MB	AddCallGuards
  time: 0.033; rss: 2717MB	PreTrans
time: 2.586; rss: 2717MB	MIR optimisations
  time: 0.683; rss: 2819MB	write metadata
  time: 5.005; rss: 2895MB	translation item collection
  time: 1.267; rss: 2939MB	codegen unit partitioning
  time: 0.674; rss: 5032MB	internalize symbols
time: 47.199; rss: 5033MB	translation
time: 0.001; rss: 5033MB	assert dep graph
fatal runtime error: out of memory
error: Could not compile `script`.

@michaelwoerister
Member Author

Just to follow up: the simple "fix" I implemented will only make swapping less of a problem, since fewer threads will compete for the physical memory available. If (RAM + swap space) is less than the total memory needed by rustc, we still expect to crash.

@sophiajt
Contributor

It's been a while since this issue has been updated. What's the current status?

@nikomatsakis
Contributor

Some refactoring has been done, but the main change remains undone. I've been meaning for a while now to write up a few notes and try to mentor a PR for this.

@nikomatsakis
Contributor

OK, so, I think this would be an awesome bug for someone who is interested in helping out with incremental. It's going to take a certain amount of refactoring, but a lot of it is done, I think. To be honest, I'm not 100% sure what the best path is here, so let me lay out a few thoughts and see if anyone is interested in picking it up.

The big goal

Right now, we do the following to generate code:

  1. loop over all Rust "codegen-units" and generate the LLVM bitcode for all of them
  2. in non-incremental mode, do a little reachability optimization I'll discuss in a bit
  3. free the "tcx" (type context), which contains the HIR, types, and a bunch of compiler state
    • the previous bits of code were part of "phase 4" in the driver, which itself was part of "phase 3"
    • once we return from the phase 3 function, the tcx will have been freed and will no longer exist
  4. start up N threads and parcel out those codegen-units to the various threads, letting them do optimizations

The problem is that storing the LLVM bitcode for ALL the modules leads to a huge memory spike. This in turn leads to slow compilation. How slow? Hard to say. I see that with incremental compilation the base cost of building the script crate is 2x that of a non-incremental build (at least if I remember correctly), but I'm not sure how much of that is attributable to this spike in memory usage. Regardless, it's a problem.

One (relatively minor) complication

One complication is step 2 in the list above. If we are not in incremental mode, then we have this bit of code that figures out which functions wind up getting used by other codegen-units and marks them as "external", and the rest as "internal". Marking things as internal permits more optimization. We don't do this in incremental mode, though, because we want to allow other modules to be updated independently.

The problem is that this pass inherently requires us to have built all codegen-units so we can walk them, and we have to do that before we do any optimizations. It's sort of the opposite of what we want.

There are two ways to fix this. The right way, and the wrong way. The right way is probably to modify the "trans collector" (which is the thing that creates the codegen-units in the first place) and have IT figure out what must be marked as private/public. This would allow us to do the optimization even in incremental mode (though we might choose not to), since we could easily detect when things have changed (i.e., when something must now be public that was private before) and recompile as needed.
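
For concreteness, a sketch of that reachability computation over an assumed collector output -- the CodegenUnit struct and field names here are illustrative, not rustc's actual types:

use std::collections::{HashMap, HashSet};

// Assumed collector output: for each codegen unit, the symbols it
// defines and the symbols it references.
struct CodegenUnit {
    name: String,
    defines: HashSet<String>,
    references: HashSet<String>,
}

// Symbols that must stay external: exported from the crate, or defined
// in one CGU but referenced from another. Everything else a CGU defines
// can be marked internal, enabling more LLVM optimization.
fn external_symbols(cgus: &[CodegenUnit], exported: &HashSet<String>) -> HashSet<String> {
    let mut defined_in: HashMap<&str, &str> = HashMap::new();
    for cgu in cgus {
        for sym in &cgu.defines {
            defined_in.insert(sym, &cgu.name);
        }
    }

    let mut external = exported.clone();
    for cgu in cgus {
        for sym in &cgu.references {
            // A cross-CGU reference forces the symbol to stay external.
            if defined_in.get(sym.as_str()).map_or(false, |def| *def != cgu.name) {
                external.insert(sym.clone());
            }
        }
    }
    external
}

Because this runs on the collector's output, before any LLVM module exists, it does not require holding all modules in memory at once.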

The wrong way is to keep the current setup, which is also very clean in its own way, but only enable it if (a) incremental is off and (b) the number of codegen-units is small. Then basically we can kind of disable the "pipelining" if it's not going to buy us much, and trade it for better optimization.

One bigger complication

We have to restructure the driver and the way it separates "phase 4" (LLVM generation) from "phase 5" (LLVM optimization). This distinction isn't really that meaningful, and eliminating it is, after all, the whole point of this bug.

The overall plan

My general plan is basically to start up a thread (or many threads) that will do LLVM optimization early on, basically before phase 4. These threads will share a fixed-size queue (with a size ~ the number of cores or something) with the LLVM generation threads. The generation threads will generate IR and then push it into the queue; if the queue is full, they will block. The other threads will take things out of the queue and process them. What is now called "phase 5" would thus not be starting the threads but rather joining them.

To handle the "internalize symbols" stuff, we might tweak this process a bit -- e.g., if there is a very small number of CGUs, then we could delay starting the worker threads, accumulate the CGUs into a vector, process the vector in a big chunk, and then fire off all the completed results to the worker threads (a sketch of this follows below).
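
As a sketch of that tweak, with hypothetical Cgu/Module stand-ins, hypothetical helpers, and an assumed threshold (none of these names are real rustc items):

// Hypothetical stand-ins; the real types and functions live in rustc.
struct Cgu;
struct Module;
fn translate(_cgu: Cgu) -> Module { Module }
fn internalize_symbols(_modules: &mut [Module]) {}
fn send_to_workers(_module: Module) { /* push into the bounded queue */ }

// Assumed cutoff, not a real rustc constant.
const PIPELINE_THRESHOLD: usize = 16;

fn drive_codegen(cgus: Vec<Cgu>) {
    if cgus.len() <= PIPELINE_THRESHOLD {
        // Few CGUs: accumulate every module first so the internalize-
        // symbols pass can see them all, then hand the finished batch
        // over to the worker threads.
        let mut modules: Vec<Module> = cgus.into_iter().map(translate).collect();
        internalize_symbols(&mut modules);
        for m in modules {
            send_to_workers(m);
        }
    } else {
        // Many CGUs: stream each module into the queue as soon as it is
        // translated, trading the internalization for pipelining.
        for cgu in cgus {
            send_to_workers(translate(cgu));
        }
    }
}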

Pieces

If you're interested, there are probably a couple of independent PRs:

  • Refactor the phase4/phase5 setup in the driver (without changing how things work internally that much).
  • Refactor the "internalize symbols" stuff to use the collector (if we go that route).

Or maybe some other path. This is not an easy refactor, but it seems like a fun one to me!

@nikomatsakis added the E-mentor label on May 25, 2017
@krstoff

krstoff commented May 26, 2017

Hi, I'm interested in doing this. :) @nikomatsakis

@nikomatsakis
Contributor

@krstoff great! let me know if I can be of help. I think probably the best place to start is just by reshuffling the "phase4/phase5" setup so that LLVM optimizations aren't something the user can separately invoke.

@nikomatsakis
Contributor

@krstoff any updates? I'm thinking of taking a crack at some preliminary refactoring here, but I don't want to step on your toes.

@sophiajt
Contributor

sophiajt commented Jul 3, 2017

@nikomatsakis - sounds like this one might be open for people jumping in. Would still be great to see this one go in.

@nikomatsakis
Contributor

@michaelwoerister has done some initial refactoring, removing at least one obstacle here.

@nikomatsakis
Contributor

I think he plans to keep working on it.

bors added a commit that referenced this issue Jul 21, 2017
…lvm, r=eddyb

trans: Internalize symbols without relying on LLVM

This PR makes the compiler use the information gathered by the trans collector in order to determine which symbols/trans-items can be made internal. This has the advantages:
+ of being LLVM independent,
+ of also working in incremental mode, and
+ of making it possible not to keep all LLVM modules in memory at the same time.

This is in preparation for fixing issue #39280.

cc @rust-lang/compiler
@Mark-Simulacrum added the I-compilemem and C-cleanup labels on Jul 26, 2017
@alexcrichton
Member

@michaelwoerister this is closed now, right, with #43506 landed?

@michaelwoerister
Member Author

Yes 🎉
