
trans/LLVM: Don't keep all LLVM modules in memory at the same time #39280

Closed
michaelwoerister opened this issue Jan 24, 2017 · 32 comments
Labels
  • A-codegen (Area: Code generation)
  • A-concurrency (Area: Concurrency)
  • A-incr-comp (Area: Incremental compilation)
  • C-cleanup (Category: PRs that clean code up or issues documenting cleanup.)
  • E-mentor (Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.)
  • I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.)
  • I-compiletime (Issue: Problems and improvements with respect to compile times.)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@michaelwoerister
Member

michaelwoerister commented Jan 24, 2017

UPDATE: There are some mentoring instructions below if you're interested in looking into this issue.


NB: Part of the roadmap issue on incremental compilation.

Currently the compiler translates all codegen units into LLVM modules and then runs LLVM on all of them in parallel. This can mean a lot of memory pressure, both in the context of incremental compilation, where there can be hundreds of codegen units for a single crate, and in non-incremental builds with a high number of codegen units -- e.g. for Servo's script crate that's more than 10 GB of peak memory usage (while otherwise it's around 4.5 GB).

There's no real need to keep more LLVM modules in memory than are currently being worked on. Two possible solutions:

  1. If the compiler uses N LLVM threads, start persisting LLVM modules to disk once the (N+1)st has been translated, and reload them later for optimization and codegen. That's probably relatively easy to implement but involves more disk I/O than necessary (a minimal sketch of the scheduling follows after this list).
  2. Translate only N codegen units to LLVM modules at a time, run each of them all the way to an object file, then go back and translate the next codegen unit. This has the advantage that we would not need to temporarily store anything on disk, but we would need to keep the tcx in memory as long as there are still untranslated codegen units. It would also require a much bigger refactoring than the first approach.
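
For illustration, here is a minimal sketch of the first option's scheduling, modeling a serialized module as plain bytes; the names, the /tmp spill path, and optimize_and_codegen are placeholders, the N worker threads are elided, and real bitcode I/O would go through LLVM's own APIs:

use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

// Stand-in for a translated LLVM module: serialized bitcode is just bytes.
type Bitcode = Vec<u8>;

// Keep at most `n` modules resident; spill the rest to disk as bitcode
// and reload each one only when a slot frees up.
fn run_llvm_with_spilling(modules: Vec<(String, Bitcode)>, n: usize) {
    let mut resident: VecDeque<(String, Bitcode)> = VecDeque::new();
    let mut spilled: VecDeque<(String, PathBuf)> = VecDeque::new();

    for (name, bitcode) in modules {
        if resident.len() < n {
            resident.push_back((name, bitcode));
        } else {
            // From the (N+1)st module on, persist to disk instead.
            let path = PathBuf::from(format!("/tmp/{}.bc", name));
            fs::write(&path, &bitcode).expect("failed to spill module");
            spilled.push_back((name, path));
        }
    }

    while let Some((name, bitcode)) = resident.pop_front() {
        optimize_and_codegen(&name, &bitcode);
        // A slot is free again: reload the next spilled module, if any.
        if let Some((name, path)) = spilled.pop_front() {
            let bitcode = fs::read(&path).expect("failed to reload module");
            resident.push_back((name, bitcode));
        }
    }
}

fn optimize_and_codegen(name: &str, _bitcode: &[u8]) {
    println!("optimizing and emitting object code for {}", name);
}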

Any other ideas?

cc @rust-lang/compiler @rust-lang/tools @nagisa

@michaelwoerister michaelwoerister added A-codegen Area: Code generation A-compiler A-concurrency Area: Concurrency A-incr-comp Area: Incremental compilation I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 24, 2017
@nrc
Member

nrc commented Jan 24, 2017

Will this actually cause problems? Unless there are too many threads (in which case that should be addressed), it feels like the OS will handle this better by paging to disk than we will by (effectively) doing so manually. Also, for many devs, 10GB is not too much memory usage and anything we do here (unless we're really smart) will slow down compilation for those users.

@alexcrichton
Member

My first thought would be the latter: consider translation part of a "work unit" and don't actually start it until it's ready to go all the way through LLVM. That way we cap the number of active LLVM modules at any one point in time.

I think @nrc's got a good point, though, in that using the disk as a cache may be better left to the OS rather than the compiler itself (until proven otherwise, of course).

@michaelwoerister
Member Author

michaelwoerister commented Jan 25, 2017

Some data points:

Will this actually cause problems?

It led to my not being able to compile the script crate on Linux with 8 GB. Maybe if I had more swap space it would have worked. (EDIT: The VM had 4 GB of swap space.)

Unless there are too many threads (in which case that should be addressed)

If everything fits into memory at once, having hundreds of parallel LLVM threads does not seem to have a detrimental effect (running 140 threads was a few percent faster than having 4 threads on a 4-core processor). OTOH, I would suspect that paging would completely destroy performance if many threads had to fight for RAM.

Also, for many devs, 10GB is not too much memory usage and anything we do here (unless we're really smart) will slow down compilation for those users.

That's certainly a valid concern!

@retep998
Member

If you really want to allow the OS to do a good job of paging out memory, you need to ensure that modules don't share pages with each other. If a given module is spread out across your memory with lots of fragmentation, then you'll have to have a lot more memory paged in than necessary, reducing the effectiveness of the OS paging stuff out. Also, there are ways to hint to the OS that you don't need some memory at the moment, so that it is more willing to page it out sooner in favor of keeping other pages resident.
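
For illustration, a minimal sketch of such a hint on a Unix target, assuming the libc crate; the 16 MiB arena size is arbitrary, and MADV_DONTNEED discards the contents of anonymous memory, so this particular flag only fits once a module's object file has already been emitted:

use std::ptr;

fn main() {
    // One module's arena: mmap returns page-aligned memory, so the module
    // does not share pages with unrelated allocations.
    let len = 16 * 1024 * 1024;
    unsafe {
        let arena = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(arena, libc::MAP_FAILED);

        // ... build and codegen the module inside `arena` ...

        // Once the object file is written the contents are dead weight:
        // this lets the kernel reclaim the pages immediately instead of
        // having to page them out.
        libc::madvise(arena, len, libc::MADV_DONTNEED);

        libc::munmap(arena, len);
    }
}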

@michaelwoerister
Member Author

It seems to me like letting the OS take care of this via paging is really the most efficient and most easily implemented solution. But I'm also a bit concerned about rustc getting a reputation for being a memory hog, even if unfairly. High memory usage also means fewer crates can be compiled by a 32-bit compiler (although the script crate, for example, is already above 2 GB before trans, I think).

@hanna-kruppe
Contributor

hanna-kruppe commented Jan 25, 2017

Would option 2 actually be any slower or otherwise less efficient? It doesn't seem to do any additional work. The tcx would remain in memory longer, but since (in the case of huge crates with many modules) the proposed change saves a lot of memory on LLVM modules, that shouldn't hurt much, if at all. Besides, in the easy case (enough threads to trans all CGUs in parallel), the tcx can still be dropped as soon as the IR has been generated and before the LLVM passes run.

As @retep998 outlines, it is not guaranteed that the OS will actually do a perfect job of paging modules in and out. So ignoring option 1, which does a lot of unnecessary disk I/O, the choice seems to be between doing the same work and either:

  • allocating all the memory up-front and praying that the OS handles it gracefully, or
  • allocating and freeing memory as needed, hopefully avoiding the need for paging in the first place.

@nikomatsakis
Contributor

I'm a bit confused. It seems to me that there are three things that need to be done:

  • Translation from HIR to LLVM IR.
  • Optimization and transformation of LLVM IR.
  • Generate .o file.

Currently, we do the first one for everything. Then we start some threads to do the second one -- at what point do we do the third one? (I can't remember.) In particular, do we wait for all the threads to finish?

@michaelwoerister
Member Author

at what point do we do the third one?

That's done by the last LLVM pass, IIRC.

@nikomatsakis
Contributor

@michaelwoerister so, in that case, I disagree with the assessment of "Let the O/S do it". In particular, it seems likely that "optimization and transformation of the LLVM IR" takes a lot of (temporary) memory. If we start N threads (one per CGU) and those are all going simultaneously, that's going to lead to a larger peak than is necessary. But if we throttle starting at phase 2, this will reduce the peak, right?

In particular, the O/S is good at paging memory that is no longer used, but if we start N threads, all that memory is in active use ("working set" is large). If we throttle, working set is smaller, and we can free as we finish each item (since .o file is generated), no?

It might reduce the peak further to do the full pipeline one CGU at a time, but maybe not, because as it is we can free the HIR and tcx entirely once trans is done (and I believe we do so).

@michaelwoerister
Member Author

I would limit the number of concurrent threads to something around the number of CPU cores, since there is no real reason to have one thread per CGU. That would limit the working set. If we keep spawning as many threads as there are CGUs, then yes, paging won't be efficient.

but maybe not, because as it is we can free the HIR and tcx entirely once trans is done (and I believe we do so).

Yes, that is the main disadvantage of that approach. However, as @rkruppe noted, we can free the tcx once the last LLVM module has been generated. If you have a low number of codegen units (as is common in the non-incremental case), then we can free the tcx pretty early. If you have a high number of CGUs, then chances are good that the amount of memory needed per LLVM module is low, so keeping N small LLVM modules in addition to the tcx might not be a big deal.

One advantage of approach (2) is that translation and LLVM can run in parallel for much of the time.

trans  |--CGU1---|--CGU2----|--CGU3--|--CGU4---|--CGU5----|
LLVM 1           |--CGU1----------|            |--CGU4-------|
LLVM 2                      |--CGU2----------|            |--CGU5-----|
LLVM 3                               |--CGU3----------|
-------------------------------------------------------------------------------->

vs

trans  |--CGU1---|--CGU2----|--CGU3--|--CGU4---|--CGU5----|
LLVM 1                                                     |--CGU1----------||--CGU4-------|                      
LLVM 2                                                     |--CGU2----------||--CGU5-----|
LLVM 3                                                     |--CGU3----------|                                 
---------------------------------------------------------------------------------------------->

Trans would have to be throttled, though, so that the queue does not fill up and consume all the memory again.

@nikomatsakis
Contributor

@michaelwoerister that seems reasonable. Though I think that restructuring in this way is a mild amount of work, whereas limiting the number of worker threads is very easy, right?

In any case, I imagine we would basically start N threads all pulling from a single fixed-length channel. Each item they pull out is an LLVM module, so they would process it. Meanwhile, trans would be doing a loop like this:

// std's bounded channel: `send` blocks while the buffer is full
let (tx, rx) = std::sync::mpsc::sync_channel(1);
let workers = start_worker_threads(rx); // N threads pulling LLVM modules
for item in work_items {
    let llvm_ir = translate(item);
    tx.send(llvm_ir).unwrap(); // will block if the channel is full
}
drop(tx); // close the channel so the workers drain it and exit
free_tcx_etc();
join_threads(workers);

Right? Seems nice.
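
For completeness, a self-contained sketch of both halves using only std; the worker count, CGU names, and printouts are placeholders, and since std's Receiver is not Sync the workers share it behind an Arc<Mutex<..>> (a real implementation might well use a different queue):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    // Bounded channel of length 1: `send` blocks while the buffer is full.
    let (tx, rx) = mpsc::sync_channel::<String>(1);
    let rx = Arc::new(Mutex::new(rx));

    // N worker threads, all pulling LLVM "modules" off the same queue.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only long enough to receive one item.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok(module) => println!("worker {} optimizing {}", id, module),
                    Err(_) => break, // sender dropped: queue drained, exit
                }
            })
        })
        .collect();

    for cgu in ["cgu1", "cgu2", "cgu3", "cgu4", "cgu5"] {
        // Stand-in for translate(item); blocks while a module is queued.
        tx.send(format!("LLVM IR for {}", cgu)).unwrap();
    }
    drop(tx); // free_tcx_etc() would go here; dropping tx closes the channel

    for w in workers {
        w.join().unwrap();
    }
}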

@nikomatsakis
Contributor

(Of course we could make the fixed-length channel longer, too; any bounded length will serve to throttle.)

@michaelwoerister
Member Author

Yes, that's pretty much how I imagined it.

@michaelwoerister
Member Author

Though I think that restructuring in this way is a mild amount of work, whereas limiting number of worker threads is very easy, right?

Yes.

frewsxcv added a commit to frewsxcv/rust that referenced this issue Feb 7, 2017
…, r=nikomatsakis

back: Limit the number of LLVM worker threads.

This should fix issue rust-lang#39568.
Also see rust-lang#39280.

r? @nikomatsakis
@SimonSapin
Contributor

I hit this independently when experimenting with incremental compilation on Servo’s CI: servo/servo#15565 (comment). On a builder with 8 GB of RAM and no swap, rustc crashed with:

fatal runtime error: out of memory
error: Could not compile `script`.

@nikomatsakis
Contributor

This feels like it would be a pretty easy thing to fix. I'll try to either take a crack at it or write up some mentoring instructions tomorrow. I am behind on incremental PRs -- I had hoped to get to it today, but spent the whole day reading notifications instead. Yay for 3-day weekends! :)

@SimonSapin
Contributor

@nikomatsakis Done.

servo/servo#15565 (comment)
http://build.servo.org/builders/arm32/builds/5827/steps/compile/logs/stdio

time: 0.392; rss: 97MB	parsing
time: 0.000; rss: 97MB	recursion limit
time: 0.000; rss: 97MB	crate injection
time: 0.001; rss: 98MB	plugin loading
time: 0.000; rss: 98MB	plugin registration
time: 4.876; rss: 650MB	expansion
time: 0.000; rss: 650MB	maybe building test harness
time: 0.038; rss: 650MB	maybe creating a macro crate
time: 0.000; rss: 650MB	checking for inline asm in case the target doesn't support it
time: 0.188; rss: 650MB	early lint checks
time: 0.043; rss: 650MB	AST validation
time: 0.947; rss: 868MB	name resolution
time: 0.111; rss: 868MB	complete gated feature checking
time: 0.684; rss: 1103MB	lowering ast -> hir
time: 0.107; rss: 1017MB	indexing hir
time: 0.039; rss: 1017MB	attribute checking
time: 0.026; rss: 806MB	language item collection
time: 0.129; rss: 815MB	lifetime resolution
time: 0.000; rss: 815MB	looking for entry point
time: 0.003; rss: 815MB	looking for plugin registrar
time: 0.452; rss: 898MB	region resolution
time: 0.075; rss: 898MB	loop checking
time: 0.053; rss: 898MB	static item recursion checking
time: 1.298; rss: 948MB	compute_incremental_hashes_map
time: 0.000; rss: 948MB	load_dep_graph
time: 0.138; rss: 963MB	stability index
time: 0.331; rss: 990MB	stability checking
time: 0.548; rss: 1063MB	type collecting
time: 0.208; rss: 1067MB	variance inference
time: 0.220; rss: 1124MB	impl wf inference
time: 0.572; rss: 1188MB	coherence checking
time: 0.726; rss: 1193MB	wf checking
time: 2.519; rss: 1262MB	item-types checking
time: 15.323; rss: 1806MB	item-bodies checking
time: 0.000; rss: 1806MB	drop-impl checking
time: 1.276; rss: 1906MB	const checking
time: 1.114; rss: 1906MB	privacy checking
time: 0.283; rss: 1911MB	intrinsic checking
time: 0.114; rss: 1911MB	effect checking
time: 0.417; rss: 1917MB	match checking
time: 0.196; rss: 1919MB	liveness checking
time: 0.928; rss: 1924MB	rvalue checking
time: 2.293; rss: 2628MB	MIR dump
  time: 0.297; rss: 2649MB	SimplifyCfg
  time: 0.474; rss: 2650MB	QualifyAndPromoteConstants
  time: 0.601; rss: 2650MB	TypeckMir
  time: 0.039; rss: 2650MB	SimplifyBranches
  time: 0.136; rss: 2650MB	SimplifyCfg
time: 1.547; rss: 2650MB	MIR cleanup and validation
time: 1.984; rss: 2658MB	borrow checking
time: 0.023; rss: 2659MB	reachability checking
time: 0.328; rss: 2662MB	death checking
time: 0.000; rss: 2662MB	unused lib feature checking
time: 2.008; rss: 2662MB	lint checking
time: 0.000; rss: 2662MB	resolving dependency formats
  time: 0.033; rss: 2662MB	NoLandingPads
  time: 0.133; rss: 2662MB	SimplifyCfg
  time: 0.373; rss: 2678MB	EraseRegions
  time: 0.069; rss: 2678MB	AddCallGuards
  time: 0.991; rss: 2713MB	ElaborateDrops
  time: 0.033; rss: 2713MB	NoLandingPads
  time: 0.282; rss: 2714MB	SimplifyCfg
  time: 0.185; rss: 2714MB	InstCombine
  time: 0.082; rss: 2714MB	Deaggregator
  time: 0.038; rss: 2714MB	CopyPropagation
  time: 0.241; rss: 2715MB	SimplifyLocals
  time: 0.092; rss: 2717MB	AddCallGuards
  time: 0.033; rss: 2717MB	PreTrans
time: 2.586; rss: 2717MB	MIR optimisations
  time: 0.683; rss: 2819MB	write metadata
  time: 5.005; rss: 2895MB	translation item collection
  time: 1.267; rss: 2939MB	codegen unit partitioning
  time: 0.674; rss: 5032MB	internalize symbols
time: 47.199; rss: 5033MB	translation
time: 0.001; rss: 5033MB	assert dep graph
fatal runtime error: out of memory
error: Could not compile `script`.

@michaelwoerister
Member Author

Just to follow up: the simple "fix" I implemented will only make swapping less of a problem, since fewer threads will compete for the physical memory available. If (RAM + swap space) is less than the total memory needed by rustc, we still expect to crash.

@sophiajt
Contributor

It's been a while since this issue has been updated. What's the current status?

@nikomatsakis
Contributor

Some refactoring has been done, but the main change remains undone. I've been meaning for a while now to write up a few notes and try to mentor a PR for this.

@nikomatsakis
Contributor

OK, so, I think this would be an awesome bug for someone who is interested in helping out with incremental. It's going to take a certain amount of refactoring, but a lot of it is done, I think. To be honest, I'm not 100% sure what the best path is here, so let me lay out a few thoughts and see if anyone is interested in picking it up.

The big goal

Right now, we do the following to generate code:

  1. loop over all Rust "codegen-units" and generate the LLVM bitcode for all of them
  2. in non-incremental mode, do a little reachability optimization I'll discuss in a bit
  3. free the "tcx" (type context), which contains the HIR, types, and a bunch of compiler state
    • the previous bits of code were part of "phase 4" in the driver, which itself was part of "phase 3"
    • once we return from the phase 3 function, the tcx will have been freed and will no longer exist
  4. start up N threads and parcel out those codegen-units to the various threads, letting them do optimizations

The problem is that storing the LLVM bitcode for ALL the modules leads to a huge memory spike. This in turn leads to slow compilation. How slow? Hard to say. I see that with incremental compilation the base cost of building the script crate is 2x that of a non-incremental build (at least if I remember correctly), but I'm not sure how much of that is attributable to this spike in memory usage. Regardless, it's a problem.

One (relatively minor) complication

One complication is step 2 in the list above. If we are not in incremental mode, then we have this bit of code that figures out which functions wind up getting used by other codegen-units and marks them as "external", and the rest as "internal". Marking things as internal permits more optimization. We don't do this in incremental mode, though, because we want to allow other modules to be updated independently.

The problem is that this pass inherently requires us to have built all codegen-units so we can walk them, and we have to do that before we do any optimizations. It's sort of the opposite of what we want.

There are two ways to fix this. The right way, and the wrong way. The right way is probably to modify the "trans collector" (which is the thing that creates the codegen-units in the first place) and have IT figure out what must be marked as private/public. This would allow us to do the optimization even in incremental mode (though we might choose not to), since we could easily detect when things have changed (i.e., when something must now be public that was private before) and recompile as needed.
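
For concreteness, a sketch of that reachability computation over an assumed collector output -- the CodegenUnit struct and field names here are illustrative, not rustc's actual types:

use std::collections::{HashMap, HashSet};

// Assumed collector output: for each codegen unit, the symbols it
// defines and the symbols it references.
struct CodegenUnit {
    name: String,
    defines: HashSet<String>,
    references: HashSet<String>,
}

// Symbols that must stay external: exported from the crate, or defined
// in one CGU but referenced from another. Everything else a CGU defines
// can be marked internal, enabling more LLVM optimization.
fn external_symbols(cgus: &[CodegenUnit], exported: &HashSet<String>) -> HashSet<String> {
    let mut defined_in: HashMap<&str, &str> = HashMap::new();
    for cgu in cgus {
        for sym in &cgu.defines {
            defined_in.insert(sym, &cgu.name);
        }
    }

    let mut external = exported.clone();
    for cgu in cgus {
        for sym in &cgu.references {
            // A cross-CGU reference forces the symbol to stay external.
            if defined_in.get(sym.as_str()).map_or(false, |def| *def != cgu.name) {
                external.insert(sym.clone());
            }
        }
    }
    external
}

Because this runs on the collector's output, before any LLVM module exists, it does not require holding all modules in memory at once.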

The wrong way is to keep the current setup, which is also very clean in its own way, but only enable it if (a) incremental is off and (b) the number of codegen-units is small. Then basically we can kind of disable the "pipelining" if it's not going to buy us much, and trade it for better optimization.

One bigger complication

We have to restructure the driver and the way it separates "phase 4" (LLVM generation) from "phase 5" (LLVM optimization). This distinction isn't really that meaningful, and eliminating it is, after all, the whole point of this bug.

The overall plan

My general plan is basically to start up a thread (or many threads) that will do LLVM optimization early on, basically before phase 4. These threads will share a fixed-size queue (with a size ~ the number of cores or something) with the LLVM generation threads. The generation threads will generate IR and then push it into the queue; if the queue is full, they will block. The other threads will take things out of the queue and process them. What is now called "phase 5" would thus not be starting the threads but rather joining them.

To handle the "internalize symbols" stuff, we might tweak this process a bit -- e.g., if there is a very small number of CGUs, then we could delay starting the worker threads, accumulate the CGUs into a vector, process the vector in a big chunk, and then fire off all the completed results to the worker threads (a sketch of this follows below).
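
As a sketch of that tweak, with hypothetical Cgu/Module stand-ins, hypothetical helpers, and an assumed threshold (none of these names are real rustc items):

// Hypothetical stand-ins; the real types and functions live in rustc.
struct Cgu;
struct Module;
fn translate(_cgu: Cgu) -> Module { Module }
fn internalize_symbols(_modules: &mut [Module]) {}
fn send_to_workers(_module: Module) { /* push into the bounded queue */ }

// Assumed cutoff, not a real rustc constant.
const PIPELINE_THRESHOLD: usize = 16;

fn drive_codegen(cgus: Vec<Cgu>) {
    if cgus.len() <= PIPELINE_THRESHOLD {
        // Few CGUs: accumulate every module first so the internalize-
        // symbols pass can see them all, then hand the finished batch
        // over to the worker threads.
        let mut modules: Vec<Module> = cgus.into_iter().map(translate).collect();
        internalize_symbols(&mut modules);
        for m in modules {
            send_to_workers(m);
        }
    } else {
        // Many CGUs: stream each module into the queue as soon as it is
        // translated, trading the internalization for pipelining.
        for cgu in cgus {
            send_to_workers(translate(cgu));
        }
    }
}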

Pieces

If you're interested, there are probably a couple of independent PRs:

  • Refactor the phase4/phase5 setup in the driver (without changing how things work internally that much).
  • Refactor the "internalize symbols" stuff to use the collector (if we go that route).

Or maybe some other path. This is not an easy refactor, but it seems like a fun one to me!

@nikomatsakis added the E-mentor label on May 25, 2017
@krstoff

krstoff commented May 26, 2017

Hi, I'm interested in doing this. :) @nikomatsakis

@nikomatsakis
Contributor

@krstoff great! let me know if I can be of help. I think probably the best place to start is just by reshuffling the "phase4/phase5" setup so that LLVM optimizations aren't something the user can separately invoke.

@nikomatsakis
Contributor

@krstoff any updates? I'm thinking of taking a crack at some preliminary refactoring here, but I don't want to step on your toes.

@sophiajt
Contributor

sophiajt commented Jul 3, 2017

@nikomatsakis - sounds like this one might be open for people jumping in. Would still be great to see this one go in.

@nikomatsakis
Contributor

@michaelwoerister has done some initial refactoring, removing at least one obstacle here.

@nikomatsakis
Contributor

I think he plans to keep working on it.

bors added a commit that referenced this issue Jul 21, 2017
…lvm, r=eddyb

trans: Internalize symbols without relying on LLVM

This PR makes the compiler use the information gathered by the trans collector in order to determine which symbols/trans-items can be made internal. This has the advantages:
+ of being LLVM independent,
+ of also working in incremental mode, and
+ of making it possible not to keep all LLVM modules in memory at the same time.

This is in preparation for fixing issue #39280.

cc @rust-lang/compiler
@Mark-Simulacrum added the I-compilemem and C-cleanup labels on Jul 26, 2017
@alexcrichton
Member

@michaelwoerister this is closed now, right, with #43506 landed?

@michaelwoerister
Member Author

Yes 🎉
