Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase performance and decrease memory footprint #539

Open
redstar opened this issue Nov 5, 2013 · 28 comments
Open

Increase performance and decrease memory footprint #539

redstar opened this issue Nov 5, 2013 · 28 comments

Comments

@redstar
Copy link
Member

redstar commented Nov 5, 2013

As reported on the newsgroup, LDC is around 4 times slower then DMD.

My personal experience is that LDC is really fast for small files but speed decreases dramatically with larger files, e.g. std.datetime and std.array. At the same time, LDC consumes a huge amount of memory (compare with issue #438)!
A possible explanation for the decreased speed could be that the system has to swap out memory to satisfies LDC requirements.

Memory and performance profiling must be performed to find the root cause(s).

Trass3r added a commit to Trass3r/ldc that referenced this issue Jun 30, 2014
@Trass3r
Copy link
Contributor

Trass3r commented Jun 30, 2014

IrDsymbol::resetAll sucks 13% processing time when passing several modules! Optimized that.
The rest is mainly spent in llvm, 43.5% codegenModule, 33% optimize_module.

The real problem is memory consumption. The command allocates 12GB of memory!

bin\RelWithDebInfo\ldc2.exe -lib -Ildcsrc/runtime/druntime/src -Ildcsrc/runtime/druntime/src/gc -w -d -O3 -release -unittest -Ildcsrc/runtime/phobos std/algorithm.d std/array.d std/ascii.d std/base64.d std/bigint.d std/bitmanip.d std/compiler.d std/complex.d std/concurrency.d std/container.d std/conv.d std/cstream.d std/csv.d std/datetime.d std/demangle.d std/encoding.d std/exception.d std/file.d std/format.d std/functional.d std/getopt.d std/json.d std/math.d std/mathspecial.d std/metastrings.d std/mmfile.d std/numeric.d std/outbuffer.d std/parallelism.d std/path.d std/process.d std/random.d std/range.d std/regex.d std/signals.d std/socket.d std/socketstream.d std/stdint.d std/stdio.d std/stdiobase.d std/stream.d std/string.d std/syserror.d std/system.d std/traits.d std/typecons.d std/typelist.d std/typetuple.d std/uni.d std/uri.d std/utf.d std/uuid.d std/variant.d std/xml.d std/zip.d std/zlib.d std/internal/processinit.d std/internal/uni.d std/internal/uni_tab.d std/internal/unicode_comp.d std/internal/unicode_decomp.d std/internal/unicode_grapheme.d std/internal/unicode_norm.d std/internal/unicode_tables.d std/windows/charset.d std/windows/iunknown.d std/windows/registry.d std/windows/syserror.d std/net/curl.d std/net/isemail.d -oq

@dnadlinger
Copy link
Member

Is the impact of using memset really significant? If so, what about just zeroing the whole vector storage?

@Trass3r
Copy link
Contributor

Trass3r commented Jun 30, 2014

I also thought about that, but it's a vector of pointers to IrDsymbol spread out in the Dsymbols of the AST.
memset should be way easier to optimize for the compiler, will test to see what clang does with it.

@dnadlinger
Copy link
Member

Ah yes, of course – makes sense because of locality, I guess.

@Trass3r
Copy link
Contributor

Trass3r commented Jul 1, 2014

Ok as expected clang is able to use vector instructions either way and VS isn't.
If you get rid of the padding introduced by the bools in the middle it doesn't make much difference performance-wise. It's just more concise.

Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 1, 2014
@dnadlinger
Copy link
Member

@Trass3r: As these are non-virtual structs anyway, I don't really care either way, let's get this in. Do you have any concrete data points on how much an impact the change has overall?

@Trass3r
Copy link
Contributor

Trass3r commented Jul 2, 2014

For the command above reduces runtime by 10-15% (370s vs 310s).
The rest is really just 85% writeModule, i.e. llvm passes.
Memory is the big issue. In the first phase it allocates 10GB with insane speed.

@dnadlinger
Copy link
Member

@Trass3r: You edited your earlier post, right? If not, I now feel pretty stupid for asking for numbers. ;) Regarding memory usage, the same problem occurs with DMD, right? (i.e., you mean frontend semantic analysis with "first phase")

@Trass3r
Copy link
Contributor

Trass3r commented Jul 2, 2014

Yeah I enhanced that post right after I put it in ;)

@Trass3r
Copy link
Contributor

Trass3r commented Jul 2, 2014

Has anyone ever evaluated the "quality" of the IR emitted by ldc? The better it is the less effort for llvm.

@dnadlinger
Copy link
Member

Not particularly thoroughly, no. I have been noticing a few issues such as unnecessary allocas/loads being emitted, but never really worked on tracking down as I would always discover them while working on some serious codegen bug. Improvements in this area would also directly benefit debug executable performance (i.e. -O0 builds).

@redstar
Copy link
Member Author

redstar commented Jul 2, 2014

I started a rough comparison between ldc and clang some time ago. Basically, the IR generated by clang at -O0 looks like the IR from ldc at -O1.
Biggest difference is that ldc allocates space for every parameter and also generates a store. clang does not do this. I think that prevented some optimizations on Win64 but I don't know if tis is still true.

@dnadlinger
Copy link
Member

@redstar: Clang must only do this in simple cases then. Duplicating SSA formation (mem2reg, … ) in each frontend doesn't make a lot of sense, and I clearly remember Chandler saying that in multiple of his talks.

@dnadlinger
Copy link
Member

That being said, maybe we can easily track whether we assign a parameter during semantic analysis and elide some loads/stores if we don't – I'd need to have a look at the code to be sure. Then again, IIRC the ABI transformation stuff is also tied into the parameter allocas.

@Safety0ff
Copy link

I've posted an alternative to std::set / std::list / std::vector for IrDSymbol: https://github.com/Safety0ff/ldc/compare/intdlist

@Safety0ff
Copy link

Ok, I've fixed the bug, I used the same command as Trass3r minus std/datetime.d (I only have 11 gigs of ram.)
Before: 4m56.5s
After: 4m11s
~15% improvement.
Using memset in reset did not yield any improvement, perhaps because I moved the boolean fields to avoid having padding in the middle of the struct?
I did not try using memcpy for the copy ctor.
I'll test with Trass3r's code now.

@Safety0ff
Copy link

Trass3r's code gives a time of: 3m59s
I did not include the memcpy or memset but I did move the booleans to match my code.

@Safety0ff
Copy link

Alright, I created an intrusive SList version, using Trass3r's conclusion that removals aren't random, they're mostly from the back: https://github.com/Safety0ff/ldc/compare/intslist
Performance is on par with the doubly linked list, but still below std::vector.
It uses a tiny fraction less memory than std::vector (dlist > vector > intrusive slist for memory.)

@Trass3r
Copy link
Contributor

Trass3r commented Jul 4, 2014

An intrusive linked list is indeed the optimal struture for this (?!)
But since objects are (almost) never deleted vector is still faster.
Though I'd suspect most loss in the constructor.

@Safety0ff
Copy link

EDIT: nvm.
Deletion is not as rare as we thought, are there stack allocations of these objects?

@Trass3r
Copy link
Contributor

Trass3r commented Jul 4, 2014

Yes 1. Some template symbols.

@Safety0ff
Copy link

Hmm, yea.
The whole "almost always remove from the back" thing seems to mostly due to DMD never releasing memory.
So removal only happens when it uses a stack temporary and it goes out of scope (e.g. TemplateInstance in leastAsSpecialized within dmd2/template.c.)
So when pigs fly and DMD releases memory (ddmd?) the intrusive dlist will have its day. Though the difference would then be unnoticeable since we'd be operating on significantly less objects.

@dnadlinger
Copy link
Member

Actually, this might be an indication that the design needs to be rethought. OTOH, IrDsymbol shouldn't be accessed before codegen, and at that point, the frontend certainly doesn't create or destroy any Dsymbols anymore.

@Safety0ff
Copy link

Trying to estimate what there is to gain from going full lazy I hacked something together here: https://github.com/Safety0ff/ldc/compare/lazyhack (it's not pretty)
Runs in 3m20s compared to: 4m00s for eager vector, 4m10s for eager list, 4m56.5s baseline.
Anyways, as I said, just estimating how much is left to gain.

@Safety0ff
Copy link

OK, well I found out my build settings were bonk for all my previous timings, here are the correct numbers:
baseline: 3m42s
dlist: 3m28.8s
slist: 3m27.8s
vector: 2m55s
super lazy: 2m41.9s

@Safety0ff
Copy link

I started on a lazy underordered_map version but I'm abandoning it for now.
It would save some space but the memory usage issues stem from not releasing memory in dmd.
EDIT: Additional details:
Total memory used in my benchmark command for the ir symbol metadata is ~575MB, only ~0.9% of that is actually written to (even less is actually "used".)
Total usage for the benchmark command ~8.5 GiB.

Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 7, 2014
@Safety0ff
Copy link

I started on a lazy underordered_map version but I'm abandoning it for now.

Ok, I finished it, but it's pretty ugly: https://github.com/Safety0ff/ldc/compare/lazymap
Running time is slightly slower than the lazy slist version (2.6% slower but still faster than vector.)
It uses ~9% less memory by lazily creating the data members in a global hash table.
All of the other methods add data members to the DSymbol object regardless whether they are used.
Since many DSymbols are created and their Ir metadata are unused, this amounts to a fair amount of memory.

@Safety0ff
Copy link

Anyways, for now I think we should go with @Trass3r's vector for simplicity and later it can be decided if one of the more invasive solutions is desired.

Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 16, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 16, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 16, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 16, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 16, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 18, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 21, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 28, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 31, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Jul 31, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Sep 8, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Sep 8, 2014
Trass3r added a commit to Trass3r/ldc that referenced this issue Sep 9, 2014
dnadlinger added a commit that referenced this issue Sep 9, 2014
redstar pushed a commit that referenced this issue Sep 27, 2014
Add various constants/types for Linux/PPC.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants