Compilation bottleneck #14556

Closed
timholy opened this issue Jan 4, 2016 · 37 comments
Labels
performance Must go faster

Comments

@timholy
Member

timholy commented Jan 4, 2016

I don't understand the compile toolchain very well, and I haven't yet gotten good profile info (see https://groups.google.com/d/msg/julia-dev/RKAahOwppNs/Kg0TFx_SBwAJ), but at least for the SubArray tests I'm guessing that the linear scan in literal_val_id, reached via jl_compress_ast, is an enormous bottleneck in compilation. I inserted a debugging line,

diff --git a/src/dump.c b/src/dump.c
index e11bff5..9c3c96f 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -618,6 +618,7 @@ static int is_ast_node(jl_value_t *v)

 static int literal_val_id(jl_value_t *v)
 {
+    jl_printf(JL_STDOUT, "len = %d\n", (int)jl_array_len(tree_literal_values));
     for(int i=0; i < jl_array_len(tree_literal_values); i++) {
         if (jl_egal(jl_cellref(tree_literal_values,i), v))
             return i;

and saw that most of the lengths were around 10k or so. That's a heck of a lot of calls to jl_egal. Is there anything that can be done to reduce this number?

@vtjnash
Member

vtjnash commented Jan 4, 2016

we could put them in an ObjectIdDict, but that may make it slower in the expected usage. that table is supposed to be nearly empty, but it can get populated if code is interpolating non-ast objects into the ast.

perhaps we should switch to using a per-linfo list of literals? the list could re-use the codegen roots list -- it seems likely that these same objects are getting copied there by emit_linfo_root during codegen anyway.
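
As a very rough illustration of the ObjectIdDict idea, a direct-mapped cache keyed by jl_object_id could sit in front of the linear scan so that jl_egal is only consulted once per lookup in the common case. This is a sketch, not a proposed patch: the lit_cache machinery below is made up, while jl_object_id, jl_egal, and literal_val_id are the existing functions.

// Sketch only (dump.c context assumed): cache literal lookups by identity hash,
// falling back to the existing linear scan on a miss.
#define LIT_CACHE_SIZE 4096                /* power of two; size is arbitrary */

typedef struct {
    jl_value_t *val;                       /* a literal already in tree_literal_values */
    int idx;                               /* its index in tree_literal_values */
} lit_cache_entry_t;

static lit_cache_entry_t lit_cache[LIT_CACHE_SIZE];

static int literal_val_id_cached(jl_value_t *v)
{
    size_t h = jl_object_id(v) & (LIT_CACHE_SIZE - 1);
    lit_cache_entry_t *e = &lit_cache[h];
    if (e->val != NULL && jl_egal(e->val, v))
        return e->idx;                     /* hit: one jl_egal call instead of ~10k */
    int idx = literal_val_id(v);           /* miss: existing scan (and append) */
    e->val = v;                            /* cached values stay reachable via tree_literal_values */
    e->idx = idx;
    return idx;
}

The cache would have to be cleared whenever tree_literal_values is swapped out, but in the common case it turns the per-lookup cost from one jl_egal call per list entry into a single call.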

@JeffBezanson
Member

At one point we switched to per-module literal lists since many values ended up being shared by several functions. But I agree it seems silly to have both that and linfo->roots.

10k is way more than I'd expect. I wonder if we can cut that down (e.g. by serializing more things). Or just fully switch to a bytecode representation.

@timholy
Member Author

timholy commented Jan 4, 2016

It will be a little while, but to save anyone else time, I'm collecting the histogram of counts by running make testall1 with

diff --git a/src/dump.c b/src/dump.c
index e11bff5..50ab040 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -6,6 +6,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <assert.h>
+#include <stdio.h>

 #include "julia.h"
 #include "julia_internal.h"
@@ -616,8 +617,21 @@ static int is_ast_node(jl_value_t *v)
         jl_is_labelnode(v) || jl_is_linenode(v) || jl_is_globalref(v);
 }

+int tlvlen[50000] = { 0 };
+
+JL_DLLEXPORT void jl_write_tlvlen(void)
+{
+    FILE *f = fopen("/tmp/tlvlen.bin", "w+");
+    fwrite(tlvlen, sizeof(int), 50000, f);
+    fclose(f);
+}
+
 static int literal_val_id(jl_value_t *v)
 {
+    int n = jl_array_len(tree_literal_values);
+    if (n >= 50000)
+        n = 49999;
+    tlvlen[n] += 1;
     for(int i=0; i < jl_array_len(tree_literal_values); i++) {
         if (jl_egal(jl_cellref(tree_literal_values,i), v))
             return i;
diff --git a/src/julia.h b/src/julia.h
index b93c268..03545fc 100644
--- a/src/julia.h
+++ b/src/julia.h
@@ -1678,6 +1678,8 @@ JL_DLLEXPORT extern const char* jl_ver_string(void);
 JL_DLLEXPORT const char *jl_git_branch(void);
 JL_DLLEXPORT const char *jl_git_commit(void);

+    JL_DLLEXPORT void jl_write_tlvlen(void);
+
 // nullable struct representations
 typedef struct {
     uint8_t isnull;
diff --git a/test/runtests.jl b/test/runtests.jl
index 969c72e..6b7cb00 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -72,3 +72,5 @@ cd(dirname(@__FILE__)) do
     @unix_only n > 1 && rmprocs(workers(), waitfor=5.0)
     println("    \033[32;1mSUCCESS\033[0m")
 end
+
+ccall(:jl_write_tlvlen, Void, ())

Will report back when it finishes.

@JeffBezanson
Member

It might also make sense to count the total iterations spent in that loop (# of jl_egal calls). Or maybe you have that already?
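
For reference, a hypothetical variant of the instrumentation above that counts comparisons instead of histogramming list lengths (the counter name is made up; the loop body mirrors the existing literal_val_id):

// Hypothetical: count every jl_egal comparison performed while scanning
// tree_literal_values; the total could be dumped alongside the histogram.
static unsigned long egal_calls_in_literal_val_id = 0;

static int literal_val_id(jl_value_t *v)
{
    size_t n = jl_array_len(tree_literal_values);
    for (size_t i = 0; i < n; i++) {
        egal_calls_in_literal_val_id++;            /* one jl_egal call per iteration */
        if (jl_egal(jl_cellref(tree_literal_values, i), v))
            return (int)i;
    }
    jl_cell_1d_push(tree_literal_values, v);       /* append the new literal, as before */
    return (int)jl_array_len(tree_literal_values) - 1;
}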

@timholy
Member Author

timholy commented Jan 4, 2016

Yeah, I thought of that after I launched the first one. FWIW here are the results:
[image: tlvlen histogram]

Median value was 10, mean value was 254.

Want me to collect the # of jl_egal calls too?

@timholy
Member Author

timholy commented Jan 4, 2016

Wait, I was totally borked on those statistics. Median value is 10024, mean value is 11305.

@timholy
Member Author

timholy commented Jan 4, 2016

Plotting the cumulative histogram on a linear scale suggests there's one group of "short lists" (<300) and another with "long lists" (between about 7k and 30k):
[image: tlvlen_linear (cumulative histogram, linear scale)]

@timholy
Member Author

timholy commented Jan 4, 2016

If you just count jl_egal calls, it's a radically different picture:
[image: tlvlen_egalcounts (counts weighted by number of jl_egal calls)]

Median value 128, mean value 1585.

@timholy
Member Author

timholy commented Jan 4, 2016

I decided to profile the test suite. I had to exclude a few tests (profile because it calls Profile.clear(); also "cmdlineargs", "file", "libdl", and "parallel" because they gave problems for reasons I didn't track down). I ran this on release-0.4 (0.4.3-pre6), because I don't trust profiling on 0.5 due to #14281. I collected one sample every 10 ms (100 samples/sec).

Analysis script (whose purpose was to show "flat" results that don't suffer from the duplicates-due-to-recursion problem as illustrated by dumbsum) and results are here: https://gist.github.com/timholy/67762c65cbfe4226b43a. I only included lines that accounted for at least 100 samples, i.e., at least 1s of time. Scroll to the end to see the worst offenders.

While it's a shame that so many important lines lack good backtrace info (here's hoping that @Keno's work will help), to me it looks as if jl_egal is the single worst "low-level" offender, accounting for more than 8% of the total number of samples collected. That, combined with the fact that jl_egal is clearly being called far more often than anyone expected, suggests there's an opportunity for measurable improvement.

@Keno
Member

Keno commented Jan 4, 2016

Unfortunately, I'm probably not much help here. I briefly implemented a more efficient/accurate profiler on top of Gallium, but it only worked on OS X and had a number of bugs that made it essentially unusable. That said, I do plan to revisit the topic of profiling once I've made some progress on the debugger front.

I can say, however, that I concur with your analysis of jl_egal being a significant chunk of our compile time. I don't have the traces anymore, but I remember pointing this fact out to @JeffBezanson.

@timholy
Member Author

timholy commented Jan 4, 2016

Bummer about the backtraces. Hopefully we'll get there some day. I also just remembered that long ago I convinced myself (rightly or wrongly) that most of the borked backtraces are almost surely from our many C functions that are declared static---they're not exported in the .so file, and as a consequence our means of looking them up fail.

@Keno
Member

Keno commented Jan 4, 2016

Well, the backtraces shouldn't be super borked. What OS are you on?

@timholy
Member Author

timholy commented Jan 4, 2016

See more info here.

@timholy
Member Author

timholy commented Jan 4, 2016

Kubuntu 14.04.3 LTS, Trusty Tahr

@Keno
Member

Keno commented Jan 4, 2016

Hmm, should be fine then. On OS X you have to jump through some hoops to make the dSYM available such that we can find it to get at the debug line info.

@timholy
Member Author

timholy commented Jan 4, 2016

This affects every computer I've tested that's running Kubuntu. I seem to remember noticing a CentOS machine did better, however (can't test anymore, that machine was wiped clean and replaced with Kubuntu).

I could try to set up an account for you on my machine---it won't be trivial because here they're pretty draconian about the firewall, but I suspect I can find a way.

@KristofferC
Member

I did a full run of the tests (on latest master) with VTune XE 2016. If anyone wants to download the traces, I put them up here (80M file, maybe I should have used a longer trace interval): https://dl.dropboxusercontent.com/u/14340581/julia_tests_vtune.tar.gz. Note that if too many people download the file, Dropbox will likely disable the download.

Here are a few screenshots:

Bottom up tree, jl_egal stands out a lot:
[image: VTune bottom-up tree]

Stuff getting called from jl_egal:
[image: VTune callees of jl_egal]

@timholy
Member Author

timholy commented Jan 4, 2016

I get a 403 from the last two links---I didn't try the first link.

But that's what I was guessing from your email to julia-dev (I take it you fixed the line numbers?). Does that mean that 11.2% of the total test-suite time is spent on that one line?? Surprising that the line above it doesn't take any appreciable time, given that this is a pointer comparison and the line above it also calls jl_typeof. Or perhaps things are being reordered, and the cost of both jl_typeof calls is being credited to that line?

Also, can you tell how many calls are on each line? If it usually exits at line 294, that might explain why the later lines aren't so expensive despite the fact that they involve more operations.

Another oddity: the fact that this appears at all in the backtraces leads me to guess that jl_egal is not inlined despite Ulrich's careful work in #9732. CC @drepper.

@KristofferC
Member

Updated screenshot links, do they work now?

@timholy
Member Author

timholy commented Jan 4, 2016

Nope.

@vchuravy
Member

vchuravy commented Jan 4, 2016

@timholy After refreshing they worked for me.

@timholy
Member Author

timholy commented Jan 4, 2016

Thanks, @vchuravy. Wow, that is a standout.

@KristofferC
Member

I'm confused... does the screenshot below show that one single instruction takes 11% of the total CPU time of the whole test suite?

[image: VTune source/assembly view of jl_egal]

@timholy
Member Author

timholy commented Jan 4, 2016

CC @ArchRobison, who I suspect is the local expert on VTune.

@drepper
Contributor

drepper commented Jan 4, 2016


This is statistical sampling. The profile tick might hit the same instruction over and over (in this case probably because of cache misses), but what it really means is that the enclosing basic block, as part of a larger construct like a function, accounts for something on the order of 11% of the total time.

I am actually looking at profiles myself right now. Will see what I can find. FWIW, I see jl_egal at the top of the list myself when using perf:

Overhead  Command  Shared Object      Symbol
  7.86%   julia    libjulia.so        [.] jl_egal
  5.71%   julia    libjulia.so        [.] jl_method_table_assoc_exact
  3.76%   julia    libjulia.so        [.] jl_subtype_le
  2.62%   julia    libjulia.so        [.] jl_has_typevars__.constprop.49
  2.32%   julia    libc-2.22.so       [.] int_malloc
  2.11%   julia    libjulia.so        [.] llvm::DenseMapBase<llvm::DenseMap<unsigned int, llvm::SmallVector<llvm::Re
  1.68%   julia    libjulia.so        [.] lookup_type_idx.isra.21
  1.67%   julia    libjulia.so        [.] jl_apply_generic
  1.42%   julia    libc-2.22.so       [.] int_free
  1.27%   julia    libc-2.22.so       [.] malloc
  1.16%   julia    libjulia.so        [.] jl_type_intersection_matching
  1.14%   julia    libjulia.so        [.] jl_serialize_value
  1.08%   julia    libjulia.so        [.] jl_tuple_subtype.isra.8
  0.92%   julia    libjulia.so        [.] llvm::RuntimeDyldImpl::resolveRelocations
  0.74%   julia    libjulia.so        [.] ptrhash_get
  0.62%   julia    [kernel.vmlinux]   [k] sys_mprotect
  0.61%   julia    libjulia.so        [.] gc_sweep_inc
  0.61%   julia    libjulia.so        [.] jl_f_get_field
  0.60%   julia    libjulia.so        [.] jl_type_intersect.constprop.37
  0.56%   julia    libjulia.so        [.] jl_f_subtype
  0.55%   julia    libjulia.so        [.] jl_get_binding_
  0.53%   julia    libjulia.so        [.] jl_eqtable_get
If the issue is similar to last time it should be easy to figure out.

@KristofferC
Member

Thanks for the explanation. Also, good that two different tools seem to agree.

@drepper
Contributor

drepper commented Jan 5, 2016

There was indeed again one function call in jl_egal which caused a stack frame to be created. With the following change this can be avoided, and it shaves around 0.5% off the total cost of jl_egal. The function is still at the top, but this is nothing to sneeze at.

I'll prepare a formal pull request in a bit, but first I want to see whether more can be done. Has there ever been a discussion about interposition of symbols at runtime? Given that Julia runs on multiple platforms, this is not something that can be relied on for the project as a whole. Without restricting interposition, functions like jl_egal can never get inlined, since the compiler mustn't assume it knows what the definition looks like. If interposition is restricted, it opens up all kinds of new optimizations. The source code that ends up in libjulia will need more annotation, though, and for maximum effect I expect more complex Makefile rules are needed.

The question is: do people think it's time to do these kinds of optimizations already? Think about premature optimization being evil, etc. If the C/C++ portion is declared sufficiently stable, and people won't be grossed out by some additional ugliness in the source code and perhaps in the Makefile, I think I can provide some patches which will get quite a bit of performance. Of course I'd have concrete performance numbers before a pull request, but I want to get an impression of how such changes would be received.

diff --git a/src/builtins.c b/src/builtins.c
index d50f7ce..be52583 100644
--- a/src/builtins.c
+++ b/src/builtins.c
@@ -210,7 +210,7 @@ JL_DLLEXPORT void jl_pop_handler(int n)

// primitives

-static int bits_equal(void *a, void *b, int sz)
+static int NOINLINE bits_equal(void *a, void *b, int sz)
 {
     switch (sz) {
     case 1:  return *(int8_t*)a == *(int8_t*)b;
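
For context, the point of NOINLINE here is to keep bits_equal from being inlined into jl_egal, so jl_egal itself no longer needs to set up a stack frame on its fast path. NOINLINE is presumably the usual attribute macro, roughly (an assumption; the exact definition in Julia's headers may differ):

/* Assumed definition of NOINLINE, for illustration only. */
#ifdef _MSC_VER
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif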


@tkelman
Contributor

tkelman commented Jan 5, 2016

Guess it depends just how ugly, but there's already a fair amount of platform-specific code in the C runtime, so I'd think platform-specific optimizations would be fair game as long as the changes are maintainable going forward.

@eschnett
Contributor

eschnett commented Jan 5, 2016

I used Instruments on OS X to profile Julia running Base.runtests("int"):
[image: Instruments profile of jl_egal]

The output is sorted by L2 cache misses, and recursive calls to jl_egal are the top hit. Calls to jl_egal from literal_val_id are also expensive, and literal_val_id seems to be the main caller of jl_egal. The most expensive line of code I'm seeing is the first typeof call, i.e. the same basic block as mentioned above. This confirms @drepper's suggestion.

As an outsider I'm wondering: There is a function that compares types for equality, but no function for a "less than" relation or hash function? Does this mean that types are stored mostly in lists and arrays, and not in trees or heaps or hash tables?

@timholy
Member Author

timholy commented Jan 5, 2016

but no function for a "less than" relation or hash function?

See jl_subtype_le, and indeed much of jltypes.c.

If you read from the top of this issue, you'll see that the number of calls to jl_egal greatly exceeds expectations. The best approach (aside from making jl_egal itself faster, which I fully support) is to call it fewer times. That might be achieved by restructuring the things that lie upstream, or by using a hash table.

@JeffBezanson
Member

@eschnett There is jl_object_id, which is a hash function that pairs with jl_egal. However for types these are structural, and do not implement type equality (in the sense of A<:B && B<:A). The type cache in jltypes.c needs to look up types, and it does this by imposing an arbitrary total ordering on a (commonly-occurring) subset of types. Extending this to all types appears to be a hard problem. It would basically require canonicalizing types, which might be harder than subtyping itself (both in the sense of programming effort and computational complexity)!

@JeffBezanson
Member

@drepper Yes jl_egal should be quite stable now and I think we can handle some ugliness there.

@drepper
Contributor

drepper commented Jan 5, 2016


The ugliness will spread, unfortunately. The jl_egal function cannot be optimized much more. I have a few changes, but unless something can be changed functionally I fear there is not much that can be done. The bigger changes I am thinking about would hopefully allow jl_egal to be inlined at all the call sites. The function is called so often that the cost of the function call, with the parameter setup and register restoration, will be significant. This requires (at least on ELF systems) that exported functions are also known under aliases, and that those aliases are marked hidden. Then whole-program optimization or LTO can inline the function. That's really the only way I can see for a significant speedup.
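
For reference, the glibc-style aliasing pattern being described looks roughly like the following. This is a sketch, not a proposed patch: jl_egal_impl is a made-up name and the body is a placeholder.

// Sketch: define the implementation under a hidden name so in-library call
// sites bypass the PLT and LTO can inline it; export the public, interposable
// name as an alias of it.
#include "julia.h"

__attribute__((visibility("hidden")))
int jl_egal_impl(jl_value_t *a, jl_value_t *b)
{
    /* placeholder body; the real jl_egal does a full structural comparison */
    return a == b;
}

extern __typeof(jl_egal_impl) jl_egal __attribute__((alias("jl_egal_impl")));

// In-library callers would then call jl_egal_impl (directly or via a macro),
// while external code keeps calling the interposable, exported jl_egal.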

@vtjnash
Member

vtjnash commented Jan 5, 2016

Would that be equivalent to making it a static inline function in julia.h?

@JeffBezanson
Member

I agree with restricting interposition for jl_egal; replacing it would not be a good idea anyway. However it's also true that the bigger problem is us just calling it too often.

@JeffBezanson
Member

#14656 might fix this.

@timholy
Member Author

timholy commented Jan 14, 2016

Exciting! Looking forward to testing it. Thanks so much for tackling this.
