Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module loading slow due to method insertion and external target verification #48092

Open
topolarity opened this issue Jan 3, 2023 · 20 comments
Labels
compiler:latency Compiler latency

Comments

@topolarity
Copy link
Member

topolarity commented Jan 3, 2023

Now that #47184 is landed (🎉 ), load time has become a significant fraction of TTFX. At @staticfloat's recommendation, I wanted to share some early profiling results.

Command: using CairoMakie
Platform: x86_64-darwin
Total load time: 11.78s

The big contributors to load time appear to be:

  1. jl_verify_edges, 56% (6.54s): These calls are fast, but we are severely strained by their quantity. In total, 180k external targets in CairoMakie and its dependencies trigger 180k calls to jl_matching_methods.
  2. jl_insert_methods, 21% (2.44s): The slowest (>40ms, up to 500ms) calls to insert new methods make the biggest contribution here, accounting for more than 80% of the total call time.

The next major contributor would be jl_restore_package_image_from_stream_at 0.95s (8% of load time), but I haven't instrumented it in detail. Depending on I/O and dyld cache details, I've also measured up to 20% of load time spent in jl_dlopen. I doubt too much can be done to improve that number, and once caches are warm it typically falls to ~1%.

Tracing was done with Tracy. You can examine the trace file locally or in the browser.

To see what was instrumented, take a look at this commit.

@topolarity
Copy link
Member Author

Some misc. thoughts on which packages are most likely to be affected:

(1) is where we verify edge targets in external modules, and it strongly dominates the load time for packages with many types that use definitions in external packages (e.g. Tuples, Generators, and Arrays using definitions provided by Base).

For example, Makie eventually calls Base.iterate with over 217 Generator types and 385 Tuple types. The result is thousands of external targets to defs in Base:

julia> using PkgCacheInspector
julia> cf = info_cachefile("Makie")
julia> mis = cf.external_targets[2:3:end]
# Print the (5) most commonly-targeted external defs for Makie
julia> last(sort(collect(countmap([(mi.def.module, mi.def.name) for mi in mis if typeof(mi) === Core.MethodInstance])), by=x->x[2]), 5)
5-element Vector{Pair{Tuple{Module, Symbol}, Int64}}:
               (Base, :iterate) => 1535
           (Base, :getproperty) => 1996
              (Base, :getindex) => 2257
               (Base, :convert) => 2401
 (Base.Broadcast, :Broadcasted) => 2570

(2) is responsible for >90% of the load time for small packages with many simple methods, such as Distributions.jl, FillArrays.jl, and DataStructures.jl. These packages have few edges or external targets, making (1) a non-issue.

@KristofferC
Copy link
Member

KristofferC commented Jan 3, 2023

Using a tracing profiler (which I've mostly seemed used for profiling games) like Tracy is a good idea which I thought about using in Julia since I saw it (actually Telemetry instead of Tracy but same thing) used by Jonathan Blow for his compiler, e.g. https://youtu.be/iD08Vpkie8Q?t=1639.

I think we should fill Julia with profiling zones so you can get a very accurate view of how the system behaves over time.

@quinnj
Copy link
Member

quinnj commented Jan 3, 2023

(@kpamnany has ideas on tracing instrumentation as well, but not sure if there's another issue/discussion where those ideas can come together)

@StefanKarpinski
Copy link
Member

A couple of possibly un/helpful thoughts:

  1. Could we not verify edges and instead trust that they are correct? The failure mode if invalid edges would be crashing, but I'm not sure if that's a case we expect in normal operation or not.
  2. Barring that, could the work be deferred until required? Not sure how feasible/complex that would be.
  3. Would this work benefit from batching? If there's any work that is shared in the very large number of calls, then it might help. The loading process could construct lists of edges to be verified and methods to be inserted and then do a batch call instead of individual calls. Might be possible to sort the batched edges/methods to make the verification/insertion processes faster.

@IanButterworth
Copy link
Member

IanButterworth commented Jan 3, 2023

Could we not verify edges and instead trust that they are correct?

I happened to look into this yesterday with Makie too, and observed that if you force jl_verify_edges to immediately return all as valid, a simple Makie plot window renders successfully quickly. I also saw very few invalid edges when enabling that debug printing. Obviously that's not correct to do, but the speed up is motivating.. There was some discussion in the #ttfx slack channel

@timholy
Copy link
Member

timholy commented Jan 3, 2023

We need to call verify_edges. I'm repeating/expanding on things I've said previously, but Julia has (amazingly cool) features that require us to expect invalidation in the normal course of business:

  • we aggressively and automatically specialize
  • we allow arbitrary downstream users to add new methods to external functions
  • we are interactive and thus can't control exactly what's in the user's environment
  • we attempt guarantee that you "get the same result no matter how you got there": whether you load all the code/packages at the beginning and then trigger compilation, or whether you load, compile, load more, compile more, ..., we try to make sure that the end result is the same.

The last property was the fix to #265 and is also what makes Revise possible. I don't think that's a property we want to give up. Therefore we have to check whether calls that we precompiled in a "clean" environment (the minimal one that was used for precompilation) are also valid in the user's "dirty" environment.

While profiling is a nice idea, I'd be surprised if big gains can be found. The trick, I think, will be to do less work rather than making that work faster. Quoting myself on slack:

I suspect that the main savings are to be had by deferring checking edges. If you load PkgB, any call that dispatches to a function PkgA.f owned by PkgA is external and must be checked. But if you had said using PkgC and both PkgA and PkgB were loaded because they are dependencies of PkgC, then from the standpoint of PkgC those are all internal edges and may not need to be checked; you might be able to just cache the result of jl_verify_edges. Note you may not be able to do this if PkgA had already been loaded into your session before you said using PkgC, because at that point you can't rule out the possibility that some other package added new methods to PkgA.f. I haven't thought through all the details carefully though.

@StefanKarpinski
Copy link
Member

What about batching? Could that perhaps help?

@topolarity
Copy link
Member Author

topolarity commented Jan 3, 2023

Adding an interesting data point (thanks to @maleadt for requesting):
CUDA.jl's load time is dominated by method insertion (2) rather than (1).

In CUDA's case, we spend 78% (3.81s) of 4.90s load time in jl_insert_methods. Only 12% (0.59s) is spent verifying external targets. So it looks like we'll almost certainly want to try to improve both of these load time contributors.

What about batching? Could that perhaps help?

It might be fruitful to aggregate related external targets into a combined data structure (most likely, something tree-ish) for bulk-checking intersections, so that e.g. all iterate(::Tuple, ...) targets in a package can be verified "at once" based on the non-existence of new iterate(::Tuple, ...) definitions, etc.

The intersection would necessarily become more expensive to compute in "dirtier" environments and the implementation would likely be complex, but for common scenarios this might save a lot of work.

@timholy
Copy link
Member

timholy commented Jan 3, 2023

@topolarity I think method insertion is also largely about edges...when you insert a new method in an existing method table, you have to check any callers with edges to things that might be superseded by the new method. But I haven't looked at this in detail; is that consistent with what the profiling shows?

@StefanKarpinski,

What about batching? Could that perhaps help?

Yes, that is a way of saying something similar to my point above about "deferring work." Anything that can be deemed "internal" doesn't need to be checked; likewise if you were in control of when a package got loaded, that effectively makes calls to that package "internal-izable" since it seems plausible we could cache the outcome of edge-checking rather than needing to do it at load time. It's only when a package got loaded sometime in the distant past (and Base is one such "package"... 😮) that you lose control and have to check from scratch.

If someone wants to start tackling this, a potential battle plan would be to hack together a mode of base/loading.jl where you can collect the two edge-related arrays of pkgimages (external_targets and edges in the nomenclature of PkgCacheInspector) for every package that gets loaded when you do something like using GLMakie. Then you could write analysis code to figure out how many of them are effectively internal to GLMakie vs connecting to something external like Base. That would give you a pretty good guess about what fraction of the work you can save. It wouldn't be a ton of work to do this, either; PkgCacheInspector is pretty small and much of the code is stolen/modified from base/loading.jl.

@topolarity
Copy link
Member Author

topolarity commented Jan 3, 2023

@topolarity I think method insertion is also largely about edges...when you insert a new method in an existing method table, you have to check any callers with edges to things that might be superseded by the new method. But I haven't looked at this in detail; is that consistent with what the profiling shows?

It looks like we spend most of the method insertion time computing a few very expensive type intersections, actually.

78% (3.85s) of CUDA load time is spent in jl_typemap_intersection_visitor. In fact, more than half of that time is spent computing type intersections for just the 20 most expensive method insertions (out of 6615).

Here's the trace file and in-browser viewer from loading CUDA.jl.

@KristofferC
Copy link
Member

I bet 5$ that mul! is among those expensive methods.

@oscardssmith
Copy link
Member

I'm guessing \ is also another.

@topolarity
Copy link
Member Author

Updated the CUDA trace to annotate jl_method_table_insert with the inserted method name

mul! methods are indeed very expensive to intersect, along with lmul!, ldiv!, copyto! and dot

@oscardssmith
Copy link
Member

Ah, in other words, it's exactly what we would expect.

@oscardssmith
Copy link
Member

possibly dumb question, but what is the differeence between lmul! and *? (i.e. isn't Matrix .= Number*Matrix already good?

@timholy
Copy link
Member

timholy commented Jan 4, 2023

It looks like we spend most of the method insertion time computing a few very expensive type intersections, actually.

I don't mean to be pedantic, but we use intersection to determine, "is there any possible overlap between these methods?" If so, some things might need to be invalidated. See

julia/src/gf.c

Lines 1967 to 1978 in 4a42367

jl_typemap_visitor(jl_atomic_load_relaxed(&mt->cache), invalidate_mt_cache, (void*)&mt_cache_env);
jl_array_t *leafcache = jl_atomic_load_relaxed(&mt->leafcache);
size_t i, l = jl_array_len(leafcache);
for (i = 1; i < l; i += 2) {
jl_value_t *entry = jl_array_ptr_ref(leafcache, i);
if (entry) {
while (entry != jl_nothing) {
invalidate_mt_cache((jl_typemap_entry_t*)entry, (void*)&mt_cache_env);
entry = (jl_value_t*)jl_atomic_load_relaxed(&((jl_typemap_entry_t*)entry)->next);
}
}
}

The first line is your expensive intersection call, but the reason we need the intersection is so we can "invalidate if we find a nontrivial intersection." The type intersection has no other purpose than for use in evaluating invalidations.

@topolarity
Copy link
Member Author

topolarity commented Jan 4, 2023

I don't mean to be pedantic, but we use intersection to determine, "is there any possible overlap between these methods?" If so, some things might need to be invalidated.

Ah, I only meant to highlight that the computationally-intensive overhead observed in (2) is not iterating over edges (e.g. which would be the case if we were bottlenecked by the invalidation itself, since that iterates over backedges).

Edit: To be clear, the expensive type intersection that I observe in jl_method_table_insert is this call a bit earlier:

julia/src/gf.c

Line 1804 in 4a42367

oldvalue = get_intersect_matches(jl_atomic_load_relaxed(&mt->defs), newentry);

The main product of the all this compute is to have accurate invalidations, agreed - and for the reasons you give above, the work mustn't be avoided.

Thanks for clarifying - Keep the corrections coming!

@timholy
Copy link
Member

timholy commented Jan 5, 2023

Maybe it would be helpful if I also gave one more bit of context:

  • in jl_verify_edges, we're asking the question: "which of my new specializations are invalidated by old code?" (old code = code that was loaded earlier in your session)
  • in jl_insert_methods, we're asking the question "which of my old specializations are invalidated by new code?" (old specializations = compiled code that was around before we loaded the current package)

@KristofferC KristofferC added the compiler:latency Compiler latency label Jan 9, 2023
@timholy
Copy link
Member

timholy commented Jan 30, 2023

Belated idea: while I've mostly thought in terms of "refactoring" to improve load performance, given that we have just two bottlenecks (and neither one of them is disk I/O), could we exploit these profiling results and get a quick win with multithreading? Kind of like @IanButterworth's addition of parallel package precompilation---it didn't make the compiler intrinsically faster, but parallelization reduced the pain considerably.

There seem to be two potential levels we could add it: either at the package level (from the Julia side, i.e., Threads.@spawn) or on the "worklist" level (e.g., in C for the lists of edges/methods that need validation/insertion). To do it from Julia, you'd probably want to build the dependency graph for using PkgX and then traverse the graph, loading a pkgimage once all of its dependencies are loaded. Pkg graphs cannot contain cycles, so this is a well-defined concept. Alternatively, to implement threading at the C level in our two bottlenecks, you'd implement a workqueue over the lists of items to be processed.

One delicate item is locking, which I think is only an issue for method-insertion (not jl_verify_edges): you don't want to be iterating over a list of backedges while some other thread deletes the entire list due to invalidation. MethodTable already has a writelock which I think would likely be necessary to avoid races; it also seems possible we'd need to add one to MethodInstance to protect its backedge list, although we could make manipulation of that list conditional on holding the lock to the corresponding MethodTable. There are so many MethodTables that I'd guess lock contention would not be a big performance issue.

@timholy
Copy link
Member

timholy commented Mar 13, 2023

While profiling is a nice idea, I'd be surprised if big gains can be found.

I'd like to retract this statement now 🙂. @topolarity it's been great seeing all the dividends from your work starting to land ❤️.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:latency Compiler latency
Projects
None yet
Development

No branches or pull requests

7 participants