Module loading slow due to method insertion and external target verification #48092
Comments
Some misc. thoughts on which packages are most likely to be affected:

(1) (`jl_verify_edges`) is where we verify edge targets in external modules, and it strongly dominates the load time for packages with many types that use definitions in external packages (e.g. Tuples, Generators, and Arrays using definitions provided by Base). For example, these are the most commonly targeted external definitions for Makie:

```julia
julia> using PkgCacheInspector, StatsBase   # StatsBase provides countmap

julia> cf = info_cachefile("Makie");

julia> mis = cf.external_targets[2:3:end];

julia> # Print the (5) most commonly-targeted external defs for Makie
       last(sort(collect(countmap([(mi.def.module, mi.def.name) for mi in mis if typeof(mi) === Core.MethodInstance])), by=x->x[2]), 5)
5-element Vector{Pair{Tuple{Module, Symbol}, Int64}}:
            (Base, :iterate) => 1535
        (Base, :getproperty) => 1996
           (Base, :getindex) => 2257
            (Base, :convert) => 2401
  (Base.Broadcast, :Broadcasted) => 2570
```

(2) (`jl_insert_methods`) is responsible for >90% of the load time for small packages with many simple methods, such as Distributions.jl, FillArrays.jl, and DataStructures.jl. These packages have few edges or external targets, making (1) a non-issue.
Using a tracing profiler like Tracy (which I've mostly seen used for profiling games) is a good idea; I've thought about using one in Julia ever since I saw it (actually Telemetry rather than Tracy, but same idea) used by Jonathan Blow for his compiler, e.g. https://youtu.be/iD08Vpkie8Q?t=1639. I think we should fill Julia with profiling zones so you can get a very accurate view of how the system behaves over time.
(@kpamnany has ideas on tracing instrumentation as well, but I'm not sure if there's another issue/discussion where those ideas can come together.)
A couple of possibly un/helpful thoughts:

I happened to look into this yesterday with Makie too, and observed that if you force …
We need to call …

The last property was the fix to #265 and is also what makes Revise possible. I don't think that's a property we want to give up. Therefore we have to check whether calls that we precompiled in a "clean" environment (the minimal one that was used for precompilation) are also valid in the user's "dirty" environment. While profiling is a nice idea, I'd be surprised if big gains can be found. The trick, I think, will be to do less work rather than making that work faster. Quoting myself on Slack:

> …
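A minimal hypothetical example (generic `f`/`g`, not code from any package) of the property being described: a method added later must take effect even in callers compiled earlier, which is exactly why precompiled edges have to be re-checked against the user's current environment.

```julia
# Hypothetical illustration of the #265 property: newly added methods must be
# honored by callers that were compiled earlier, so cached edges can become stale.
f(x) = 1
g(x) = f(x) + 1
g(2)              # compiles g(::Int) with an edge to f(::Any); returns 2

f(x::Int) = 41    # later, a more specific method appears (a "dirtier" world)
g(2)              # must now return 42, so the old edge is invalid and g(::Int)
                  # has to be invalidated and recompiled
```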
What about batching? Could that perhaps help?
Adding an interesting data point (thanks to @maleadt for requesting): in CUDA's case, we spend 78% (3.81s) of the 4.90s load time in method insertion ((2) above).
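One hedged way to quantify how much method-insertion work a given package brings, assuming PkgCacheInspector's cache-file object also exposes an `external_methods` field alongside the `external_targets` field used earlier in this thread:

```julia
# Hedged sketch: count the methods a package's cache file adds to functions owned
# by other modules (assumes an `external_methods` field exists on the result).
julia> using PkgCacheInspector

julia> cf = info_cachefile("CUDA");

julia> length(cf.external_methods)
```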
It might be fruitful to aggregate related external targets into a combined data structure (most likely, something tree-ish) for bulk-checking intersections, so that e.g. all targets of the same callee could be checked at once. The intersection would necessarily become more expensive to compute in "dirtier" environments and the implementation would likely be complex, but for common scenarios this might save a lot of work.
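One possible shape for the grouping part of that idea, sketched at the Julia level with a hypothetical helper (the real implementation would live in the C runtime): bucket the external-target MethodInstances by the generic function they specialize, so each method table would only need to be consulted once per bucket instead of once per target.

```julia
# Hypothetical sketch: group external-target MethodInstances by the function they
# specialize; verification could then walk each method table once per group.
function group_targets(mis)
    groups = Dict{Tuple{Module,Symbol}, Vector{Core.MethodInstance}}()
    for mi in mis
        mi isa Core.MethodInstance || continue
        def = mi.def
        def isa Method || continue                 # skip toplevel thunks (def isa Module)
        key = (def.module, def.name)
        push!(get!(() -> Core.MethodInstance[], groups, key), mi)
    end
    return groups
end
```

With the `mis` vector extracted via PkgCacheInspector above, `group_targets(mis)` would show how concentrated the targets are (e.g. the thousands of `Base.convert` and `Base.Broadcast.Broadcasted` targets seen for Makie).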
@topolarity I think method insertion is also largely about edges...when you insert a new method in an existing method table, you have to check any callers with edges to things that might be superseded by the new method. But I haven't looked at this in detail; is that consistent with what the profiling shows?
Yes, that is a way of saying something similar to my point above about "deferring work." Anything that can be deemed "internal" doesn't need to be checked; likewise, if you were in control of when a package got loaded, that effectively makes calls to that package "internal-izable", since it seems plausible we could cache the outcome of edge-checking rather than needing to do it at load time. It's only when a package got loaded sometime in the distant past (and …).

If someone wants to start tackling this, a potential battle plan would be to hack together a mode of …
It looks like we spend most of the method insertion time computing a few very expensive type intersections, actually; 78% (3.85s) of CUDA load time is spent in that path. Here's the trace file and in-browser viewer from loading CUDA.jl.
I bet $5 that …
I'm guessing …
Updated the CUDA trace to annotate …

Ah, in other words, it's exactly what we would expect.
Possibly dumb question, but what is the difference between …?
I don't mean to be pedantic, but we use intersection to determine, "is there any possible overlap between these methods?" If so, some things might need to be invalidated. See Lines 1967 to 1978 in 4a42367
The first line is your expensive intersection call, but the reason we need the intersection is so we can "invalidate if we find a nontrivial intersection." The type intersection has no other purpose than for use in evaluating invalidations.
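For illustration only (these signatures are made up, not the expensive ones CUDA hits), the overlap question reduces to a type intersection like this:

```julia
# Hypothetical signatures illustrating the "nontrivial intersection" check.
struct MyNum <: Real end

cached_sig = Tuple{typeof(convert), Type{Float64}, Real}    # a precompiled call
new_sig    = Tuple{typeof(convert), Type{Float64}, MyNum}   # a newly inserted method

# A non-empty intersection means the new method could apply to the cached call,
# so code compiled against the old method table may need to be invalidated.
typeintersect(cached_sig, new_sig) !== Union{}              # true
```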
Ah, I only meant to highlight that the computationally intensive overhead observed in (2) is not iterating over edges (which would be the case if we were bottlenecked by the invalidation itself, since that iterates over backedges). Edit: To be clear, the expensive type intersection that I observe is at Line 1804 in 4a42367.

The main product of all this compute is to have accurate invalidations, agreed, and for the reasons you give above the work mustn't be avoided. Thanks for clarifying; keep the corrections coming!
Maybe it would be helpful if I also gave one more bit of context:

…
Belated idea: while I've mostly thought in terms of "refactoring" to improve load performance, given that we have just two bottlenecks (and neither one of them is disk I/O), could we exploit these profiling results and get a quick win with multithreading? Kind of like @IanButterworth's addition of parallel package precompilation: it didn't make the compiler intrinsically faster, but parallelization reduced the pain considerably. There seem to be two potential levels we could add it at: either at the package level (from the Julia side, i.e., …) or …

One delicate item is locking, which I think is only an issue for method insertion (not edge verification).
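A rough sketch of what the lower-level variant could look like, written at the Julia level with a hypothetical `verify_one` standing in for the per-target check (the real check lives in the C runtime); it assumes the checks are read-only and independent:

```julia
# Hypothetical sketch of parallel target verification: distribute independent,
# read-only per-target checks across threads.
function verify_all(targets::Vector, verify_one::Function)
    valid = Vector{Bool}(undef, length(targets))
    Threads.@threads for i in eachindex(targets)
        valid[i] = verify_one(targets[i])    # verify_one is a stand-in, not a real API
    end
    return valid
end

# e.g. with a dummy check: verify_all(collect(1:10_000), _ -> true)
```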
I'd like to retract this statement now 🙂. @topolarity, it's been great seeing all the dividends from your work starting to land ❤️.
Now that #47184 is landed (🎉 ), load time has become a significant fraction of TTFX. At @staticfloat's recommendation, I wanted to share some early profiling results.
Command: `using CairoMakie`
Platform: x86_64-darwin
Total load time: 11.78s

The big contributors to load time appear to be:

1. `jl_verify_edges`, 56% (6.54s): These calls are fast, but we are severely strained by their quantity. In total, 180k external targets in CairoMakie and its dependencies trigger 180k calls to `jl_matching_methods`.
2. `jl_insert_methods`, 21% (2.44s): The slowest (>40ms, up to 500ms) calls to insert new methods make the biggest contribution here, accounting for more than 80% of the total call time.

The next major contributor would be `jl_restore_package_image_from_stream_` at 0.95s (8% of load time), but I haven't instrumented it in detail. Depending on I/O and dyld cache details, I've also measured up to 20% of load time spent in `jl_dlopen`. I doubt too much can be done to improve that number, and once caches are warm it typically falls to ~1%.

Tracing was done with Tracy. You can examine the trace file locally or in the browser.
To see what was instrumented, take a look at this commit.
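For a coarser but easily reproduced view (per package rather than per runtime function, and without any Tracy instrumentation), stock tooling can report where import time goes; CairoMakie is simply the example package from this issue:

```julia
# Per-package load-time breakdown using InteractiveUtils (ships with Julia 1.8+);
# the exact numbers will of course differ from the Tracy measurements above.
julia> using InteractiveUtils

julia> @time_imports using CairoMakie
```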