Add more debug timing to loading and precompilation. Investigate speed ups #60333
This is the product of asking Claude Opus 4.5 to investigate ways to speed up package precompilation and code loading.
I think the primary value is in the debug timing that has been added, though there may also be some insightful ideas. One of these is included: speeding up precompilation by skipping invalidation passes, since the loading session should be identical to the one the package was precompiled in.
Example for precompiling GLMakie (54s total)
Here's the summary:
Package Image Loading Performance Analysis
This document summarizes the performance characteristics of Julia's package image (pkgimage) loading system, based on instrumented timing analysis of src/staticdata.c.

Overview
Package image loading occurs in two main phases; the restore phase (jl_restore_package_image_from_stream) deserializes and reconstructs the image.

Timing Breakdown by Package Size
System Image (~188MB)
The system image is dominated by relocation processing due to the large number of pointers (1.6M + 7.7M entries) that need adjustment.
Large Packages (e.g., Plots ~38MB)
Large packages spend most time in uniquing operations - ensuring types and method instances match those already in the runtime.
Medium Packages (e.g., PlotUtils ~7MB)
Small Packages (JLLs, ~15-50KB)
JLL packages load in 0.05-0.15ms total, with time split between:
Key Bottlenecks
1. Relocations (System Image)
The jl_read_relocations function processes relocation lists that patch pointers in the loaded image. For the system image with ~7.8MB of relocation data (~9.3M total entries), this takes 45ms.

Current implementation: sequential processing with ios_read calls and pointer arithmetic.

Parallel relocation attempt: tested a two-phase approach (decode all entries, then apply in parallel with pthreads). Result: slower (45ms → 48ms) due to memory allocation overhead, thread creation/join cost, and cache effects from non-sequential access.
Potential optimizations:
2. Object Uniquing (Largest Bottleneck for Packages)
Method instances and bindings must be matched with existing runtime objects or created if new. For Plots with ~62K objects, this takes ~12ms.
Detailed breakdown for Plots:
Per-lookup cost: ~480ns per MethodInstance lookup/insertion via jl_specializations_get_linfo.

Current implementation: takes m->writelock for each insertion; lookup via jl_smallintset_lookup; new specializations created via jl_get_specialized.

Potential optimizations:
3. Type Uniquing (Large Packages)
The type uniquing phase ensures that types in the loaded image are unified with existing types in the runtime. For Plots with 49,413 types, this takes ~6ms.
Detailed breakdown for Plots:
Current implementation: holds typecache_lock for the entire loop; lookup via jl_lookup_cache_type_, which uses typekey_hash plus linear probing; new types allocated with jl_new_uninitialized_datatype.

Potential optimizations:
4. Method Activation (Packages with Many External Methods)
Packages that extend methods from other packages (external methods) have activation overhead. SparseArrays with 2,528 external methods takes 132ms to activate.
Current implementation: Sequential method table insertions with world age management.
Potential optimizations:
Data Sizes
Representative size breakdown for the system image:
Instrumentation
To enable detailed timing output for package loading, uncomment #define JL_DEBUG_LOADING near the top of src/staticdata.c and rebuild Julia. To enable detailed timing output for package cache generation (saving), uncomment #define JL_DEBUG_SAVING. Then rebuild with make -C src and run. This will print timing for each phase of image loading/saving, including detailed breakdowns of type and object uniquing operations. The instrumentation is compile-time guarded for zero overhead in release builds.
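The compile-time guard pattern can be sketched as follows. This is a simplified, hypothetical illustration (macro and function names here are stand-ins, not the actual instrumentation in src/staticdata.c):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Uncomment to enable timing output (stand-in for JL_DEBUG_LOADING). */
#define DEBUG_LOADING

#ifdef DEBUG_LOADING
#define LOAD_TIMING(...) fprintf(stderr, __VA_ARGS__)
#else
/* Expands to nothing: zero overhead in release builds. */
#define LOAD_TIMING(...) ((void)0)
#endif

/* Monotonic wall clock in milliseconds, suitable for phase timing. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Example phase wrapper: measure a phase and report it when enabled. */
static double timed_phase(void (*phase)(void), const char *name)
{
    double t0 = now_ms();
    phase();
    double elapsed = now_ms() - t0;
    LOAD_TIMING("%s: %.3f ms\n", name, elapsed);
    return elapsed;
}
```

Because the disabled branch expands to `((void)0)`, the format strings and calls disappear entirely in release builds, which is what "compile-time guarded" means here.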
Summary of Optimization Opportunities
Highest Impact: Method Activation (63% of total time)
The activate methods phase dominates package loading time, accounting for 206ms out of ~330ms total for using Plots. A single package (SparseArrays) takes 139ms, 67% of all method activation time.

Why SparseArrays is slow: it extends Base methods (arithmetic operators, indexing, etc.) that have large method tables. Each external method requires:
- get_intersect_matches: find all intersecting methods in Base's method tables
- jl_type_morespecific checks for each intersection
- jl_type_intersection2 checks for each MethodInstance

SparseArrays: 55μs per external method (2,528 methods)
Plots: 1.8μs per external method (2,388 methods)
The 30x difference comes from which methods are being extended - Base's core arithmetic has much larger method tables than plotting-specific methods.
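The scaling can be illustrated with a toy model. Everything below is a hypothetical stand-in (not Julia's actual data structures or intersection test): the point is only that activating one external method scans every entry of the target method table, so cost grows with that table's size.

```c
#include <stddef.h>

/* Toy method-table model: a flat array of signatures. */
typedef struct { size_t sig; } method_t;
typedef struct { method_t *defs; size_t len; } method_table_t;

/* Stand-in intersection test (shared low bits); the real check is a
 * full type-intersection computation and is far more expensive. */
static int signatures_intersect(size_t a, size_t b)
{
    return (a & b) != 0;
}

/* Returns the number of intersection checks performed (== table
 * length); out_matches counts entries that intersected, each of which
 * would then pay additional morespecific/ambiguity work. */
static size_t scan_for_intersections(const method_table_t *mt, size_t sig,
                                     size_t *out_matches)
{
    size_t checks = 0, matches = 0;
    for (size_t i = 0; i < mt->len; i++) {
        checks++;
        if (signatures_intersect(mt->defs[i].sig, sig))
            matches++;
    }
    if (out_matches)
        *out_matches = matches;
    return checks;
}
```

Under this model, a method added to a table with 30x more entries pays roughly 30x the scan cost, matching the SparseArrays-vs-Plots per-method ratio above.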
Potential optimizations:
Skip Invalidation During Precompilation (Implemented)
During incremental precompilation (jl_generating_output() && jl_options.incremental), method activation can skip the expensive invalidation checks, because the loading session should be identical to the one the package was precompiled in; dispatch_status and interferences are still computed.

Implementation: added a skip_invalidation flag in jl_method_table_activate (src/gf.c) that skips the backedge and cache invalidation passes (jl_type_intersection2 checks, _invalidate_dispatch_backedges, _typename_invalidate_backedges, invalidate_mt_cache, and jl_method_table_invalidate).

Results:
The improvement is modest because the type intersection and morespecific checks (which we keep for dispatch correctness) dominate the activation cost. The invalidation loops we skip are relatively cheap.
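A minimal sketch of the gating idea, with simplified stand-in functions and counters (this is not the actual jl_method_table_activate code, just the shape of the flag):

```c
#include <stdbool.h>

/* Counters let us observe which passes ran (for illustration only). */
static int n_invalidation_passes = 0;
static int n_dispatch_updates = 0;

static void compute_dispatch_status(void) { n_dispatch_updates++; }
static void invalidate_backedges(void)    { n_invalidation_passes++; }
static void invalidate_mt_cache(void)     { n_invalidation_passes++; }

static void method_table_activate(bool skip_invalidation)
{
    /* Always kept: dispatch_status / interference bookkeeping is
     * needed for correct dispatch even during precompilation. */
    compute_dispatch_status();
    if (skip_invalidation)
        return;  /* precompile session: no stale compiled code to invalidate */
    invalidate_backedges();
    invalidate_mt_cache();
}
```

The flag short-circuits only the invalidation loops; as noted above, those are relatively cheap, which is why the win is modest.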
Further Opportunities (Not Yet Implemented)
The remaining expensive operations in method activation are:
- get_intersect_matches: finds all methods in Base that intersect with the new method's signature. Required to compute dispatch_bits and update other methods' interferences.
- jl_type_morespecific loop: called for each intersecting method to determine ambiguity and update dispatch optimization flags.
- Interference set updates: updates both the new method's interferences and existing methods' interference sets.

Why these can't be skipped during precompilation:
- Skipping them causes MethodError during precompilation when dispatch relies on these sets to find the correct method.
- The interferences field is used in method sorting during dispatch to determine which methods should be considered.
- Code actually executes during precompilation (__init__, type inference), so dispatch must already be correct.

Tested and rejected: skipping interference set updates during precompilation caused MethodError failures; dispatch couldn't find methods that should have matched.

Potential future optimizations:
Medium Impact: Apply Relocations (16%)
52ms for all packages. Currently sequential processing.
Status: Parallel attempt failed due to memory allocation overhead and cache effects.
Remaining ideas:
Lower Impact: Uniquing Operations (12%)
Object uniquing: 24ms (7%)
Type uniquing: 15ms (5%)
Tested and rejected: Save-side sorting by method pointer made things 15% slower by destroying natural serialization locality.
Remaining ideas:
Lower Priority
Rejected Optimizations
Parallel Relocations
Attempted a two-phase approach: decode all relocation entries into a buffer, then apply in parallel using pthreads.
Result: Slower (45ms → 48ms) due to memory allocation overhead, thread creation/join cost, and cache effects from non-sequential access.
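The rejected two-phase scheme might look like the sketch below (illustrative stand-in code, not the actual src/staticdata.c implementation). Phase 1 (decoding the stream into the offsets buffer) is elided; the sketch shows phase 2, where slices of decoded relocations are patched in parallel. The per-image thread create/join and the extra decoded buffer are exactly the overheads that made this a net loss.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct {
    uintptr_t *image;   /* words to be patched */
    size_t *offsets;    /* decoded relocation targets (word indices) */
    uintptr_t base;     /* load-base adjustment to add */
    size_t begin, end;  /* slice of the offsets array for this thread */
} reloc_work_t;

/* Phase 2 worker: apply one contiguous slice of relocations. */
static void *apply_slice(void *argp)
{
    reloc_work_t *w = (reloc_work_t *)argp;
    for (size_t i = w->begin; i < w->end; i++)
        w->image[w->offsets[i]] += w->base;
    return NULL;
}

static void apply_relocations_parallel(uintptr_t *image, size_t *offsets,
                                       size_t n, uintptr_t base)
{
    pthread_t tids[NTHREADS];
    reloc_work_t work[NTHREADS];
    size_t chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        work[t].image = image;
        work[t].offsets = offsets;
        work[t].base = base;
        work[t].begin = (size_t)t * chunk;
        work[t].end = work[t].begin + chunk < n ? work[t].begin + chunk : n;
        pthread_create(&tids[t], NULL, apply_slice, &work[t]);
    }
    /* Thread create/join cost is paid once per image load. */
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tids[t], NULL);
}
```

For ~9.3M entries the per-entry work (one add) is so small that the decode buffer, thread startup, and non-sequential cache traffic outweigh the parallelism, consistent with the 45ms → 48ms regression.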
Save-side Sorting of uniquing_objs
Hypothesis: Sorting MIs by method pointer during serialization would improve cache locality during load, keeping each method's specializations hash table hot in CPU cache.
Implementation: Added parallel arraylist to store method pointers, sorted (offset, method) pairs before writing.
Results from the using Plots benchmark: ~15% slower than master.

Why it failed: the identical mi_done/mi_lookup ratios show that sorting doesn't change the "already done" cache hit rate. The sorting appears to have destroyed natural locality that existed in the original serialization order. Objects are serialized in traversal order, which likely groups related items together. Sorting by method pointer spreads out items that were naturally co-located.
Lesson learned: The serialization traversal order may already have good cache locality properties that shouldn't be disturbed.
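The save-side sort can be sketched as a qsort over (offset, method) pairs. The struct and function names below are illustrative only; the actual change used a parallel arraylist alongside the existing serializer state.

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* One uniquing entry: where the object lives in the image, and which
 * method it belongs to. */
typedef struct {
    size_t offset;
    uintptr_t method;   /* method pointer, used as the sort key */
} uniq_pair_t;

static int cmp_by_method(const void *pa, const void *pb)
{
    const uniq_pair_t *a = pa;
    const uniq_pair_t *b = pb;
    if (a->method != b->method)
        return a->method < b->method ? -1 : 1;
    /* Tie-break on offset for a deterministic order. */
    return (a->offset > b->offset) - (a->offset < b->offset);
}

static void sort_uniquing_objs(uniq_pair_t *pairs, size_t n)
{
    /* Replaces the traversal order with method-pointer order; per the
     * benchmark above, this destroyed locality rather than creating it. */
    qsort(pairs, n, sizeof(uniq_pair_t), cmp_by_method);
}
```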
Batch Method Activation by Method Table
Hypothesis: sorting external methods by their method table before activation would improve cache locality when accessing mt->defs for intersection checks.

Implementation: added qsort to order entries by method table pointer before the activation loop.
Result: 10% slower than master. The sorting overhead outweighed any cache benefits, and likely destroyed beneficial natural ordering (methods are serialized in dependency order which may naturally group related activations).
Lesson learned: Same as above - the natural serialization order has good properties.
Performance Characteristics
Object uniquing rate: ~5.2K objects/ms (~195ns per object average; ~480ns per MethodInstance via jl_specializations_get_linfo)
Type uniquing rate: ~8.7K types/ms (~115ns per type average)