Skip to content

Commit 2f1ed06

Browse files
NHDalyvchuravy
authored andcommitted
Allocation Profiler: Types for all allocations (JuliaLang#50337) (#34)
Backports PR JuliaLang#50337 for RAI julia v1.9.2 Original description: =================== Pass the types to the allocator functions. ------- Before this PR, we were missing the types for allocations in two cases: 1. allocations from codegen 2. allocations in `gc_managed_realloc_` The second one is easy: those are always used for buffers, right? For the first one: we extend the allocation functions called from codegen, to take the type as a parameter, and set the tag there. I kept the old interfaces around, since I think that they cannot be removed due to supporting legacy code? ------ An example of the generated code: ```julia %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2 %13 = bitcast {}*** %ptls_field6 to i8** %ptls_load78 = load i8*, i8** %13, align 8 %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7 ``` Fixes JuliaLang#43688. Fixes JuliaLang#45268. Co-authored-by: Valentin Churavy <[email protected]>
1 parent 06d82fa commit 2f1ed06

File tree

13 files changed

+242
-53
lines changed

13 files changed

+242
-53
lines changed

doc/src/manual/profile.md

Lines changed: 116 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -310,7 +310,122 @@ the amount of memory allocated by each line of code.
310310

311311
### Line-by-Line Allocation Tracking
312312

313-
To measure allocation line-by-line, start Julia with the `--track-allocation=<setting>` command-line
313+
While [`@time`](@ref) logs high-level stats about memory usage and garbage collection over the course
314+
of evaluating an expression, it can be useful to log each garbage collection event, to get an
315+
intuitive sense of how often the garbage collector is running, how long it's running each time,
316+
and how much garbage it collects each time. This can be enabled with
317+
[`GC.enable_logging(true)`](@ref), which causes Julia to log to stderr every time
318+
a garbage collection happens.
319+
320+
### [Allocation Profiler](@id allocation-profiler)
321+
322+
!!! compat "Julia 1.8"
323+
This functionality requires at least Julia 1.8.
324+
325+
The allocation profiler records the stack trace, type, and size of each
326+
allocation while it is running. It can be invoked with
327+
[`Profile.Allocs.@profile`](@ref).
328+
329+
This information about the allocations is returned as an array of `Alloc`
330+
objects, wrapped in an `AllocResults` object. The best way to visualize these is
331+
currently with the [PProf.jl](https://github.com/JuliaPerf/PProf.jl) and
332+
[ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl) packages, which
333+
can visualize the call stacks which are making the most allocations.
334+
335+
The allocation profiler does have significant overhead, so a `sample_rate`
336+
argument can be passed to speed it up by making it skip some allocations.
337+
Passing `sample_rate=1.0` will make it record everything (which is slow);
338+
`sample_rate=0.1` will record only 10% of the allocations (faster), etc.
339+
340+
!!! compat "Julia 1.11"
341+
342+
Older versions of Julia could not capture types in all cases. In older versions of
343+
Julia, if you see an allocation of type `Profile.Allocs.UnknownType`, it means that
344+
the profiler doesn't know what type of object was allocated. This mainly happened when
345+
the allocation was coming from generated code produced by the compiler. See
346+
[issue #43688](https://github.com/JuliaLang/julia/issues/43688) for more info.
347+
348+
Since Julia 1.11, all allocations should have a type reported.
349+
350+
For more details on how to use this tool, please see the following talk from JuliaCon 2022:
351+
https://www.youtube.com/watch?v=BFvpwC8hEWQ
352+
353+
##### Allocation Profiler Example
354+
355+
In this simple example, we use PProf to visualize the alloc profile. You could use another
356+
visualization tool instead. We collect the profile (specifying a sample rate), then we visualize it.
357+
```julia
358+
using Profile, PProf
359+
Profile.Allocs.clear()
360+
Profile.Allocs.@profile sample_rate=0.0001 my_function()
361+
PProf.Allocs.pprof()
362+
```
363+
364+
Here is a more in-depth example, showing how we can tune the sample rate. A
365+
good number of samples to aim for is around 1 - 10 thousand. Too many, and the
366+
profile visualizer can get overwhelmed, and profiling will be slow. Too few,
367+
and you don't have a representative sample.
368+
369+
370+
```julia-repl
371+
julia> import Profile
372+
373+
julia> @time my_function() # Estimate allocations from a (second-run) of the function
374+
0.110018 seconds (1.50 M allocations: 58.725 MiB, 17.17% gc time)
375+
500000
376+
377+
julia> Profile.Allocs.clear()
378+
379+
julia> Profile.Allocs.@profile sample_rate=0.001 begin # 1.5 M * 0.001 = ~1.5K allocs.
380+
my_function()
381+
end
382+
500000
383+
384+
julia> prof = Profile.Allocs.fetch(); # If you want, you can also manually inspect the results.
385+
386+
julia> length(prof.allocs) # Confirm we have expected number of allocations.
387+
1515
388+
389+
julia> using PProf # Now, visualize with an external tool, like PProf or ProfileCanvas.
390+
391+
julia> PProf.Allocs.pprof(prof; from_c=false) # You can optionally pass in a previously fetched profile result.
392+
Analyzing 1515 allocation samples... 100%|████████████████████████████████| Time: 0:00:00
393+
Main binary filename not available.
394+
Serving web UI on http://localhost:62261
395+
"alloc-profile.pb.gz"
396+
```
397+
Then you can view the profile by navigating to http://localhost:62261, and the profile is saved to disk.
398+
See PProf package for more options.
399+
400+
##### Allocation Profiling Tips
401+
402+
As stated above, aim for around 1-10 thousand samples in your profile.
403+
404+
Note that we are uniformly sampling in the space of _all allocations_, and are not weighting
405+
our samples by the size of the allocation. So a given allocation profile may not give a
406+
representative profile of where most bytes are allocated in your program, unless you had set
407+
`sample_rate=1`.
408+
409+
Allocations can come from users directly constructing objects, but can also come from inside
410+
the runtime or be inserted into compiled code to handle type instability. Looking at the
411+
"source code" view can be helpful to isolate them, and then other external tools such as
412+
[`Cthulhu.jl`](https://github.com/JuliaDebug/Cthulhu.jl) can be useful for identifying the
413+
cause of the allocation.
414+
415+
##### Allocation Profile Visualization Tools
416+
417+
There are several profiling visualization tools now that can all display Allocation
418+
Profiles. Here is a small list of some of the main ones we know about:
419+
- [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
420+
- [ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl)
421+
- VSCode's built-in profile visualizer (`@profview_allocs`) [docs needed]
422+
- Viewing the results directly in the REPL
423+
- You can inspect the results in the REPL via [`Profile.Allocs.fetch()`](@ref), to view
424+
the stacktrace and type of each allocation.
425+
426+
#### Line-by-Line Allocation Tracking
427+
428+
An alternative way to measure allocations is to start Julia with the `--track-allocation=<setting>` command-line
314429
option, for which you can choose `none` (the default, do not measure allocation), `user` (measure
315430
memory allocation everywhere except Julia's core code), or `all` (measure memory allocation at
316431
each line of Julia code). Allocation gets measured for each line of compiled code. When you quit

src/gc-alloc-profiler.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t
3535

3636
extern int g_alloc_profile_enabled;
3737

38+
// This should only be used from _deprecated_ code paths. We shouldn't see UNKNOWN anymore.
3839
#define jl_gc_unknown_type_tag ((jl_datatype_t*)0xdeadaa03)
3940

4041
static inline void maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t *typ) JL_NOTSAFEPOINT {

src/gc.c

Lines changed: 21 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1177,14 +1177,21 @@ static inline jl_value_t *jl_gc_big_alloc_inner(jl_ptls_t ptls, size_t sz)
11771177
return jl_valueof(&v->header);
11781178
}
11791179

1180-
// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
1180+
// Deprecated version, supported for legacy code.
11811181
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc(jl_ptls_t ptls, size_t sz)
11821182
{
11831183
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
11841184

11851185
maybe_record_alloc_to_profile(val, sz, jl_gc_unknown_type_tag);
11861186
return val;
11871187
}
1188+
// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
1189+
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc_instrumented(jl_ptls_t ptls, size_t sz, jl_value_t *type)
1190+
{
1191+
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
1192+
maybe_record_alloc_to_profile(val, sz, (jl_datatype_t*)type);
1193+
return val;
1194+
}
11881195

11891196
// This wrapper exists only to prevent `jl_gc_big_alloc_inner` from being inlined into
11901197
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_big_alloc_inner`
@@ -1494,7 +1501,7 @@ static inline jl_value_t *jl_gc_pool_alloc_inner(jl_ptls_t ptls, int pool_offset
14941501
return jl_valueof(v);
14951502
}
14961503

1497-
// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
1504+
// Deprecated version, supported for legacy code.
14981505
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
14991506
int osize)
15001507
{
@@ -1503,6 +1510,14 @@ JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
15031510
maybe_record_alloc_to_profile(val, osize, jl_gc_unknown_type_tag);
15041511
return val;
15051512
}
1513+
// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
1514+
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc_instrumented(jl_ptls_t ptls, int pool_offset,
1515+
int osize, jl_value_t* type)
1516+
{
1517+
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
1518+
maybe_record_alloc_to_profile(val, osize, (jl_datatype_t*)type);
1519+
return val;
1520+
}
15061521

15071522
// This wrapper exists only to prevent `jl_gc_pool_alloc_inner` from being inlined into
15081523
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_pool_alloc_inner`
@@ -4054,7 +4069,10 @@ static void *gc_managed_realloc_(jl_ptls_t ptls, void *d, size_t sz, size_t olds
40544069
SetLastError(last_error);
40554070
#endif
40564071
errno = last_errno;
4057-
maybe_record_alloc_to_profile((jl_value_t*)b, sz, jl_gc_unknown_type_tag);
4072+
// gc_managed_realloc_ is currently used exclusively for resizing array buffers.
4073+
if (allocsz > oldsz) {
4074+
maybe_record_alloc_to_profile((jl_value_t*)b, allocsz - oldsz, (jl_datatype_t*)jl_buff_tag);
4075+
}
40584076
return b;
40594077
}
40604078

src/jl_exported_funcs.inc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,7 @@
160160
XX(jl_gc_alloc_3w) \
161161
XX(jl_gc_alloc_typed) \
162162
XX(jl_gc_big_alloc) \
163+
XX(jl_gc_big_alloc_instrumented) \
163164
XX(jl_gc_collect) \
164165
XX(jl_gc_conservative_gc_support_enabled) \
165166
XX(jl_gc_counted_calloc) \
@@ -186,6 +187,7 @@
186187
XX(jl_gc_new_weakref_th) \
187188
XX(jl_gc_num) \
188189
XX(jl_gc_pool_alloc) \
190+
XX(jl_gc_pool_alloc_instrumented) \
189191
XX(jl_gc_queue_multiroot) \
190192
XX(jl_gc_queue_root) \
191193
XX(jl_gc_safepoint) \

src/llvm-final-gc-lowering.cpp

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -205,12 +205,13 @@ Value *FinalLowerGC::lowerQueueGCBinding(CallInst *target, Function &F)
205205
Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
206206
{
207207
++GCAllocBytesCount;
208-
assert(target->arg_size() == 2);
208+
assert(target->arg_size() == 3);
209209
CallInst *newI;
210210

211211
IRBuilder<> builder(target);
212212
builder.SetCurrentDebugLocation(target->getDebugLoc());
213213
auto ptls = target->getArgOperand(0);
214+
auto type = target->getArgOperand(2);
214215
Attribute derefAttr;
215216

216217
if (auto CI = dyn_cast<ConstantInt>(target->getArgOperand(1))) {
@@ -221,19 +222,19 @@ Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
221222
if (offset < 0) {
222223
newI = builder.CreateCall(
223224
bigAllocFunc,
224-
{ ptls, ConstantInt::get(getSizeTy(F.getContext()), sz + sizeof(void*)) });
225+
{ ptls, ConstantInt::get(getSizeTy(F.getContext()), sz + sizeof(void*)), type });
225226
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sz + sizeof(void*));
226227
}
227228
else {
228229
auto pool_offs = ConstantInt::get(Type::getInt32Ty(F.getContext()), offset);
229230
auto pool_osize = ConstantInt::get(Type::getInt32Ty(F.getContext()), osize);
230-
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize });
231+
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize, type });
231232
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), osize);
232233
}
233234
} else {
234235
auto size = builder.CreateZExtOrTrunc(target->getArgOperand(1), getSizeTy(F.getContext()));
235236
size = builder.CreateAdd(size, ConstantInt::get(getSizeTy(F.getContext()), sizeof(void*)));
236-
newI = builder.CreateCall(allocTypedFunc, { ptls, size, ConstantPointerNull::get(Type::getInt8PtrTy(F.getContext())) });
237+
newI = builder.CreateCall(allocTypedFunc, { ptls, size, type });
237238
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sizeof(void*));
238239
}
239240
newI->setAttributes(newI->getCalledFunction()->getAttributes());

src/llvm-late-gc-lowering.cpp

Lines changed: 30 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -2324,22 +2324,6 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
23242324
IRBuilder<> builder(CI);
23252325
builder.SetCurrentDebugLocation(CI->getDebugLoc());
23262326

2327-
// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
2328-
// `julia.gc_alloc_obj` except it doesn't set the tag.
2329-
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
2330-
auto ptlsLoad = get_current_ptls_from_task(builder, CI->getArgOperand(0), tbaa_gcframe);
2331-
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
2332-
auto newI = builder.CreateCall(
2333-
allocBytesIntrinsic,
2334-
{
2335-
ptls,
2336-
builder.CreateIntCast(
2337-
CI->getArgOperand(1),
2338-
allocBytesIntrinsic->getFunctionType()->getParamType(1),
2339-
false)
2340-
});
2341-
newI->takeName(CI);
2342-
23432327
// LLVM alignment/bit check is not happy about addrspacecast and refuse
23442328
// to remove write barrier because of it.
23452329
// We pretty much only load using `T_size` so try our best to strip
@@ -2378,7 +2362,36 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
23782362
builder.CreateAlignmentAssumption(DL, tag, 16);
23792363
}
23802364
}
2381-
// Set the tag.
2365+
2366+
// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
2367+
// `julia.gc_alloc_obj` except it specializes the call based on the constant
2368+
// size of the object to allocate, to save one indirection, and doesn't set
2369+
// the type tag. (Note that if the size is not a constant, it will call
2370+
// gc_alloc_obj, and will redundantly set the tag.)
2371+
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
2372+
auto ptlsLoad = get_current_ptls_from_task(builder, CI->getArgOperand(0), tbaa_gcframe);
2373+
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
2374+
auto newI = builder.CreateCall(
2375+
allocBytesIntrinsic,
2376+
{
2377+
ptls,
2378+
builder.CreateIntCast(
2379+
CI->getArgOperand(1),
2380+
allocBytesIntrinsic->getFunctionType()->getParamType(1),
2381+
false),
2382+
builder.CreatePtrToInt(tag, T_size),
2383+
});
2384+
newI->takeName(CI);
2385+
2386+
// Now, finally, set the tag. We do this in IR instead of in the C alloc
2387+
// function, to provide possible optimization opportunities. (I think? TBH
2388+
// the most recent editor of this code is not entirely clear on why we
2389+
// prefer to set the tag in the generated code. Providing optimziation
2390+
// opportunities is the most likely reason; the tradeoff is slightly
2391+
// larger code size and increased compilation time, compiling this
2392+
// instruction at every allocation site, rather than once in the C alloc
2393+
// function.)
2394+
auto &M = *builder.GetInsertBlock()->getModule();
23822395
StoreInst *store = builder.CreateAlignedStore(
23832396
tag, EmitTagPtr(builder, tag_type, newI), Align(sizeof(size_t)));
23842397
store->setOrdering(AtomicOrdering::Unordered);

src/llvm-pass-helpers.cpp

Lines changed: 21 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,12 @@ namespace jl_intrinsics {
120120
static const char *QUEUE_GC_ROOT_NAME = "julia.queue_gc_root";
121121
static const char *QUEUE_GC_BINDING_NAME = "julia.queue_gc_binding";
122122

123+
static auto T_size_t(const JuliaPassContext &context) {
124+
return sizeof(size_t) == sizeof(uint32_t) ?
125+
Type::getInt32Ty(context.getLLVMContext()) :
126+
Type::getInt64Ty(context.getLLVMContext());
127+
}
128+
123129
// Annotates a function with attributes suitable for GC allocation
124130
// functions. Specifically, the return value is marked noalias and nonnull.
125131
// The allocation size is set to the first argument.
@@ -150,9 +156,8 @@ namespace jl_intrinsics {
150156
FunctionType::get(
151157
context.T_prjlvalue,
152158
{ Type::getInt8PtrTy(context.getLLVMContext()),
153-
sizeof(size_t) == sizeof(uint32_t) ?
154-
Type::getInt32Ty(context.getLLVMContext()) :
155-
Type::getInt64Ty(context.getLLVMContext()) },
159+
T_size_t(context),
160+
T_size_t(context) }, // type
156161
false),
157162
Function::ExternalLinkage,
158163
GC_ALLOC_BYTES_NAME);
@@ -227,12 +232,18 @@ namespace jl_intrinsics {
227232
}
228233

229234
namespace jl_well_known {
230-
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc);
231-
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc);
235+
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc_instrumented);
236+
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc_instrumented);
232237
static const char *GC_QUEUE_ROOT_NAME = XSTR(jl_gc_queue_root);
233238
static const char *GC_QUEUE_BINDING_NAME = XSTR(jl_gc_queue_binding);
234239
static const char *GC_ALLOC_TYPED_NAME = XSTR(jl_gc_alloc_typed);
235240

241+
static auto T_size_t(const JuliaPassContext &context) {
242+
return sizeof(size_t) == sizeof(uint32_t) ?
243+
Type::getInt32Ty(context.getLLVMContext()) :
244+
Type::getInt64Ty(context.getLLVMContext());
245+
}
246+
236247
using jl_intrinsics::addGCAllocAttributes;
237248

238249
const WellKnownFunctionDescription GCBigAlloc(
@@ -242,9 +253,8 @@ namespace jl_well_known {
242253
FunctionType::get(
243254
context.T_prjlvalue,
244255
{ Type::getInt8PtrTy(context.getLLVMContext()),
245-
sizeof(size_t) == sizeof(uint32_t) ?
246-
Type::getInt32Ty(context.getLLVMContext()) :
247-
Type::getInt64Ty(context.getLLVMContext()) },
256+
T_size_t(context),
257+
T_size_t(context) },
248258
false),
249259
Function::ExternalLinkage,
250260
GC_BIG_ALLOC_NAME);
@@ -258,7 +268,7 @@ namespace jl_well_known {
258268
auto poolAllocFunc = Function::Create(
259269
FunctionType::get(
260270
context.T_prjlvalue,
261-
{ Type::getInt8PtrTy(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()) },
271+
{ Type::getInt8PtrTy(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), T_size_t(context) },
262272
false),
263273
Function::ExternalLinkage,
264274
GC_POOL_ALLOC_NAME);
@@ -301,10 +311,8 @@ namespace jl_well_known {
301311
FunctionType::get(
302312
context.T_prjlvalue,
303313
{ Type::getInt8PtrTy(context.getLLVMContext()),
304-
sizeof(size_t) == sizeof(uint32_t) ?
305-
Type::getInt32Ty(context.getLLVMContext()) :
306-
Type::getInt64Ty(context.getLLVMContext()),
307-
Type::getInt8PtrTy(context.getLLVMContext()) },
314+
T_size_t(context),
315+
T_size_t(context) }, // type
308316
false),
309317
Function::ExternalLinkage,
310318
GC_ALLOC_TYPED_NAME);

0 commit comments

Comments
 (0)