Increase frame buffer to max 1024 frames per trace #908

Closed

dalehamel wants to merge 1 commit into open-telemetry:main from Shopify:bump-frame-buffer-to-1024

Conversation

@dalehamel (Contributor) commented Oct 31, 2025

Most Ruby / Rails apps have deep stacks. As part of my efforts to make the otel eBPF profiler work with Ruby, I'm requesting an increase of the maximum number of frames per trace to 1024, which fixes #760.

This is particularly needed for Ruby, as it is very common to have wrapper functions in frameworks like Rails, such as request middlewares that each implement `call(env)`. When Ruby and native frames are intermixed, this becomes even more problematic.

This was originally part of #907, but I'm pulling it out into a separate PR at @florianl's request so it can be discussed separately and the memory implications for other interpreters / users can be considered.

Per @fabled's thoughts in #760, though, I'm hoping this change is reasonable:

> Having said this, it probably would be OK to bump the limit. And languages like Ruby would see instant improvement on it (but e.g. Java would not get the full benefit).

@dalehamel dalehamel requested review from a team as code owners October 31, 2025 20:08
@ivoanjo commented Nov 3, 2025

+1, this is quite valuable for improving Ruby profiling's out-of-the-box usability.

Our experience at Datadog for our Ruby profiler matches @dalehamel's experience of Ruby apps consistently having deeper stacks than most other languages/runtimes. This is why the Datadog Ruby profiler defaults to 400 frames, and I know of customers that needed to increase this further.

Comment thread on support/ebpf/types.h:

```diff
-#define MAX_FRAME_UNWINDS 128
+// 1024 is the maximum power of 2 we can fit in a single perf event, and
+// in a single per CPU array entry
+#define MAX_FRAME_UNWINDS 1024
```
@florianl (Member) commented Nov 5, 2025

Not every environment runs Ruby applications, so not everyone will directly benefit from this change. Since this change requires eight times more memory for every trace reported from kernel to user space, did you measure its impact for various workloads?

A contributor replied:

The link you shared is the race sanitizer workaround for exactly that fact (and it applies only to the race sanitizer): we do not send the whole trace to user space. We should not be sending empty frames:

```c
const u64 send_size = sizeof(Trace) - sizeof(Frame) * num_empty_frames;
```
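For context, a minimal sketch of what such a truncated send looks like in eBPF. The type names, field layout, and the `trace_events` perf event map are assumptions for illustration, not the profiler's actual identifiers:

```c
// Sketch only: send just the populated prefix of the trace to user space.
typedef struct { u64 file_id, addr_or_line; } Frame;
typedef struct {
  u32 stack_len;                    // frames actually unwound
  Frame frames[MAX_FRAME_UNWINDS];  // fixed-size buffer; the tail may be empty
} Trace;

static inline void report_trace(void *ctx, Trace *trace) {
  u64 num_empty_frames = MAX_FRAME_UNWINDS - trace->stack_len;
  const u64 send_size = sizeof(Trace) - sizeof(Frame) * num_empty_frames;
  // bpf_perf_event_output() copies only send_size bytes, so the empty tail
  // of the frame buffer never crosses the kernel/user-space boundary.
  bpf_perf_event_output(ctx, &trace_events, BPF_F_CURRENT_CPU, trace, send_size);
}
```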

@florianl (Member) replied:

Sorry, I mixed up the race sanitizer workaround. As this was changed only recently, it was the first thing that popped into my head.

Another component that will grow with this increase is the per_cpu_records eBPF map.

How close do we actually get to unwinding 1024 frames? Without further changes, we might be limited:

| Unwinder | Unwinder limit | Unwinder limit * tail call limit |
| --- | --- | --- |
| native | NATIVE_FRAMES_PER_PROGRAM = 4 | 4 * 32 = 128 |
| perl | PERL_FRAMES_PER_PROGRAM = 12 | 12 * 32 = 384 |
| php | FRAMES_PER_WALK_PHP_STACK = 19 | 19 * 32 = 608 |
| python | FRAMES_PER_WALK_PYTHON_STACK = 12 | 12 * 32 = 384 |
| ruby | FRAMES_PER_WALK_RUBY_STACK = 27 | 27 * 32 = 864 |
| v8 | V8_FRAMES_PER_PROGRAM = 8 | 8 * 32 = 256 |
| dotnet | DOTNET_FRAMES_PER_PROGRAM = 6 | 6 * 32 = 192 |

With further jumps between the different unwinders, the actual number might be lower.
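For reference, the ceiling arithmetic in the table can be reproduced with a few lines of C. This is a sketch: the 32 is the chained tail-call budget assumed in the table, while the kernel's own hard cap is MAX_TAIL_CALL_CNT.

```c
// Each unwinder walks at most per_program frames per eBPF program run,
// then tail-calls itself, so depth <= per_program * tail-call budget.
#include <stdio.h>

int main(void) {
  const struct { const char *name; int per_program; } unwinders[] = {
      {"native", 4}, {"perl", 12}, {"php", 19}, {"python", 12},
      {"ruby", 27},  {"v8", 8},    {"dotnet", 6},
  };
  for (size_t i = 0; i < sizeof(unwinders) / sizeof(unwinders[0]); i++)
    printf("%-8s %4d\n", unwinders[i].name, unwinders[i].per_program * 32);
  return 0;
}
```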

@dalehamel (Contributor, Author) replied:

@florianl yeah, so just about everything is potentially truncated right now.

> Another component that will grow with this increase is the per_cpu_records eBPF map.

I picked 1024 because it is the largest power of 2 that still fits into a single perf event, which is the limit for the per-CPU record; any higher power of 2 won't fit. There is a discussion about how to potentially go higher than that, as the per-CPU record is currently an array of size 1, but it doesn't necessarily need to be.

It is actually half the buffer size of other Ruby profilers (stackprof and vernier allow a buffer of 2048, believe it or not, and rbspy is theoretically unlimited). My PR #907 raises the number of frames per tail call to 32, so the theoretical maximum for Ruby would actually be ~32 * 32 = 1024.

I don't believe we will really be wasting any memory here, as profilers that don't write this many frames won't really touch the extra space. The per-CPU array entry size does increase to something like 24 KiB, but that isn't much of an increase for the system overall, even with many processors. The benefit of having untruncated stacks is, I think, worth it.

On the Go side, we should only end up transferring as much data as was actually written, so if memory use increases it should be because we are legitimately sending more (useful) data to user space.
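To make the arithmetic concrete: the 24-byte frame record below is an assumption inferred from the ~24 KiB figure above, not the profiler's actual layout.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FRAME_UNWINDS 1024

// Hypothetical 24-byte frame record; the real layout may differ.
typedef struct {
  uint64_t file_id;
  uint64_t addr_or_line;
  uint64_t kind;
} FrameSketch;

// 1024 frames * 24 B = 24 KiB, which still fits in a per-CPU map value
// (commonly capped at 32 KiB); 2048 frames (48 KiB) would not.
static_assert(sizeof(FrameSketch) * MAX_FRAME_UNWINDS == 24 * 1024,
              "1024 frames at 24 bytes each is 24 KiB");
```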

@fabled (Contributor) commented Nov 6, 2025

I think this should be done together with fixing #940. There we can just have the variable-length uint64_t buffer hold the maximum number of elements that fit in a perf event. That PR should also fix the Trace caching so that there is no huge GC pressure on the agent side. Currently each Trace and []Frame slice is probably allocated separately, without reusing data structures; I suspect this is a big GC pressure point right now.
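A rough sketch of what such a variable-length layout could look like; the flexible array member and all names here are assumptions about #940's approach, not its final design:

```c
#include <stdint.h>

typedef struct {
  uint64_t file_id;
  uint64_t addr_or_line;
} FrameSketch;

// Sketch: a fixed header followed by only the frames actually captured.
// The consumer sizes each event as
//   sizeof(TraceVarLenSketch) + stack_len * sizeof(FrameSketch).
typedef struct {
  uint64_t stack_len;    // number of valid frames that follow
  FrameSketch frames[];  // flexible array member, sized per event
} TraceVarLenSketch;
```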

@dalehamel (Contributor, Author) commented:

> I think this should be done together with fixing #940. [..] I suspect this is a big GC pressure point right now.

Yeah, I agree. From the sound of it, that's a better solution overall.

@florianl (Member) commented:

> [..] this should be done together when fixing #940 [..]

Can we then close this PR?

@dalehamel (Contributor, Author) commented:

> [..] this should be done together when fixing #940 [..]
>
> Can we then close this PR?

Yup, I'll close it and wait for #943.

Development

Successfully merging this pull request may close these issues:

Max stack depth of 128 is insufficient for a variety of cases

5 participants