[PROF-13510] Heap profiling for ruby 4.x #5201
Conversation
Benchmarks (2026-01-29 12:15:52): Comparing candidate commit a64a546 in PR branch. Found 0 performance improvements and 1 performance regression. Performance is the same for 43 metrics, 2 unstable metrics. Regressed scenario: profiling - intern_all 1000 repeated strings.
Reminder that to measure this, we should do ON/OFF. Some of the cost will be in the VM itself.
✅ Tests: All green! ❄️ No new flaky tests detected. 🎯 Code Coverage. 🔗 Commit SHA: a64a546
ivoanjo
left a comment
I ran out of time today, here's what I got so far! In particular, I still need to stare at heap_recorder.c for longer, I'm not yet convinced it's correct, I'm seeing some state getting carried across calls that I'm not confident is right.
The current notes are mostly small-ish stuff, other than the unneeded extra overhead (which I believe should be easy to fix) and the code exposing the heap recorder directly to the cpu and wall collector, which ideally I'd like to avoid as well if possible.
#ifdef DEFERRED_HEAP_ALLOCATION_RECORDING
static rb_postponed_job_handle_t finalize_heap_allocation_from_postponed_job_handle;
#endif
Minor: More as a style thing, in general I've avoided overly-ifdefing things out when they're harmless.
That is, the cost of a leftover extra field on Rubies that don't need it is so small that I usually prefer the advantage of less code and code that's easier to reason about due to less ifdef branching. (Same for most spots in this file)
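As a hypothetical illustration of that trade-off (not code from this PR), the handle quoted above could simply be declared unconditionally and left unused on Rubies that never trigger the postponed job:

```c
// Sketch only: declare the handle on all Rubies instead of #ifdef-ing it out.
// The unused variable costs a few bytes on legacy Rubies, but readers no longer
// have to track which build configuration the declaration exists in.
static rb_postponed_job_handle_t finalize_heap_allocation_from_postponed_job_handle;
```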
| context "on Ruby 4.0 or newer" do | ||
| let(:testing_version) { "4.0.0" } | ||
|
|
||
| it "initializes StackRecorder without heap sampling support and warns" do | ||
| before do | ||
| settings.profiling.allocation_enabled = true | ||
| allow(logger).to receive(:debug) | ||
| end | ||
|
|
||
| it "initializes StackRecorder with heap sampling support" do | ||
| expect(Datadog::Profiling::StackRecorder).to receive(:new) | ||
| .with(hash_including(heap_samples_enabled: false, heap_size_enabled: false)) | ||
| .with(hash_including(heap_samples_enabled: true, heap_size_enabled: true)) | ||
| .and_call_original | ||
|
|
||
| expect(logger).to receive(:warn).with(/Datadog Ruby heap profiler is currently incompatible with Ruby 4/) | ||
|
|
||
| build_profiler_component |
If we do add an extra setting to toggle the new code, this testcase is still worth keeping. On the other hand, if we don't, then this test-case has become redundant: since Ruby 4 is no longer special, the existing tests below already cover this situation. (The test was added exactly because there was an exception for Ruby 4)
#ifdef DEFERRED_HEAP_ALLOCATION_RECORDING
// A pending recording is used to defer the object_id call on Ruby 4+
// where calling rb_obj_id during on_newobj_event is unsafe.
typedef struct {
  VALUE object_ref;
  heap_record *heap_record;
  live_object_data object_data;
} pending_recording;

#define MAX_PENDING_RECORDINGS 64
#endif
Minor: The same note about not overly if-def'ing things from the cpu and wall collector I believe applies here as well
VALUE obj = heap_recorder->pending_recordings[i].object_ref;
if (obj != Qnil) {
  rb_gc_mark(obj);
}
The object can't ever be Qnil (or if it is, we have a bug...), since it was set from an allocation and Qnil is not a heap-allocated object. (It's a tagged pointer)
if (heap_recorder->active_deferred_object != Qnil) {
  rb_gc_mark(heap_recorder->active_deferred_object);
}
Minor: This can indeed trivially be Qnil; but it's OK to mark Qnil, so maybe remove the branch anyway? Less code ;)
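For illustration (a sketch, not the PR's code): because Qnil is a special constant rather than a heap object, `rb_gc_mark` simply ignores it, so the marking could be written without the branch. The helper name is hypothetical:

```c
// Sketch: rb_gc_mark is a no-op for special constants (Qnil, Qtrue, Fixnums, ...),
// so marking unconditionally is safe whether or not a deferred object is active.
static void heap_recorder_mark_deferred(heap_recorder *recorder) {
  rb_gc_mark(recorder->active_deferred_object);
}
```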
feels weird to do so ^^
ivoanjo
left a comment
Ok here's my comments from the full pass :)
I think the key change needed is to make sure we call during_sample_enter to avoid any other parts of the profiler firing in the middle of finalization.
Having said that, especially on the heap profiler I'm not a huge fan of some of the duplication -- that code is ultra-fiddly, and I worry that having complex logic duplicated across if-defs makes it easy to forget to update one of the versions.
static void finalize_heap_allocation_from_postponed_job(DDTRACE_UNUSED void *_unused) {
  cpu_and_wall_time_worker_state *state = active_sampler_instance_state;

  if (state == NULL) return;

  if (!ddtrace_rb_ractor_main_p()) {
    return;
  }

  // Get the heap_recorder from the thread_context_collector
  heap_recorder *recorder = thread_context_collector_get_heap_recorder(state->thread_context_collector_instance);
Ah, on a second pass, there are two things missing here that can create a bit of a sharp edge:
a. We're missing during_sample_enter and during_sample_exit -- these set a flag that allows us to avoid nested operations inside the profiler. E.g. some sharp edges along the lines of "the profiler is sampling something else -> it calls some VM api that causes the VM to check for interruptions -> the VM decides now is a really nice time to flush heap things -> our current state may not be in a consistent state" (or the reverse -- maybe this is the function that started first, it triggers an allocation, and the flipped situation happens).
b. (Minor) We're missing the discrete_dynamic_sampler_before_sample and discrete_dynamic_sampler_after_sample calls to update the dynamic sampling rate mechanism. In practice, this means that work done inside this function isn't accounted as profiler overhead. TBH what we're doing right now isn't a lot but... yeah, maybe at least leave a comment saying "This is not being accounted for in the dynamic sampling rate update, and that's OK because the amount of work we do in this case is very small".
Both things happen in e.g. on_newobj_event so doing the same here should be enough.
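A rough sketch of how point (a) could look when applied to the postponed job (guard signatures and the flush helper are assumptions based on the description of `on_newobj_event` above, not verified against this PR):

```c
// Sketch only -- mirrors what the review describes on_newobj_event as doing.
static void finalize_heap_allocation_from_postponed_job(DDTRACE_UNUSED void *_unused) {
  cpu_and_wall_time_worker_state *state = active_sampler_instance_state;
  if (state == NULL || !ddtrace_rb_ractor_main_p()) return;

  // (a) Block nested profiler operations (and re-entrancy of this very function)
  // while we flush the pending heap recordings.
  during_sample_enter(state);

  // (b) Not accounting this work in the dynamic sampling rate update; this is OK
  // because the amount of work done here is very small.
  heap_recorder *recorder = thread_context_collector_get_heap_recorder(state->thread_context_collector_instance);
  heap_recorder_finalize_pending_recordings(recorder); // hypothetical flush helper

  during_sample_exit(state);
}
```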
Edit: Actually I wonder if we can even get recursion where finalize gets called -> goes into the vm -> vm calls finalize again (e.g. if there's an allocation) or something. during_sample_enter/during_sample_exit would also protect against that.
bool heap_recorder_has_pending_recordings(heap_recorder *heap_recorder) {
  if (heap_recorder == NULL) {
    return false;
  }
  return heap_recorder->pending_recordings_count > 0;
}
Minor: Since this is only for debugging, I wonder if we should just return the count instead of a boolean. Same cost for the logic, and a bit easier to debug from the Ruby side.
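A sketch of that suggestion (return type and name are illustrative):

```c
// Sketch: same logic cost as the boolean version, but the actual count is visible
// from the Ruby side when debugging.
size_t heap_recorder_pending_recordings_count(heap_recorder *heap_recorder) {
  if (heap_recorder == NULL) return 0;
  return heap_recorder->pending_recordings_count;
}
```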
As this is style, I'll leave it to the end.
| describe "pending heap recordings cleanup" do | ||
| def has_pending_recordings? | ||
| described_class::Testing.send(:_native_has_pending_heap_recordings?, stack_recorder) | ||
| end | ||
|
|
||
| def track_object_without_finalize(obj) | ||
| described_class::Testing._native_track_object(stack_recorder, obj, sample_rate, obj.class.name) | ||
| Datadog::Profiling::Collectors::Stack::Testing | ||
| ._native_sample(Thread.current, stack_recorder, metric_values, labels, numeric_labels) | ||
| end | ||
|
|
||
| it "clears pending recordings after finalization" do | ||
| skip "Only applies to Ruby 4+ with deferred heap allocation recording" if RUBY_VERSION < "4" | ||
|
|
||
| test_object = Object.new | ||
|
|
||
| track_object_without_finalize(test_object) | ||
|
|
||
| expect(has_pending_recordings?).to be true | ||
|
|
||
| described_class::Testing._native_finalize_pending_heap_recordings(stack_recorder) | ||
|
|
||
| expect(has_pending_recordings?).to be false | ||
| end | ||
|
|
||
| it "clears pending recordings after multiple allocations" do | ||
| skip "Only applies to Ruby 4+ with deferred heap allocation recording" if RUBY_VERSION < "4" | ||
|
|
||
| 3.times do | ||
| test_object = Object.new | ||
| track_object_without_finalize(test_object) | ||
| end | ||
|
|
||
| expect(has_pending_recordings?).to be true | ||
|
|
||
| described_class::Testing._native_finalize_pending_heap_recordings(stack_recorder) | ||
|
|
||
| expect(has_pending_recordings?).to be false | ||
| end |
Minor: I'm not sure these tests add a lot... Specifically, because they don't assert on any results, it's easy for them to pass while not doing the correct thing.
Furthermore, we have existing coverage where we already check if the objects we intend to sample/track do get sampled/track.
So... I'd be inclined to remove these tests (and the pending_heap_recordings? helper maybe as well?).
Things slightly changed. I'll let you refer to the comment above (in this file).
I took into account the major comments.
expect(has_pending_recordings?).to be true

described_class::Testing._native_finalize_pending_heap_recordings(stack_recorder)
The test shows that you need these bits, which is not ideal (but I still like that the test forces me to do this).
The idea is to delay the time at which we record object ids. Once we are outside of the allocation code path, we can request the object ID.
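Conceptually (a sketch with illustrative names, not the exact PR code), the allocation path only stashes the raw VALUE, and the object id is requested later, once we're safely outside the allocation:

```c
// At allocation time (inside on_newobj_event): no rb_obj_id call, just stash the VALUE.
static void stash_pending(heap_recorder *recorder, VALUE new_object, live_object_data data) {
  pending_recording *slot = &recorder->pending_recordings[recorder->pending_recordings_count++];
  slot->object_ref = new_object;
  slot->object_data = data;
}

// Later, outside the allocation code path (e.g. from the postponed job): it's now
// safe on Ruby 4 to ask for the object id, even though doing so may mutate the object.
static void finalize_pending(heap_recorder *recorder, pending_recording *slot) {
  VALUE obj_id = rb_obj_id(slot->object_ref);
  commit_recording(recorder, obj_id, &slot->object_data); // hypothetical commit helper
}
```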
- Avoid scheduling many postponed jobs. This fixes some of the issues we had with accuracy, and I suspect it also has less overhead.
- Avoid re-entrancy, based on Ivo's comments
Although some of this code is dead code on legacy Rubies, always compiling it in means fewer ifdefs spread throughout, and it helps keep the code focused on modern Rubies rather than on legacy ones.
This check is already covered by `heap_recorder->active_recording != NULL` (they're set and unset together).
This reverts commit e153759. (Avoid touching CHANGELOG for nicer diff)
…is needed. This will replace the more heavy-handed query in `thread_context_collector_heap_pending_buffer_pressure`.
…filer directly. This avoids other parts of the profiler needing to care about this -- they only need to care about running the `after_sample` callback.
…ons directly. We no longer need to ask other parts of the code to raise instead :)
This probably needs adjusting for non-4.0 rubies, will do it as a separate pass.
In the future we may end up using the deferred recording for legacy rubies as well, so might as well lay the groundwork.
**What does this PR do?** This PR adds Ruby 4 heap profiling test variants. As per DataDog/dd-trace-rb#5201, we need a different implementation of heap profiling on Ruby 4, so we want to validate it's still sane with prof-correctness. These variants are effectively the same as the regular ones but:
* Replace Ruby 3.3 with Ruby 4.0
* Point to the branch from DataDog/dd-trace-rb#5201 (we'll need to change this back to master once landed)
* Set the `DD_PROFILING_EXPERIMENTAL_HEAP_RUBY4_ENABLED` env variable we're using to gate the new feature
* Update the `expected_profile.json` to take into account that object allocation in Ruby 4 produces a slightly different stack trace (there's no `new` method on the stack -- it gets inlined into the caller)

**Motivation:** Validate Ruby 4 heap profiling.

**Additional Notes:** There are still a few anomalies in the results we're looking into...

**How to test the change?** Run as usual!
Allow heap profiling to be enabled on Ruby 4 without any extra flags.
TODO: We're still investigating a few benchmarks / validation results, but this should be ready for review.
AlexJF
left a comment
Looks mostly good to me. Some minor feedback to try to prevent so much split logic.
Also, if the deferred mechanism is also compatible with Ruby < 4, could it make sense to unify codepaths and have all be deferred? Or does this incur a perf penalty?
// The intuition here is: Usually we ask for an `after_allocation` callback only when the buffer is about to go
// from empty -> non-empty, because this is going to be mapped onto a postponed job, so after it gets queued once
// it doesn't seem worth it to keep spamming requests.
// Yet, if for some reason the postponed job doesn't flush the pending list (or if e.g. it ran with `during_sample == true` and thus
// was skipped) we need to have some mechanism to recover -- and so if the buffer starts accumulating too much we
// start always requesting the callback to happen so that we eventually flush the buffer.
bool needs_after_allocation =
  heap_recorder->pending_recordings_count == 0 || heap_recorder->pending_recordings_count >= MAX_PENDING_RECORDINGS / 2;

return needs_after_allocation;
Nit: I think this logic is tied to the #ifdef above (which is why the return true in line 352 mentions this code block). We could promote needs_after_allocation to a top-level variable with a default value of false, move this logic to the first #ifdef, and leave this second #ifdef to focus on capturing the deferred object data.
I feel doing this makes it harder to forget to update things if our logic for needs_after_allocation changes in the future?
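A sketch of the restructuring being suggested (the surrounding function and exact placement of the #ifdef blocks are assumptions):

```c
// Fragment sketch: promote the flag to a single top-level variable with a default,
// so future changes to the needs_after_allocation policy only have one place to live.
bool needs_after_allocation = false;

#ifdef DEFERRED_HEAP_ALLOCATION_RECORDING
  // Ask for the callback when the buffer goes empty -> non-empty, and again (as a
  // recovery mechanism) once the buffer is at least half full.
  needs_after_allocation =
    heap_recorder->pending_recordings_count == 0 ||
    heap_recorder->pending_recordings_count >= MAX_PENDING_RECORDINGS / 2;
#endif

// ... a separate #ifdef block can then focus only on capturing the deferred object data ...

return needs_after_allocation;
```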
VALUE active_deferred_object;
live_object_data active_deferred_object_data;
Minor: This duplication of "responsibility/intent" between active_recording and active_deferred feels weird and error-prone (e.g. I'm not sure the rate-based skipping in line 343 plays well with your #ifdef handling logic at the end -- after all, it sounds possible for the end to not see a valid active_deferred_object reference in this case). Could we maybe just have active_recording be a flexible_object_record, which would be an object_record with either a long obj_id; or a VALUE object_ref;?
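Something like the following is what the suggestion seems to point at (a hedged sketch; names and exact layout are illustrative):

```c
// One "active recording" shape for both paths: either the obj_id was resolved
// immediately (legacy Rubies), or we still hold the raw VALUE to resolve later
// (Ruby 4 deferred path).
typedef struct {
  bool id_resolved;
  union {
    long obj_id;      // legacy path: resolved during on_newobj_event
    VALUE object_ref; // deferred path: resolved after the allocation completes
  } ref;
  live_object_data object_data;
} flexible_object_record;
```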
r1viollet
left a comment
LGTM (can't approve as I opened this)
Though feel free to merge.
I was thinking that a test that would exercise some of the raise code paths could be good, though not blocking for this PR.
Performance looks the same as other heap instrumentation.
Accuracy tests are OK.

What does this PR do?
This PR reimplements part of the heap profiler to support Ruby 4 (which was disabled in #5148).
Motivation:
Support heap profiling on Ruby 4!
Change log entry
Yes. Add support for heap profiling on Ruby 4.
Additional Notes:
The Datadog Ruby heap profiler tracks live heap objects by storing their `object_id` when they're allocated, then later using `ObjectSpace._id2ref` to check if those objects are still alive. This mechanism is currently incompatible with Ruby 4.x.

Key Components
- `collectors_cpu_and_wall_time_worker.c` - Main sampling coordinator
- `heap_recorder.c` - Tracks live heap objects using object IDs

Allocation Flow (Before Fix)
Liveness Check Flow
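(The flow diagram isn't reproduced here.) As a rough sketch of the idea described above -- the real `heap_recorder.c` code differs in its details -- liveness can be probed by asking `ObjectSpace._id2ref` for the stored id and treating a raise as "object no longer alive":

```c
#include <ruby.h>
#include <stdbool.h>

// Sketch only: probe liveness of a previously stored object id.
static VALUE id2ref_wrapper(VALUE obj_id) {
  VALUE object_space = rb_const_get(rb_cObject, rb_intern("ObjectSpace"));
  return rb_funcall(object_space, rb_intern("_id2ref"), 1, obj_id);
}

static bool object_id_still_alive(VALUE obj_id) {
  int raised = 0;
  rb_protect(id2ref_wrapper, obj_id, &raised);
  return raised == 0; // a raise (e.g. RangeError) means the id no longer maps to a live object
}
```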
The Ruby 4.x Problem
What Changed
Ruby 4.x changed how `object_id` works internally. The key issues:
- `on_newobj_event` is called during object allocation (the object is in an "in-between" state)
- Calling `rb_obj_id()` during this event mutates the object (assigns an ID)

Implemented Solution: Deferred Object ID Recording
We defer the `rb_obj_id()` call to after the allocation tracepoint completes, using `rb_postponed_job_trigger`.

Allocation Flow (After Fix - Ruby 4+)
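(The flow diagram isn't reproduced here.) The wiring is roughly this shape -- a sketch using the Ruby 3.3+ postponed-job API that the PR's diff references, with illustrative helper names rather than the PR's exact code:

```c
#include <ruby.h>
#include <ruby/debug.h>

static rb_postponed_job_handle_t finalize_heap_allocation_from_postponed_job_handle;

// Runs at a VM safe point, outside the NEWOBJ tracepoint, where calling
// rb_obj_id on the pending objects is safe on Ruby 4.
static void finalize_heap_allocation_from_postponed_job(void *data) {
  (void) data;
  // ... resolve object ids for all pending recordings and commit them ...
}

static void setup_postponed_job(void) {
  finalize_heap_allocation_from_postponed_job_handle =
    rb_postponed_job_preregister(0, finalize_heap_allocation_from_postponed_job, NULL);
}

// Called from the NEWOBJ tracepoint: only stash the VALUE + metadata, then ask the
// VM to call us back later.
static void on_newobj(VALUE newly_allocated_object) {
  (void) newly_allocated_object; // stash into the pending buffer (not shown)
  rb_postponed_job_trigger(finalize_heap_allocation_from_postponed_job_handle);
}
```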
How to test the change?
The usual test coverage uses the new code path on Ruby 4.
We've also added additional coverage in DataDog/prof-correctness#89 + internally in the reliability environment.