reporter: compute profile duration using the time during which the profiler was collecting samples #1129
Conversation
reporter: compute profile duration using the time during which the profiler was collecting samples

previously, we were reporting the profile duration based on sampled events for specific resources. this means that idle resources could end up with much smaller profile durations that would not reflect the actual utilization, compared to busy resources. more generally, using the sample timestamps to compute the duration leads to very small durations for idle hosts, which would be inaccurate if the profiler is running for much longer. this change proposes setting the duration to the time during which the profiler was running / collecting samples rather than the sample timestamps.
```go
*traceEventsPtr = newEvents
collectionEndTime := time.Now()
collectionStartTime := r.collectionStartTime
r.collectionStartTime = collectionEndTime
```
Since Sample timestamps are asynchronously created in the kernel, and this part of the reporter isn't synchronized with the tracer subsystem in any way, do we have a race here where it's possible for Samples with timestamps < collectionStartTime to appear in a profile which would violate the protocol?
Good point, thanks for catching this. I'll work on a patch to update the start time to the minimum between collectionStartTime and the smallest timestamp in samples. This way the invariant should always be true.
Let me know if you would prefer fixing this in another way 🙏
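For illustration, here's a minimal standalone sketch of that clamping idea (names are illustrative, not the actual reporter fields):

```go
package reporter

import "time"

type sample struct {
	timestamp time.Time
}

// profileStartTime clamps the profile start to the earliest sample
// timestamp, so that every sample in the batch falls within the reported
// profile window even if it was created in the kernel before
// collectionStartTime was snapshotted.
func profileStartTime(collectionStartTime time.Time, samples []sample) time.Time {
	start := collectionStartTime
	for _, s := range samples {
		if s.timestamp.Before(start) {
			start = s.timestamp
		}
	}
	return start
}
```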
Thanks! Looking deeper into this, I think this will not be sufficient as there's still the possibility that a trace with timestamp < tmin, where tmin is the minimum timestamp in a reporter batch, will show up in the next batch.
However, I'm not sure if this is a protocol violation: We don't violate the invariant of having all Sample timestamps fall within the Profile duration, but there's a possibility that Profiles will have start times that go back in time (from one to the next). Even if we haven't made it explicit, it's probably best not to violate the implicit contract that reported Profiles from the same profiler are strictly linear in time (though clock changes can mess with that, and we do periodically sync with the realtime clock even if our base timestamps are monotonic).
To better understand this issue (that was pre-existing, you didn't introduce it with this PR) see my comment here.
In short, the tracer doesn't sort trace events by timestamp and also doesn't batch, so there are no ordering guarantees (either between traces or between subsequent batches) as events are sent from tracer to processmanager and to the reporter. That opens a window where a trace event can show up with a timestamp that's smaller than the minimum timestamp of the previously processed 'batch'. We've had to deal with this issue when cleaning up process metadata (e.g. for a process that exited) while still allowing for in-flight traces to be symbolized.
We came up with the ProcessedUntil mechanism that lets the rest of the agent know that trace events up to a particular time have been processed. This allows us to make decisions that depend on ordering without enforcing explicit ordering of events. The price we pay is a slight increase in processing latency (that does not compromise trace fidelity, as trace timestamps stay intact).
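For readers without that context, here's a rough standalone sketch of the watermark idea (types and names are hypothetical, not the actual tracer/processmanager API):

```go
package tracer

import "sync/atomic"

// processedUntil is the kernel timestamp (ktime, in nanoseconds) up to
// which all trace events are known to have been handed off downstream.
var processedUntil atomic.Int64

// advance moves the watermark forward; it never moves it backwards, so
// the watermark stays monotonic even if notifications race each other.
func advance(ktime int64) {
	for {
		cur := processedUntil.Load()
		if ktime <= cur || processedUntil.CompareAndSwap(cur, ktime) {
			return
		}
	}
}

// safeToExpire reports whether state tied to deadlineKTime (e.g. metadata
// of an exited process) can be cleaned up: once the watermark has passed
// it, no in-flight trace event can still need it.
func safeToExpire(deadlineKTime int64) bool {
	return deadlineKTime <= processedUntil.Load()
}
```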
I think the same mechanism can be used to fix this issue in the reporter but it'll probably require some changes. So the options I see for now:
1. We review and merge this PR and I open an issue with follow-up work.
2. We amend this PR to fix the issue in one go (using ProcessedUntil or another mechanism).
Let me know what you prefer.
EDIT: Thinking about it some more, I would go with 1. for now unless you feel strongly about 2.
@christos68k Thanks a lot for the context and thorough review 🙇 !!
I thought about this a bit, and my understanding is that we have the following constraints:
1. All sample timestamps >= profile start time.
2. No dropped samples.
3. Profile duration is based on the wall clock time (i.e. close to the report interval).
4. Monotonic profile start times.
The fundamental issue is that it's always theoretically possible to receive samples asynchronously from per-CPU buffers, so in the following scenario:
- At wall-clock T1, we send Profile A containing samples with timestamps [t3, t5, t7]. To satisfy "samples >= start", we set the start to t3.
- At wall-clock T2, we're preparing Profile B. A late sample with timestamp t2 arrives (it was buffered in a different CPU's buffer).
We're faced with the following options:
- Drop the late sample: violates "no dropped samples" (2.)
- Wait for late samples and include all samples in a single profile: violates "wall-clock duration" (3.)
- Include t2 in Profile B: either violates (1.) if we don't adjust the start time, or violates (4.) if we do.
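To make that tension concrete, here's a tiny illustrative sketch of the decision point for a late sample (names are made up):

```go
package reporter

import "time"

// handleLateSample shows the trade-off for a sample older than the
// pending profile's start; whichever branch we take, one constraint
// has to give.
func handleLateSample(ts, profileStart time.Time) time.Time {
	if ts.Before(profileStart) {
		// a) drop the sample     -> violates (2) "no dropped samples"
		// b) move the start back -> violates (4) monotonic start times
		// c) defer to the next profile -> it is still older than that
		//    profile's start, so the problem only moves.
		return ts // option b: keep (1) and (2), treat (4) as best-effort
	}
	return profileStart
}
```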
So ultimately there is some tension between all these constraints and we need to treat at least one of them as best-effort.
I've hacked together a tentative implementation here that uses ProcessedUntil in a best-effort fashion to reduce the likelihood of violating (4.) (so keeping the same tradeoffs as this current PR).
My thinking was that constraints (1), (2), and (3) are protocol/correctness requirements, while (4) is more of a "nice to have" for downstream consumers. But maybe we could accept dropped samples if we think the number of samples dropped would be low. In practice, I'd expect late-arriving samples to be rare since ProcessedUntil typically advances within one poll interval (~250ms), so the window for "too late" samples is small.
The implementation above creates a "pending" report and defers sending until ProcessedUntil confirms that samples up to that kernel time have been delivered. This gives late-arriving samples (from per-CPU reordering) a chance to be included in the pending report.
A couple of notes (a condensed sketch of the approach follows this list):
- This required dealing with what happens if ProcessedUntil doesn't arrive before the next report interval. The implementation above sends the report anyway as a timeout fallback.
- The time passed to ProcessedUntil isn't strictly guaranteed to increase monotonically (due to per-CPU ring buffer reordering). I used a monotonically increasing maxProcessedUntilKTime watermark to work around this.
- I've chosen to adjust the start time of the pending report in case we receive a sample that is older than the start time of the pending report, to ensure constraints (1.) and (2.) are met, but we could also drop that sample if we want to relax constraint (2.) as opposed to (4.)
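Here's the promised condensed sketch of that pending-report flow (hypothetical names and shapes, not the code from the linked branch):

```go
package reporter

// pendingReport is a report held back until the ProcessedUntil watermark
// confirms that all events up to its end have been delivered, or until a
// timeout fires.
type pendingReport struct {
	startKTime int64   // may be adjusted backwards by late samples
	endKTime   int64   // end of the collection window
	samples    []int64 // sample ktimes, stands in for real sample data
}

type reporter struct {
	pending                *pendingReport
	maxProcessedUntilKTime int64 // monotonic watermark, see note above
}

// onProcessedUntil keeps the watermark monotonic even if the tracer's
// notifications arrive slightly out of order across per-CPU buffers.
func (r *reporter) onProcessedUntil(ktime int64) {
	if ktime > r.maxProcessedUntilKTime {
		r.maxProcessedUntilKTime = ktime
	}
	r.maybeFlush(false)
}

// onSample adds a sample to the pending report, moving the start back if
// the sample predates it (keeps constraints (1.) and (2.)).
func (r *reporter) onSample(ktime int64) {
	if r.pending == nil {
		return
	}
	if ktime < r.pending.startKTime {
		r.pending.startKTime = ktime
	}
	r.pending.samples = append(r.pending.samples, ktime)
}

// maybeFlush sends the pending report once the watermark has passed its
// end, or unconditionally on timeout so a stalled watermark cannot delay
// reporting past the next interval.
func (r *reporter) maybeFlush(timeout bool) {
	if r.pending == nil {
		return
	}
	if timeout || r.maxProcessedUntilKTime >= r.pending.endKTime {
		send(r.pending)
		r.pending = nil
	}
}

func send(p *pendingReport) {} // stand-in for building and emitting the profile
```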
All in all, it looks like this question has non-trivial design / implementation consequences, so we should probably align first on which constraint we want to relax, and what high level implementation / design we're willing to consider.
Based on that, I'd personally also vote for merging this PR as is, and opening an issue to discuss this further 🙌
As you wrote, I think we can maintain 1, 2 and 3 (with 1 and 2 being hard guarantees).
Regarding 4, there are two sources of potential discontinuities:
- In the default configuration we periodically sync with the realtime clock (Sample timestamps are monotonic in the kernel but we add a delta to convert them to UnixTime, and that delta is not calculated once for the life of the agent process, but periodically; see the sketch after this list). This means that system clock changes can result in Profile start time discontinuities (expected to be rare). However, this is a configuration option (clock-sync-interval) and the user may decide to switch it off, in which case Sample timestamps will be strictly monotonic for the life of the agent process.
- Per-CPU event reordering: I think we should be able to deal with this by slightly relaxing 3. What I had in mind is something similar to what you described (have the reporter buffer trace events and calculate reported batches based on ProcessedUntil). IIRC the current logic guarantees that the ktime passed to ProcessedUntil is monotonically increasing, but I need to revisit and swap a lot of that context back in.
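To illustrate the first point, here's a minimal sketch of that ktime-to-UnixTime conversion (illustrative only; the agent's actual implementation differs in detail):

```go
package clocksync

import "time"

// delta converts monotonic kernel timestamps to Unix time:
// unixNanos = ktime + delta. It is recomputed periodically (every
// clock-sync-interval), so a system clock change shifts it and can
// introduce discontinuities in converted timestamps.
var delta int64

// syncClocks recomputes the delta from a monotonic "now" read alongside
// the realtime clock.
func syncClocks(monotonicNowNanos int64) {
	delta = time.Now().UnixNano() - monotonicNowNanos
}

// toUnixNanos converts a kernel (monotonic) timestamp to Unix time.
func toUnixNanos(ktime int64) int64 {
	return ktime + delta
}
```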
In the meantime, I'll merge this PR as discussed! Thanks!
EDIT: Opened #1142 to track this issue
this should ensure that the profile start time is always earlier than the earliest sample, even when accounting for buffered samples. also add tests to validate the adjustment behaviour.
Force-pushed from 8a9c988 to fb57048
reporter: compute profile duration using the time during which the profiler was collecting samples (open-telemetry#1129)
previously, we were reporting the profile duration based on sampled events for specific resources. this means that idle resources could end up with much smaller profile durations that would not reflect the actual utilization, compared to busy resources.
more generally, using the sample timestamps to compute the duration could lead to very small durations for mostly idle hosts, which would be inaccurate if the profiler is running for much longer.
this change proposes setting the duration to the time during which the profiler was running / collecting samples rather than the sample timestamps.