Fix large event handling for DynamoDB backend#62899
Conversation
juliaogris
left a comment
LGTM with a few minor comments.
One remark / question: Athena already had this large event handling, and Dynamo is now fixed by this PR. But PostgreSQL, Firestore, and Filelog don't have this handling and could potentially hit the same stuck behavior if they encounter a single event approaching 1 MiB, correct?
This is correct. I plan to open separate PR(s) to add large event handling to those backends as well.
}
return out, true, nil

if l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse {
If we are here, I suppose we are in the single-event case; otherwise we would have been caught by the early return when len(out) > 0. If so, how can l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse be true? l.totalSize is supposed to be zero in the single-event case, and we are trimming the event.
Is it possible for the trimmed event to still exceed the max size?
Yes, it may still be possible that the trim was unsuccessful, so we guard-check here. We then return an error in this code path to indicate a bug where certain events are not trimmed properly (to avoid silently skipping large events altogether).
How does the event handler react to this error?
The error trickles up to (*EventsJob).runPolling and to (*EventsJob).run, and then the event handler re-enters the runPolling loop after 5s. The event handler doesn't seem to panic and crash, but it will likely loop indefinitely because the large event can never be exported.
events_job.go snippet

for {
	err := j.runPolling(ctx)
	if err == nil || ctx.Err() != nil {
		j.app.log.DebugContext(ctx, "Watch loop exiting")
		return trace.Wrap(err)
	}
	j.app.log.ErrorContext(
		ctx, "Unexpected error in watch loop. Reconnecting in 5s...",
		"error", err,
	)
	select {
	case <-time.After(time.Second * 5):
	case <-ctx.Done():
		return nil
	}
}
In this case I think we should either serve the oversized event, or skip it entirely. We are already blocking on this, so it's not worse than the current state.
I agree that we should serve the event. As long as the event size is lower than 4 MB, the gRPC max message size, we are good.
In retrospect, we shouldn't trim the events to 1 MB; that limit doesn't make much sense from an operational point of view.
In that case, should we skip trimming events to 1MB? Should we set the limit to 4MB and try to trim to that size?
For consistency, I’d keep the 1 MB trim check in place for now.
Going forward, I’d like to eliminate this limit, either by splitting large events into several events or returning them intact.
There’s little value in recording and storing complete data if it can’t all be read back afterwards.
After discussing with Hugo, we decided to serve the oversized event when trimming is unsuccessful (bypassing the 1 MiB limit), rather than returning an error. If an error were returned, the event handler would re-enter the retry loop shown in the events_job.go snippet above and loop indefinitely.
juliaogris
left a comment
Still LGTM, with minor comments, feel free to ignore.
One observation: IIUC Athena still returns an error when trimming fails (querier.go:1018-1020), which would cause the same infinite retry loop you're fixing here for DynamoDB. Might be worth a follow-up PR to align Athena with this approach (serve oversized event + log error) for consistency.
This makes sense, I will work on a follow-up PR for this. After rummaging around in issues, I encountered this! #54480
if err != nil {
	return nil, false, trace.Wrap(err, "failed to trim event to max size")
}
trimmedData, err := json.Marshal(e.FieldsMap)
I wonder if we really need to marshal it again here.
Do we care whether we trimmed it correctly if we return it anyway?
I guess for now it helps to keep track of the events.MetricQueriedTrimmedEvents metric, as well as surface an error showing why certain events cannot be trimmed. If a very large event (>4 MiB?) causes unexpected behavior, we can pinpoint it here.
* Fix single large event handling for DynamoDB backend
* Change inequality sign from >= to >
* Fix tests, add comments
* Modify to serve the oversized event and bypass the limit
* Fix minor typos
* Address feedback
Fixes #61645
This PR fixes a bug where the event handler would get stuck when querying large events (>1 MiB) from DynamoDB.
We attempt to trim the single large event first using AuditEvent.TrimToMaxSize before sending it over.

changelog: Fixed bug where event handler would get stuck on DynamoDB backend when handling large events