Fix large event handling for DynamoDB backend#62899
Conversation
juliaogris
left a comment
LGTM with a few minor comments.
One remark / question: Athena already had this large event handling, and Dynamo is now fixed by this PR. But PostgreSQL, Firestore, and Filelog don't have this handling and could potentially hit the same stuck behavior if they encounter a single event approaching 1 MiB, correct?
This is correct. I plan to open separate PR(s) to add large event handling to those backends as well.
}
return out, true, nil

if l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse {
If we are here, I suppose we are in the single-event case; otherwise we would have been caught by the early return when len(out) > 0. If so, how can l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse be true? l.totalSize is supposed to be zero in the single-event case, and we are trimming the event.
Is it possible for the trimmed event to still exceed the max size?
Yes, it may still be possible that the trim was unsuccessful, so we guard-check here. We then return an error in this code path to indicate a bug where certain events are not trimmed properly (to avoid silently skipping large events altogether).
How does the event handler react to this error?
The error trickles up to (*EventsJob).runPolling and to (*EventsJob).run, and then the event handler re-enters the runPolling loop after 5s. The event handler doesn't seem to panic and crash, but it will likely loop indefinitely because the large event can never be exported.
events_job.go snippet

for {
	err := j.runPolling(ctx)
	if err == nil || ctx.Err() != nil {
		j.app.log.DebugContext(ctx, "Watch loop exiting")
		return trace.Wrap(err)
	}
	j.app.log.ErrorContext(
		ctx, "Unexpected error in watch loop. Reconnecting in 5s...",
		"error", err,
	)
	select {
	case <-time.After(time.Second * 5):
	case <-ctx.Done():
		return nil
	}
}
In this case I think we should either serve the oversized event, or skip it entirely. We are already blocking on this, so it's not worse than the current state.
I agree that we should serve the event. As long as the event size is lower than 4 MB, the gRPC max message size, we are good.
In retrospect, we shouldn't trim the events to 1 MB; that limit doesn't make much sense from an operational point of view.
In that case, should we skip trimming events to 1MB? Should we set the limit to 4MB and try to trim to that size?
For consistency, I’d keep the 1 MB trim check in place for now.
Going forward, I’d like to eliminate this limit, either by splitting large events into several events or returning them intact.
There’s little value in recording and storing complete data if it can’t all be read back afterwards.
After discussing with Hugo, we decided to serve the oversized event when trimming is unsuccessful (bypassing the 1 MiB limit), rather than returning an error. If an error were returned, the event handler would re-enter the retry loop shown in the events_job.go snippet above and loop indefinitely.
juliaogris
left a comment
Still LGTM, with minor comments, feel free to ignore.
One observation: IIUC Athena still returns an error when trimming fails (querier.go:1018-1020), which would cause the same infinite retry loop you're fixing here for DynamoDB. Might be worth a follow-up PR to align Athena with this approach (serve oversized event + log error) for consistency.
This makes sense, I will work on a follow-up PR for this. After rummaging around in issues, I encountered this! #54480
if err != nil {
	return nil, false, trace.Wrap(err, "failed to trim event to max size")
}
trimmedData, err := json.Marshal(e.FieldsMap)
I wonder if we really need to marshal it again here.
Do we care whether we trimmed it correctly if we return it anyway?
I guess for now it helps to keep track of the events.MetricQueriedTrimmedEvents metric, as well as surface an error showing why certain events cannot be trimmed. If a very large event (>4 MiB?) causes unexpected behavior, we can pinpoint it here.
* Fix single large event handling for DynamoDB backend
* Change inequality sign from >= to >
* Fix tests, add comments
* Modify to serve the oversized event and bypass the limit
* Fix minor typos
* Address feedback
Fixes #61645
This PR fixes a bug where the event handler would get stuck when querying large events (>1 MiB) from DynamoDB.
We attempt to trim the single large event first using AuditEvent.TrimToMaxSize before sending it over.

changelog: Fixed bug where event handler would get stuck on DynamoDB backend when handling large events