
Fix large event handling for DynamoDB backend #62899

Merged
kshi36 merged 6 commits into master from kevin/dynamoevents-large
Feb 4, 2026
Conversation

@kshi36
Contributor

@kshi36 kshi36 commented Jan 16, 2026

Fixes #61645

This PR fixes a bug where the event handler would get stuck when querying large events (>1 MiB) from DynamoDB.

We attempt to trim the single large event first using AuditEvent.TrimToMaxSize before sending it over.

changelog: Fixed bug where event handler would get stuck on DynamoDB backend when handling large events

@kshi36 kshi36 changed the title Fix single large event handling for DynamoDB backend Fix large event handling for DynamoDB backend Jan 16, 2026
@kshi36 kshi36 marked this pull request as ready for review January 16, 2026 20:09
@github-actions github-actions bot added audit-log Issues related to Teleport's Audit Log size/md labels Jan 16, 2026
Contributor

@juliaogris juliaogris left a comment


LGTM with a few minor comments.

One remark / question: Athena already had this large event handling, and Dynamo is now fixed by this PR. But PostgreSQL, Firestore, and Filelog don't have this handling and could potentially hit the same stuck behavior if they encounter a single event approaching 1 MiB, correct?

@kshi36
Contributor Author

kshi36 commented Jan 20, 2026

But PostgreSQL, Firestore, and Filelog don't have this handling and could potentially hit the same stuck behavior if they encounter a single event approaching 1 MiB, correct?

That's correct; I plan to open separate PR(s) to add large event handling for those backends as well.

@kshi36 kshi36 requested a review from bernardjkim January 20, 2026 18:42
}
return out, true, nil

if l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse {
Contributor

@hugoShaka hugoShaka Jan 26, 2026


If we are here, I suppose we are in the single-event case, else we would have been caught in the early return if len(out) > 0. If so, how can l.totalSize+len(trimmedData) <= events.MaxEventBytesInResponse be true? l.totalSize is supposed to be zero in a single-event case, and we are trimming the event.

Is it possible for the trimmed event to still exceed the max size?

Contributor Author

@kshi36 kshi36 Jan 26, 2026


Yes, it's still possible for the trim to be unsuccessful, so we guard-check here. In that code path we return an error to flag a bug with certain events not being trimmed properly, rather than skipping large events altogether.
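A sketch of the guard and error path being described, with hypothetical names standing in for the loader's fields (`l.totalSize`, `trimmedData`, and `events.MaxEventBytesInResponse` in the real code):

```go
package main

import "fmt"

// checkBudget mirrors the guard discussed above: if even the trimmed event
// does not fit the response budget, return an error to flag a trim bug
// rather than silently skipping the event. (Later in this thread the
// behaviour was changed to serve the oversized event instead.)
func checkBudget(totalSize int, trimmedData []byte, limit int) error {
	if totalSize+len(trimmedData) <= limit {
		return nil // trimmed event fits; safe to append to the response
	}
	return fmt.Errorf("event still %d bytes after trimming, exceeds %d byte budget", len(trimmedData), limit)
}
```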

Contributor


How does the event handler react to this error?

Contributor Author

@kshi36 kshi36 Jan 27, 2026


The error trickles up to (*EventsJob).runPolling and (*EventsJob).run, and the event handler re-enters the runPolling loop after 5s. The event handler doesn't panic and crash, but it will likely loop indefinitely because the large event can never be exported.

events_job.go snippet

	for {
		err := j.runPolling(ctx)
		if err == nil || ctx.Err() != nil {
			j.app.log.DebugContext(ctx, "Watch loop exiting")
			return trace.Wrap(err)
		}

		j.app.log.ErrorContext(
			ctx, "Unexpected error in watch loop. Reconnecting in 5s...",
			"error", err,
		)

		select {
		case <-time.After(time.Second * 5):
		case <-ctx.Done():
			return nil
		}
	}

Contributor


In this case I think we should either serve the oversized event, or skip it entirely. We are already blocking on this, so it's not worse than the current state.

Contributor


I agree that we should serve the event. As long as the event size is lower than 4MB, the gRPC max message size, we are good.

In retrospect, we shouldn't trim the events to 1 MB; that limit doesn't make much sense from an operational point of view.

Contributor Author


In that case, should we skip trimming events to 1MB? Should we set the limit to 4MB and try to trim to that size?

Contributor


For consistency, I’d keep the 1 MB trim check in place for now.
Going forward, I’d like to eliminate this limit, either by splitting large events into several events or returning them intact.

There's little value in recording and storing complete data if it can't all be read back afterwards.

@kshi36
Contributor Author

kshi36 commented Jan 28, 2026

After discussing with Hugo, we decided to serve the oversized event in the case that trimming is unsuccessful (bypassing the 1 MiB limit), as opposed to sending an error. If an error is returned, the event handler will re-engage the runPolling loop after 5 seconds. This would cause the event handler to loop indefinitely.

events_job.go snippet

	for {
		err := j.runPolling(ctx)
		if err == nil || ctx.Err() != nil {
			j.app.log.DebugContext(ctx, "Watch loop exiting")
			return trace.Wrap(err)
		}

		j.app.log.ErrorContext(
			ctx, "Unexpected error in watch loop. Reconnecting in 5s...",
			"error", err,
		)

		select {
		case <-time.After(time.Second * 5):
		case <-ctx.Done():
			return nil
		}
	}
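A minimal Go sketch of the revised behaviour agreed in this thread. Names are hypothetical stand-ins for the DynamoDB events loader's fields; the point is that an oversized event is logged and served rather than turned into an error that would feed the 5s retry loop above forever:

```go
package main

import "log"

// serveEvent sketches the decision above: when an event still exceeds the
// budget after trimming, log the anomaly and serve the event anyway instead
// of returning an error, since an error would only send the handler back
// into the 5-second reconnect loop indefinitely.
func serveEvent(trimmedData []byte, totalSize, limit int) ([]byte, bool) {
	oversized := totalSize+len(trimmedData) > limit
	if oversized {
		log.Printf("serving oversized event: %d bytes exceeds %d byte budget", len(trimmedData), limit)
	}
	return trimmedData, oversized
}
```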

Contributor

@juliaogris juliaogris left a comment


Still LGTM, with minor comments, feel free to ignore.

One observation: IIUC Athena still returns an error when trimming fails (querier.go:1018-1020), which would cause the same infinite retry loop you're fixing here for DynamoDB. Might be worth a follow-up PR to align Athena with this approach (serve oversized event + log error) for consistency.

@kshi36
Contributor Author

kshi36 commented Jan 29, 2026

IIUC Athena still returns an error when trimming fails

This makes sense, I will work on a follow-up PR for this. After rummaging around in issues, I encountered this! #54480


if err != nil {
return nil, false, trace.Wrap(err, "failed to trim event to max size")
}
trimmedData, err := json.Marshal(e.FieldsMap)
Contributor


I wonder if we really need to marshal it again here.

Do we care whether it was trimmed correctly if we return it anyway?

Contributor Author


I guess for now it helps to keep track of the events.MetricQueriedTrimmedEvents metric, as well as surface an error explaining why certain events cannot be trimmed. If a very large event (>4 MiB?) ever causes unexpected behavior, we can pinpoint it here.
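A sketch of the re-marshal step being defended above. The counter is a plain stand-in for the events.MetricQueriedTrimmedEvents metric mentioned in the comment, and the function names are hypothetical:

```go
package main

import "encoding/json"

// queriedTrimmedEvents is a plain counter standing in for the
// events.MetricQueriedTrimmedEvents metric mentioned above.
var queriedTrimmedEvents int

// remarshalAfterTrim re-marshals the trimmed fields so the final size can
// be verified and the trimmed-events metric incremented. It reports whether
// the event still exceeds the limit even after trimming, which is the case
// the discussion above wants to be able to pinpoint later.
func remarshalAfterTrim(fields map[string]any, limit int) ([]byte, bool, error) {
	data, err := json.Marshal(fields)
	if err != nil {
		return nil, false, err
	}
	queriedTrimmedEvents++ // every event passing through here was trimmed
	return data, len(data) > limit, nil
}
```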

@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from eriktate February 3, 2026 15:36
@kshi36 kshi36 requested a review from tigrato February 3, 2026 21:57
@kshi36 kshi36 added this pull request to the merge queue Feb 4, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 4, 2026
@kshi36 kshi36 added this pull request to the merge queue Feb 4, 2026
Merged via the queue into master with commit 8164989 Feb 4, 2026
43 checks passed
@kshi36 kshi36 deleted the kevin/dynamoevents-large branch February 4, 2026 23:43
@backport-bot-workflows
Contributor

@kshi36 See the table below for backport results.

Branch Result
branch/v17 Failed
branch/v18 Failed

kshi36 added a commit that referenced this pull request Feb 5, 2026
* Fix single large event handling for DynamoDB backend

* Change inequality sign from >= to >

* Fix tests, add comments

* Modify to serve the oversized event and bypass the limit

* Fix minor typos

* Address feedback
github-merge-queue bot pushed a commit that referenced this pull request Feb 5, 2026
* Fix single large event handling for DynamoDB backend

* Change inequality sign from >= to >

* Fix tests, add comments

* Modify to serve the oversized event and bypass the limit

* Fix minor typos

* Address feedback

Successfully merging this pull request may close these issues.

Event handler gets stuck on DynamoDB backend

5 participants