Add a fallback for EmitAuditEvents failure due to event conflicts (DynamoDB backend) by gabrielcorado · Pull Request #40854 · gravitational/teleport

gabrielcorado · 2024-04-24T00:29:42Z

Adds a fallback for when the put item call fails with a conditional check exception (duplicate events).

In addition, we're adding a new option to disable condition checking, which can be configured through the DynamoDB URL. This option can be used to restore the old behavior.
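
As a rough illustration of how such an option could be read from the URL, here is a minimal sketch. The `disable_conflict_check` name comes from the discussion in this PR; the `dynamodb://teleport-events` URL, the `conflictCheckDisabled` helper, and the parsing rules are assumptions made for the example, not the code added by this change.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// conflictCheckDisabled is a hypothetical helper: it reports whether the
// disable_conflict_check query parameter is set to a true value.
func conflictCheckDisabled(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	v := u.Query().Get("disable_conflict_check")
	if v == "" {
		// Parameter absent: condition checking stays enabled.
		return false, nil
	}
	return strconv.ParseBool(v)
}

func main() {
	// Both URLs are illustrative only.
	for _, raw := range []string{
		"dynamodb://teleport-events",
		"dynamodb://teleport-events?disable_conflict_check=true",
	} {
		disabled, err := conflictCheckDisabled(raw)
		fmt.Println(raw, disabled, err)
	}
}
```

Leaving the parameter out keeps the conditional put in place, so the old overwrite behavior is strictly opt-in in this sketch.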

NOTE: This is being solved in the DynamoDB events "layer" because multiple parts of Teleport are subject to this failure (not only the one described in the issue).

changelog: Fix audit event failures when using DynamoDB event storage.

zmb3 · 2024-04-24T14:08:15Z

 	require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 0)))
 	require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
-	require.Error(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
+	require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))


Hmm, do we want the audit log to be able to be overwritten?

I feel like the original behavior of "first write wins" is more correct for an audit log.

Should we instead just not treat "already exists" as an error?

gabrielcorado

> Hmm, do we want the audit log to be able to be overwritten?

This will only happen if `disable_conflict_check` is provided. Otherwise, every event still needs a unique session_id/event_index pair. The fallback logic only sets the event index to a different value and retries the put. If there is still a conflict (very unlikely, since the new index is a Unix nanosecond timestamp and is scoped to the session), the event emission still fails.

If it makes it clearer, we can add another test that causes the fallback to fail as well. In that case, EmitAuditEvent will return an error.

zmb3

Seems reasonable to me.

public-teleport-github-review-bot · 2024-04-25T17:38:24Z

@gabrielcorado See the table below for backport results.

Branch	Result
branch/v14	Create PR
branch/v15	Create PR

…48548) (#48624) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

…48548) (#48626) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

This commit fixes a data loss bug that caused the DynamoDB cursor EventIndex field to be slightly changed due to conversion issues. As this field is used to index events, paginated queries could return the wrong events, from either before or after the requested page. In the worst case, this could cause a livelock as the query continuously processes the same events.

The data loss is caused by improper JSON unmarshalling of large integers. It happened for these reasons:
- JSON is fundamentally flawed as it offers a single number type, "binary64", for all numbers, whether they are integers or floats. Go's encoding/json library uses field types to detect whether a number should be stored in an int64 or a float64.
- [The AWS SDK v2 migration PR](#44363) changed the cursor JSON unmarshalling logic and unmarshalled the cursor into `map[string]any`. This caused every integer field of `event` to round-trip through float64.
- [The Emit event fallback PR](#40854) changed the EventIndex value from a small incremental integer to a large Unix nanosecond timestamp in case of conflict. The large value was no longer safe to store in a float64, which represents integers exactly only up to 2^53.

The combination of those 3 factors corrupted the cursor EventIndex and caused unexpected event query index offsets. When presented with a non-existent document, DynamoDB still hashes it and starts the query from its supposed location in the index. This is why the issue went undetected for so long.

Its consequences were:
- duplicated events returned on 2 consecutive pages (this case was handled properly by the event forwarder, as it keeps track of the last processed event)
- a livelock if the number of duplicated events exceeds the page size
- non-forwarded events if the index offset was in the future

refactor(dynamoevents): add a fallback for put item condition error

9fea7f5

gabrielcorado requested review from greedy52 and zmb3 April 24, 2024 00:29

gabrielcorado self-assigned this Apr 24, 2024

github-actions bot requested review from EdwardDowling and avatus April 24, 2024 00:30

github-actions bot added audit-log size/sm labels Apr 24, 2024

zmb3 reviewed Apr 24, 2024

gabrielcorado requested a review from zmb3 April 24, 2024 16:29

greedy52 reviewed Apr 24, 2024

Comment thread lib/events/dynamoevents/dynamoevents.go

Comment thread lib/events/dynamoevents/dynamoevents.go Outdated

zmb3 approved these changes Apr 24, 2024

refactor(dynamoevents): code review suggestions

42e0e82

gabrielcorado requested a review from greedy52 April 25, 2024 16:06

gabrielcorado added backport/branch/v14 labels Apr 25, 2024

greedy52 approved these changes Apr 25, 2024

public-teleport-github-review-bot bot removed request for EdwardDowling and avatus April 25, 2024 16:23

gabrielcorado added this pull request to the merge queue Apr 25, 2024

Merged via the queue into master with commit af4a627 Apr 25, 2024

gabrielcorado deleted the gabrielcorado/dynamoevents-condition-handling branch April 25, 2024 17:36

This was referenced Apr 25, 2024

[v14] Add a fallback for EmitAuditEvents failure due to event conflicts (DynamoDB backend) #40912

Merged

[v15] Add a fallback for EmitAuditEvents failure due to event conflicts (DynamoDB backend) #40913

Merged

zmb3 mentioned this pull request Jul 29, 2024

Connection via ssh using Paramiko (Python) is generating an error in DynamoDB on AWS #42652

Closed

greedy52 mentioned this pull request Sep 20, 2024

Reduce log level on "Conflict on event session_id and event_index error" for DynamoDB events #46801

Closed

rosstimothy added a commit that referenced this pull request Nov 6, 2024

Reduce log spam generated by conflicting session_id and event_index

de9ab55

Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Closes #46801

rosstimothy added a commit that referenced this pull request Nov 6, 2024

Reduce log spam generated by conflicting session_id and event_index

e40c69e

Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

rosstimothy mentioned this pull request Nov 6, 2024

Reduce log spam generated by conflicting session_id and event_index #48548

Merged

rosstimothy added a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index

9e4f1d9

Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index (#…

53ccda9

…48548) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

github-actions bot pushed a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index

9c6761f

Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

rosstimothy added a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index (#…

f79b8ef

…48548) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

rosstimothy added a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index (#…

548ba52

…48548) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024

Reduce log spam generated by conflicting session_id and event_index (#…

c85a081

…48605) Alters the log messages from #40854 such that they only occur if the fallback mechanism fails. Updates #46801

hugoShaka mentioned this pull request Sep 5, 2025

Marshal iterator into event instead of map[string]any #58765

Merged

gabrielcorado merged 2 commits into master from gabrielcorado/dynamoevents-condition-handling
