
Add a fallback for EmitAuditEvents failure due to event conflicts (DynamoDB backend) #40854

Merged
gabrielcorado merged 2 commits into master from gabrielcorado/dynamoevents-condition-handling
Apr 25, 2024

Conversation

@gabrielcorado
Contributor

Closes #40126.

Adds a fallback for when the put item fails due to the condition exception (duplicate events).

In addition, we're adding a new option to disable condition checking, which can be configured through the DynamoDB URL. This option can be used to restore the old behavior.
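As a rough illustration of reading such an option from a URL, here is a minimal sketch using only the Go standard library. The parameter name `disable_conflict_check` is the one mentioned later in this conversation; the URL shape and the helper name `conflictCheckDisabled` are assumptions made for the example, not Teleport's actual parsing code.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// conflictCheckDisabled is a hypothetical helper: it reports whether the
// disable_conflict_check query parameter is set to a true value.
// A missing parameter leaves condition checking enabled.
func conflictCheckDisabled(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	v := u.Query().Get("disable_conflict_check")
	if v == "" {
		return false, nil
	}
	return strconv.ParseBool(v)
}

func main() {
	// The URL below is an illustrative assumption, not a documented format.
	disabled, err := conflictCheckDisabled("dynamodb://events?disable_conflict_check=true")
	fmt.Println(disabled, err)
}
```

Defaulting to `false` when the parameter is absent keeps the safer "first write wins" behavior unless an operator opts out explicitly.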

NOTE: This is being solved on the DynamoDB events "layer" because multiple parts of Teleport are subject to this failure (not only the one described on the issue).

changelog: Fix audit event failures when using DynamoDB event storage.

@gabrielcorado gabrielcorado requested review from greedy52 and zmb3 April 24, 2024 00:29
@gabrielcorado gabrielcorado self-assigned this Apr 24, 2024
@github-actions github-actions bot added the audit-log and size/sm labels Apr 24, 2024
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 0)))
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
require.Error(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
Collaborator

Hmm, do we want the audit log to be able to be overwritten?

I feel like the original behavior of "first write wins" is more correct for an audit log.

Should we instead just not treat "already exists" as an error?

Contributor Author

> Hmm, do we want the audit log to be able to be overwritten?

This will only happen if the `disable_conflict_check` option is provided. Otherwise, every event still needs a unique `session_id`/`event_index` pair. The fallback logic only sets the event index to a different value and tries to put the item again; it never overwrites an existing item. If there is still a conflict (very unlikely, since the new index is a Unix nanosecond timestamp and is scoped to the session), the event emission will still fail.

Contributor Author

If it makes it clearer, we can add another test that causes the fallback to also fail. In this case, `EmitAuditEvent` will return an error.

@gabrielcorado gabrielcorado requested a review from zmb3 April 24, 2024 16:29
Comment thread lib/events/dynamoevents/dynamoevents.go
Comment thread lib/events/dynamoevents/dynamoevents.go Outdated
Collaborator

@zmb3 zmb3 left a comment

Seems reasonable to me.

@gabrielcorado gabrielcorado added this pull request to the merge queue Apr 25, 2024
Merged via the queue into master with commit af4a627 Apr 25, 2024
@gabrielcorado gabrielcorado deleted the gabrielcorado/dynamoevents-condition-handling branch April 25, 2024 17:36
@public-teleport-github-review-bot

@gabrielcorado See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v14 | Create PR |
| branch/v15 | Create PR |

rosstimothy added a commit that referenced this pull request Nov 7, 2024
Alters the log messages from #40854
such that they only occur if the fallback mechanism fails.

Updates #46801
github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024
…48548)

github-actions bot pushed a commit that referenced this pull request Nov 7, 2024
rosstimothy added a commit that referenced this pull request Nov 7, 2024
…48548)

github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024
…48605)

github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024
…48548) (#48624)

github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2024
…48548) (#48626)

hugoShaka added a commit that referenced this pull request Sep 5, 2025
This commit fixes a data loss bug causing the DynamoDB cursor EventIndex
field to be slightly changed due to conversion issues. As this field is
used to index events, this could lead to paginated queries not returning
the right events, either returning events from before or after the
requested page. In the worst case, this could cause a livelock as the
query continuously processes the same events.

The data loss issue is caused by improper JSON unmarshalling of large
integers. This happened for these reasons:
- JSON is fundamentally flawed as it offers a single number type,
  typically decoded as "binary64", for all numbers, whether they are
  integers or floats. Go's encoding/json library uses field types to
  decide if the number should be stored in an int64 or a float64; with
  an untyped target such as `any`, it uses float64.
- [The AWS SDK v2 migration PR](#44363)
  changed the cursor JSON unmarshalling logic and unmarshalled the
  cursor into `map[string]any`. This caused every integer field of
  `event` to round-trip through float64.
- [The Emit event fallback PR](#40854)
  changed the EventIndex value from a small incremental integer to a
  large Unix nanosecond timestamp in case of conflict. The large value
  was no longer safe for storage in a float64.

The combination of those 3 factors caused the cursor EventIndex to get
corrupted and caused unexpected event query index offsets. When presented
with a non-existing document, DynamoDB still hashes it and starts the
query from its supposed location in the index. This is why this issue
has not been detected for so long. Its consequences were:
- duplicated events returned on 2 consecutive pages (this case was
  handled properly by the event forwarder as it keeps track of the last
  processed event)
- livelock if the number of duplicated events exceeds the page size
- non-forwarded events if the index offset was in the future
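The precision loss described in this commit message can be reproduced with the standard library alone. A float64 represents integers exactly only up to 2^53 (about 9.0e15), while a Unix nanosecond timestamp from 2024 is about 1.7e18, so decoding it into `map[string]any` rounds it to a nearby representable value. The sketch below is illustrative: the function names and the sample timestamp are assumptions, and `Decoder.UseNumber` is shown as one way to keep the digits intact, not necessarily the fix applied in the commit.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// decodeAny decodes the cursor into map[string]any, so EventIndex becomes a
// float64 before being converted back to int64.
func decodeAny(data []byte) (int64, error) {
	var m map[string]any
	if err := json.Unmarshal(data, &m); err != nil {
		return 0, err
	}
	f, ok := m["EventIndex"].(float64)
	if !ok {
		return 0, fmt.Errorf("EventIndex is %T, not float64", m["EventIndex"])
	}
	return int64(f), nil
}

// decodeNumber uses Decoder.UseNumber so the digits are kept as a
// json.Number and parsed directly as an integer.
func decodeNumber(data []byte) (int64, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.UseNumber()
	var m map[string]any
	if err := dec.Decode(&m); err != nil {
		return 0, err
	}
	n, ok := m["EventIndex"].(json.Number)
	if !ok {
		return 0, fmt.Errorf("EventIndex is %T, not json.Number", m["EventIndex"])
	}
	return n.Int64()
}

func main() {
	// Sample Unix nanosecond timestamp, chosen for illustration.
	const index int64 = 1714000000123456789
	data := []byte(fmt.Sprintf(`{"EventIndex":%d}`, index))

	lossy, _ := decodeAny(data)
	exact, _ := decodeNumber(data)
	fmt.Println(lossy == index, exact == index) // prints "false true"
}
```

Decoding into a struct with an `int64` field would preserve the value as well; the float64 detour only happens when the target type is untyped.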
hugoShaka added a commit that referenced this pull request Sep 8, 2025
github-merge-queue bot pushed a commit that referenced this pull request Sep 9, 2025
backport-bot-workflows bot pushed a commit that referenced this pull request Sep 9, 2025
hugoShaka added a commit that referenced this pull request Sep 11, 2025
github-merge-queue bot pushed a commit that referenced this pull request Sep 11, 2025
github-merge-queue bot pushed a commit that referenced this pull request Sep 11, 2025
mmcallister pushed a commit that referenced this pull request Sep 22, 2025

Labels

audit-log, size/sm

Development

Successfully merging this pull request may close these issues.

session.command event hits a ConditionalCheckFailedException on dynamoevents

3 participants