Skip to content

[v18] Marshal iterator into event instead of map[string]any#59029

Merged
hugoShaka merged 1 commit intobranch/v18from
bot/backport-58765-branch/v18
Sep 11, 2025
Merged

[v18] Marshal iterator into event instead of map[string]any#59029
hugoShaka merged 1 commit intobranch/v18from
bot/backport-58765-branch/v18

Conversation

@hugoShaka
Copy link
Copy Markdown
Contributor

Backport #58765 to branch/v18

changelog: Fixed a DynamoDB bug potentially causing event queries to return a different range of events. In the worst case scenario, this bug would block the event-handler.

This commit fixes a data loss bug causing the DynamoDB cursor EventIndex
field to be sightly changed due to conversion issues. As this field is
used to index events, this could lead to paginated queries not returning
the right events, either returning events from before or after the
requirested page. In the worst case, this could cause a livelock as the
query continuisly processes the same events.

The data loss issue is caused by improper JSON unmarshalling of large
integers. This happened because of this reasons:
- JSON is fundamentally flawed as it offers a single number type "binary64"
  for all numbers, whether they are integers or float. Go's
  encoding/json library uses field types to detect if the number should
  be stored in an int64 or a float64.
- [The AWS SDK v2 migration PR](#44363)
  changed the cursor JSON unmarshalling logic and unmarshalled the
  cursor into `map[string]any`. This caused every integer field of
  `event` to round-trip through float64.
- [The Emit event fallback PR](#40854)
  changed the EventIndex value from a small incremental integer to a
  large unix nanosecond timestamp in case of conflict. The large value
  was no longer safe for storage in a float64.

The combination of those 3 factors caused the cursor EventIndex to get
corrupted and caused unexpected event query index offsets. When preseted
with a non-existing document, DynamoDB still hashes it and starts the
query from its supposed location in the index. This is why this issue
has not been detected for so long. Its consequences were:
- duplicated events returned on 2 consecutive pages (this case was
  handled properly by the event forwarder as it keeps track of the last
  processed event)
- livelock if the number of duplicated events exceed the page size
- non-forwarded events if the index offset was in the future
@github-actions github-actions bot added audit-log Issues related to Teleports Audit Log backport size/md labels Sep 11, 2025
@hugoShaka hugoShaka enabled auto-merge September 11, 2025 14:01
@hugoShaka hugoShaka added this pull request to the merge queue Sep 11, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 11, 2025
@hugoShaka hugoShaka added this pull request to the merge queue Sep 11, 2025
Merged via the queue into branch/v18 with commit eedb0b4 Sep 11, 2025
44 checks passed
@hugoShaka hugoShaka deleted the bot/backport-58765-branch/v18 branch September 11, 2025 16:04
@doggydogworld doggydogworld mentioned this pull request Sep 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

audit-log Issues related to Teleports Audit Log backport size/md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants