Add a fallback for EmitAuditEvents failure due to event conflicts (DynamoDB backend) #40854
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 0)))
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
require.Error(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
require.NoError(t, tt.log.EmitAuditEvent(ctx, generateEvent(sessionID, 1)))
Hmm, do we want the audit log to be able to be overwritten?
I feel like the original behavior of "first write wins" is more correct for an audit log.
Should we instead just not treat "already exists" as an error?
> Hmm, do we want the audit log to be able to be overwritten?

This will only happen if the `disable_conflict_check` option is provided. Otherwise, every event still needs a unique session_id/event_index pair. The fallback logic only sets the event index to a different value and tries to put the item again. If there is still a conflict (very unlikely, since we're using the Unix nanosecond timestamp and the index is scoped to the session), the event emission will still fail.
If it makes it clearer, we can add another test that causes the fallback to fail as well. In that case, `EmitAuditEvent` will return an error.
@gabrielcorado See the table below for backport results.
This commit fixes a data loss bug that caused the DynamoDB cursor EventIndex field to be slightly altered by a conversion issue. As this field is used to index events, this could lead to paginated queries not returning the right events, returning events from either before or after the requested page. In the worst case, this could cause a livelock as the query continuously processes the same events.

The data loss issue is caused by improper JSON unmarshalling of large integers. This happened for these reasons:

- JSON is fundamentally flawed in that it offers a single number type (commonly handled as "binary64") for all numbers, whether they are integers or floats. Go's encoding/json library uses the destination field type to decide whether the number should be stored in an int64 or a float64.
- [The AWS SDK v2 migration PR](#44363) changed the cursor JSON unmarshalling logic and unmarshalled the cursor into `map[string]any`. This caused every integer field of `event` to round-trip through float64.
- [The Emit event fallback PR](#40854) changed the EventIndex value from a small incremental integer to a large Unix nanosecond timestamp in case of conflict. The large value was no longer safe for storage in a float64, which represents integers exactly only up to 2^53.

The combination of those 3 factors caused the cursor EventIndex to get corrupted and caused unexpected event query index offsets. When presented with a non-existing document, DynamoDB still hashes it and starts the query from its supposed location in the index. This is why this issue went undetected for so long.

Its consequences were:

- duplicated events returned on 2 consecutive pages (this case was handled properly by the event forwarder, as it keeps track of the last processed event)
- livelock if the number of duplicated events exceeds the page size
- non-forwarded events if the index offset was in the future
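The precision loss described in this commit message can be reproduced with a few lines of Go. The sketch below is illustrative only: the `cursor` struct and the helper names are hypothetical, and the sample value is an arbitrary nanosecond-scale integer, not data from the actual bug.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// cursor is a hypothetical typed stand-in for the pagination cursor.
type cursor struct {
	EventIndex int64 `json:"EventIndex"`
}

// viaAny decodes into map[string]any, so the number becomes a float64
// and integers above 2^53 may be rounded.
func viaAny(data []byte) (int64, error) {
	var m map[string]any
	if err := json.Unmarshal(data, &m); err != nil {
		return 0, err
	}
	f, ok := m["EventIndex"].(float64)
	if !ok {
		return 0, fmt.Errorf("EventIndex is not a number")
	}
	return int64(f), nil
}

// viaStruct decodes into a typed field, so the int64 is parsed exactly.
func viaStruct(data []byte) (int64, error) {
	var c cursor
	err := json.Unmarshal(data, &c)
	return c.EventIndex, err
}

// viaNumber keeps the untyped map but asks the decoder for json.Number,
// which preserves the original digits.
func viaNumber(data []byte) (int64, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.UseNumber()
	var m map[string]any
	if err := dec.Decode(&m); err != nil {
		return 0, err
	}
	n, ok := m["EventIndex"].(json.Number)
	if !ok {
		return 0, fmt.Errorf("EventIndex is not a number")
	}
	return n.Int64()
}

func main() {
	data := []byte(`{"EventIndex": 1700000000000000001}`)
	a, _ := viaAny(data)
	s, _ := viaStruct(data)
	n, _ := viaNumber(data)
	fmt.Println("map[string]any:", a)
	fmt.Println("typed struct:  ", s)
	fmt.Println("json.Number:   ", n)
}
```

Decoding into a typed struct or enabling `UseNumber` are two common ways to avoid the float64 round trip; which one the actual fix uses is not stated here.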
Closes #40126.
Adds a fallback for when the put item operation fails with a conditional check exception (duplicate events).
In addition, we're adding a new option to disable condition checking, which can be configured through the DynamoDB URL. This option can be used to restore the old behavior.
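As a rough illustration of reading such an option from a URL, the sketch below parses a boolean query parameter. The option name `disable_conflict_check` appears in the review thread; the URL shape, the table name, and the `conflictCheckDisabled` helper are assumptions made for this example and may not match how Teleport actually parses its configuration.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// conflictCheckDisabled reports whether the (assumed) query parameter
// disable_conflict_check is set to a true value. A missing parameter
// means the conflict check stays enabled.
func conflictCheckDisabled(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	v := u.Query().Get("disable_conflict_check")
	if v == "" {
		return false, nil
	}
	return strconv.ParseBool(v)
}

func main() {
	// Hypothetical URL; the table name is a placeholder.
	disabled, err := conflictCheckDisabled("dynamodb://audit-events?disable_conflict_check=true")
	fmt.Println(disabled, err)
}
```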
NOTE: This is being solved at the DynamoDB events "layer" because multiple parts of Teleport are subject to this failure (not only the one described in the issue).
changelog: Fix audit event failures when using DynamoDB event storage.