Skip to content

Use time-based event index for app session events#38495

Merged
gabrielcorado merged 6 commits intomasterfrom
gabrielcorado/avoid-index-collision-app-sessions
Feb 29, 2024
Merged

Use time-based event index for app session events#38495
gabrielcorado merged 6 commits intomasterfrom
gabrielcorado/avoid-index-collision-app-sessions

Conversation

@gabrielcorado
Copy link
Copy Markdown
Contributor

@gabrielcorado gabrielcorado commented Feb 21, 2024

Closes #37049

As described on the issue, the events were being overwritten on DynamoDB events storage because they had the same key pair (SessionID and Index). The problem is that every time a new session chunk is initialized, it uses a new recorder, which would then have the index start at 0. This would cause any previous event from the session to be overwritten.

The solution proposed on this PR is, for app sessions, to generate an event index based on the session start (here, the time the certificate was issued), removing the necessity of a distributed counter.

In addition, the solution also needed to take into account if different servers handle an app session. For app access, the sessions are not ended if the server handling it changes (HA). In case that happens, the recorder would start over at index 0 and overwrite all the events emitted by the proxy and other app server.

NOTE: This change only affects app session recordings; other resources will still use the incremental method.

This PR also adds a DynamoDB condition expression to guarantee events are not being overwritten on the storage layer.

changelog: Fix application access events being overwritten when using DynamoDB as event storage.

@gabrielcorado gabrielcorado self-assigned this Feb 21, 2024
@github-actions github-actions Bot added application-access audit-log Issues related to Teleports Audit Log size/md labels Feb 21, 2024
@github-actions
Copy link
Copy Markdown
Contributor

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Comment thread lib/events/dynamoevents/dynamoevents_test.go Outdated
Comment thread lib/events/setter.go Outdated
Comment thread lib/events/setter_test.go Outdated
Comment thread lib/events/recorder/recorder.go Outdated
Comment thread lib/events/setter.go Outdated
Comment thread lib/events/setter.go
Comment thread lib/srv/app/server.go Outdated
Comment thread lib/srv/app/server.go Outdated
}

// certFromConnState returns certificate from the connection.
func certFromConn(tlsConn *tls.Conn) *x509.Certificate {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func certFromConn(tlsConn *tls.Conn) *x509.Certificate {
func leafCertFromConn(tlsConn *tls.Conn) *x509.Certificate {

Comment thread lib/srv/app/server.go Outdated
Comment thread lib/srv/app/server.go Outdated
Comment thread lib/events/setter_test.go
Comment thread lib/events/setter_test.go Outdated
Comment thread lib/events/setter_test.go Outdated
@codingllama
Copy link
Copy Markdown
Contributor

Overall looks good, added a few questions as I'm not super familiar with these parts.

Copy link
Copy Markdown
Contributor

@greedy52 greedy52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me. The key index is used in app sessions recordings as well. After this change, ppl will see big numbers as key index instead of 1,2,3, so we need to document this behavior somewhere.

TableName: aws.String(l.Tablename),
Item: av,
TableName: aws.String(l.Tablename),
ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this break existing environments when auth is upgraded but app server is not yet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as soon as they update their auth servers and events from older app servers, they will hit this, at least for the first app.session.chunk (which conflicts with app.session.start). As mentioned in the PR description, if the app server serving the requests changes, there will be more conflicts (generating more errors).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.

Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having an error log to silently overwriting audit events (current behavior).

@codingllama, Any thoughts on this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't have enough context to have a strong opinion here.

Comment thread lib/srv/app/server.go
// valid date.
func (s *Server) sessionStartTime(ctx context.Context) time.Time {
var startTime time.Time
if userCert, err := authz.UserCertificateFromContext(ctx); err == nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed interesting approach. Chunks are every five minutes if I recall. So using nanoseconds ensures chunks are relatively in order. However in HA case, time on different app servers may not be perfectly in sync.

@codingllama
Copy link
Copy Markdown
Contributor

After this change, ppl will see big numbers as key index instead of 1,2,3, so we need to document this behavior somewhere.

We could also use UnixNano for "smaller" numbers, but it'll still be quite large compared to "1,2,3".

Co-authored-by: Alan Parra <alan.parra@goteleport.com>
@greedy52 greedy52 self-requested a review February 23, 2024 18:48
Copy link
Copy Markdown
Contributor

@greedy52 greedy52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested today and worked as expected 👍 . Do we expect this change affect other event backend?

TableName: aws.String(l.Tablename),
Item: av,
TableName: aws.String(l.Tablename),
ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.

Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.

@gabrielcorado gabrielcorado added this pull request to the merge queue Feb 29, 2024
Merged via the queue into master with commit 89d1aa4 Feb 29, 2024
@gabrielcorado gabrielcorado deleted the gabrielcorado/avoid-index-collision-app-sessions branch February 29, 2024 03:33
@public-teleport-github-review-bot
Copy link
Copy Markdown

@gabrielcorado See the table below for backport results.

Branch Result
branch/v13 Failed
branch/v14 Create PR
branch/v15 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

application-access audit-log Issues related to Teleports Audit Log size/md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

app.session.start events do not appear in DynamoDB backend

4 participants