Use time-based event index for app session events#38495
Use time-based event index for app session events#38495gabrielcorado merged 6 commits intomasterfrom
Conversation
|
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
| } | ||
|
|
||
| // certFromConnState returns certificate from the connection. | ||
| func certFromConn(tlsConn *tls.Conn) *x509.Certificate { |
There was a problem hiding this comment.
| func certFromConn(tlsConn *tls.Conn) *x509.Certificate { | |
| func leafCertFromConn(tlsConn *tls.Conn) *x509.Certificate { |
|
Overall looks good, added a few questions as I'm not super familiar with these parts. |
| TableName: aws.String(l.Tablename), | ||
| Item: av, | ||
| TableName: aws.String(l.Tablename), | ||
| ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"), |
There was a problem hiding this comment.
Would this break existing environments when auth is upgraded but app server is not yet?
There was a problem hiding this comment.
Yes, as soon as they update their auth servers and events from older app servers, they will hit this, at least for the first app.session.chunk (which conflicts with app.session.start). As mentioned in the PR description, if the app server serving the requests changes, there will be more conflicts (generating more errors).
There was a problem hiding this comment.
If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.
Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.
There was a problem hiding this comment.
I prefer having an error log to silently overwriting audit events (current behavior).
@codingllama, Any thoughts on this?
There was a problem hiding this comment.
I'm afraid I don't have enough context to have a strong opinion here.
| // valid date. | ||
| func (s *Server) sessionStartTime(ctx context.Context) time.Time { | ||
| var startTime time.Time | ||
| if userCert, err := authz.UserCertificateFromContext(ctx); err == nil { |
There was a problem hiding this comment.
Indeed interesting approach. Chunks are every five minutes if I recall. So using nanoseconds ensures chunks are relatively in order. However in HA case, time on different app servers may not be perfectly in sync.
We could also use UnixNano for "smaller" numbers, but it'll still be quite large compared to "1,2,3". |
Co-authored-by: Alan Parra <alan.parra@goteleport.com>
greedy52
left a comment
There was a problem hiding this comment.
Tested today and worked as expected 👍 . Do we expect this change affect other event backend?
| TableName: aws.String(l.Tablename), | ||
| Item: av, | ||
| TableName: aws.String(l.Tablename), | ||
| ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"), |
There was a problem hiding this comment.
If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.
Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.
|
@gabrielcorado See the table below for backport results.
|
Closes #37049
As described on the issue, the events were being overwritten on DynamoDB events storage because they had the same key pair (SessionID and Index). The problem is that every time a new session chunk is initialized, it uses a new recorder, which would then have the index start at 0. This would cause any previous event from the session to be overwritten.
The solution proposed on this PR is, for app sessions, to generate an event index based on the session start (here, the time the certificate was issued), removing the necessity of a distributed counter.
In addition, the solution also needed to take into account if different servers handle an app session. For app access, the sessions are not ended if the server handling it changes (HA). In case that happens, the recorder would start over at index 0 and overwrite all the events emitted by the proxy and other app server.
NOTE: This change only affects app session recordings; other resources will still use the incremental method.
This PR also adds a DynamoDB condition expression to guarantee events are not being overwritten on the storage layer.
changelog: Fix application access events being overwritten when using DynamoDB as event storage.