athena audit logs - sqs receive #24038
Conversation
@hugoShaka @rosstimothy PTAL

@hugoShaka friendly ping
// TODO(tobiaszheller): come back at some point and rework configuration of runWhileLocked.
// Now it tries every 250ms to acquire lock which can cause pressure on backend.
err = backend.RunWhileLocked(ctx, c.backend, lockName, lockTTL, func(ctx context.Context) error {
Is there any reason a single Auth can't process multiple batches in a row? Can we use a longer TTL and just delete the lock if there is no more work?
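A minimal sketch of that idea, assuming a hypothetical processBatch helper (the real code passes its callback to backend.RunWhileLocked, as shown in the quoted code above): the callback keeps draining batches until the queue is empty, so one Auth instance can process several batches under a single, longer-TTL lock acquisition instead of re-acquiring the lock every 250ms.

```go
package athena

import "context"

// drainWhileLocked is a hypothetical sketch of the suggestion above.
// It would be passed as the callback to backend.RunWhileLocked with a
// longer TTL, and it keeps processing batches until there is no more
// work, at which point returning lets the caller release the lock.
func drainWhileLocked(ctx context.Context, processBatch func(context.Context) (int, error)) error {
	for {
		n, err := processBatch(ctx) // n = number of messages handled in this batch
		if err != nil {
			return err
		}
		if n == 0 {
			// Queue is drained; stop and let the lock be released.
			return nil
		}
	}
}
```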
I don't have much to say about the code. My biggest concern is how this thing will fail and how we will know and react. I don't think the RFD discussed backpressure and failure modes; if it did, please point me to the discussion. You can disregard or postpone addressing this comment, I'll approve the PR tomorrow anyway.
We have no guarantee that Auth can consume items faster than they pile up in the queue. That's not an issue per se, but when it happens we need to know how the system is doing: whether it is consuming faster or slower than the event input, is stopped, cannot acquire the lock, etc. I think the following metrics would be a solid starting point:
- batch processing duration (histogram)
- batch size (histogram)
- batch count (histogram) (size vs. count because we can have a lot of small events or a few XXL ones; large batches can lead to memory pressure)
- batch processed (counter)
- last event seen (gauge/timestamp)
This will also allow tuning the batch size and flush interval, and measuring how the system behaves under load.
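For illustration, here is a minimal sketch of how such metrics could be declared with the Prometheus Go client. The package name, metric names, and bucket choices are hypothetical placeholders, not what the PR ships.

```go
package athena

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric definitions covering the list above.
var (
	batchProcessingDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "audit_sqs_batch_processing_seconds",
		Help:    "Time spent processing a single batch of SQS messages.",
		Buckets: prometheus.DefBuckets,
	})
	batchSizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "audit_sqs_batch_size_bytes",
		Help:    "Total size in bytes of the events in a processed batch.",
		Buckets: prometheus.ExponentialBuckets(1024, 4, 10),
	})
	batchEventCount = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "audit_sqs_batch_event_count",
		Help:    "Number of events in a processed batch.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
	batchesProcessed = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "audit_sqs_batches_processed_total",
		Help: "Total number of batches processed.",
	})
	lastEventSeen = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "audit_sqs_last_event_seen_timestamp_seconds",
		Help: "Unix timestamp of the most recently received event.",
	})
)

func init() {
	// Register everything with the default registry so it is exposed
	// on the metrics endpoint.
	prometheus.MustRegister(batchProcessingDuration, batchSizeBytes,
		batchEventCount, batchesProcessed, lastEventSeen)
}
```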
maxWaitTimeOnReceiveMessageFromSQS = 5 * time.Second
// maxNumberOfWorkers defines how many workers are processing messages
// from queue or writing parquet files to s3.
maxNumberOfWorkers = 5
Why was the number 5 chosen?
I hardcoded 5 for now based on gut feeling. In the future this number should probably depend on how many items are in the queue. 5 workers on my dev machine were enough to handle the max load defined in the cloud RFD (250 events/s if I remember correctly).
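As a rough illustration of making that configurable (names here are hypothetical; the PR itself just uses the maxNumberOfWorkers constant), the worker count could become a parameter that is later derived from queue depth:

```go
package athena

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runWorkers is a hypothetical sketch: the worker count is a parameter
// (which could later be derived from queue depth) instead of a
// hardcoded constant. receiveAndProcess stands in for the real
// per-worker receive loop.
func runWorkers(ctx context.Context, numWorkers int, receiveAndProcess func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for i := 0; i < numWorkers; i++ {
		g.Go(func() error {
			return receiveAndProcess(ctx)
		})
	}
	// Wait returns the first error from any worker; the shared ctx is
	// canceled when that happens.
	return g.Wait()
}
```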
@hugoShaka thanks for raising it, and sorry for not making it clear in the description. I am planning to come back in later PRs to add multiple metrics and replace the debug messages. For the metrics PRs I was planning to involve the Cloud team and get their insights, because they will be monitoring those. Some metrics are available from AWS out of the box, but I am not sure whether the Cloud team prefers to use the AWS ones or whether we should publish our own. We will also utilize a dead-letter queue for messages that cannot be processed. This PR is already complex, so I think pushing metrics to another one is reasonable.
@russjones @rosstimothy I have decided to move the "locking" part into another PR and keep only the reading from SQS here.
@rosstimothy |
@tobiaszheller See the table below for backport results.
Part of https://github.com/gravitational/teleport.e/issues/894
RFD: #23700
This PR adds a receiver of SQS messages for audit logs.
It uses a channel to send messages from "receiver workers" to "s3 workers", which will be responsible for writing parquet files to s3. Note that the "s3 workers" are not part of this PR and will be added in a separate one.
A channel is used because we want to start writing the parquet file as soon as we receive the first events, even though we keep listening for the whole batch interval.
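As a rough illustration of that flow (the message type, runReceiver, and receive below are hypothetical placeholders, not the PR's actual identifiers):

```go
package athena

import "context"

// message is a placeholder for a parsed audit event received from SQS.
type message struct {
	payload []byte
}

// runReceiver is a hypothetical sketch of the pipeline: receiver
// workers push into a shared channel as soon as messages arrive, so
// the (future) s3 workers can start writing parquet data before the
// whole batch interval has elapsed.
func runReceiver(ctx context.Context, receive func(context.Context) ([]message, error)) <-chan message {
	messagesChan := make(chan message, 100)
	go func() {
		defer close(messagesChan)
		for ctx.Err() == nil {
			msgs, err := receive(ctx)
			if err != nil {
				continue // real code would log the error and back off
			}
			for _, m := range msgs {
				select {
				case messagesChan <- m:
				case <-ctx.Done():
					return
				}
			}
		}
	}()
	return messagesChan
}
```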