
athena audit logs - sqs receive #24038

Merged
tobiaszheller merged 1 commit into master from tobiaszheller/auditevents-athena-sqs
Apr 25, 2023

Conversation

@tobiaszheller
Contributor

@tobiaszheller tobiaszheller commented Apr 4, 2023

Part of https://github.com/gravitational/teleport.e/issues/894
RFD: #23700

This PR adds a receiver of SQS messages for audit logs.
It uses a channel to pass messages from "receiver workers" to "s3 workers", which will be responsible for writing Parquet files to S3. Note that the "s3 workers" are not part of this PR and will be added in a separate one.

A channel is used because we want to start writing the Parquet file as soon as we receive the first events, even though we keep listening for the whole batch interval.
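The receiver-to-writer handoff described above can be sketched with a plain Go channel. This is a minimal stand-in, not the PR's actual types: `event`, `run`, and the worker counts are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// event stands in for a parsed audit event pulled off SQS.
type event struct{ id int }

// run fans events from several "receiver workers" into a shared
// channel; "writer workers" can start consuming immediately, before
// the batch interval ends, instead of waiting for a full batch.
func run(numReceivers, numWriters, perReceiver int) int {
	events := make(chan event)

	var recvWG sync.WaitGroup
	for r := 0; r < numReceivers; r++ {
		recvWG.Add(1)
		go func(r int) {
			defer recvWG.Done()
			for i := 0; i < perReceiver; i++ {
				events <- event{id: r*perReceiver + i}
			}
		}(r)
	}
	// Close the channel once every receiver is done, so writers exit.
	go func() {
		recvWG.Wait()
		close(events)
	}()

	var mu sync.Mutex
	written := 0 // total events "written to S3"
	var wrWG sync.WaitGroup
	for w := 0; w < numWriters; w++ {
		wrWG.Add(1)
		go func() {
			defer wrWG.Done()
			for range events {
				mu.Lock()
				written++
				mu.Unlock()
			}
		}()
	}
	wrWG.Wait()
	return written
}

func main() {
	fmt.Println(run(5, 2, 10)) // 5 receivers × 10 events each
}
```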

@github-actions github-actions Bot requested a review from hugoShaka April 4, 2023 12:19
@github-actions github-actions Bot added audit-log Issues related to Teleport's Audit Log size/lg labels Apr 4, 2023
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/athena.go
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
Comment thread lib/events/athena/athena.go Outdated
@tobiaszheller
Contributor Author

@hugoShaka @rosstimothy PTAL

Comment thread lib/events/athena/athena.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go
@tobiaszheller
Contributor Author

@hugoShaka friendly ping

Comment thread lib/events/athena/athena.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment on lines +136 to +138
// TODO(tobiaszheller): come back at some point and rework configuration of runWhileLocked.
// Now it tries every 250ms to acquire lock which can cause pressure on backend.
err = backend.RunWhileLocked(ctx, c.backend, lockName, lockTTL, func(ctx context.Context) error {
Contributor

Is there any reason a single Auth can't process multiple batches in a row? Can we use a longer TTL and just delete the lock if there is no more work?
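The reviewer's suggestion can be sketched as follows: acquire one longer-TTL lock, keep processing batches in a loop while work remains, then release. The `drainUnderLock` helper and the in-memory queue are hypothetical; the real code uses `backend.RunWhileLocked`.

```go
package main

import "fmt"

// drainUnderLock sketches processing multiple batches in a row under
// a single lock, instead of re-acquiring a short lock every 250ms.
// queue and batchSize are stand-ins for the SQS queue contents.
func drainUnderLock(queue []string, batchSize int) (batches int) {
	// Imagine the lock is acquired here with a generous TTL...
	for len(queue) > 0 {
		n := batchSize
		if n > len(queue) {
			n = len(queue)
		}
		queue = queue[n:] // process one batch
		batches++
	}
	// ...and deleted here, once there is no more work.
	return batches
}

func main() {
	fmt.Println(drainUnderLock(make([]string, 23), 10)) // 23 items, 3 batches
}
```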

Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go
Comment thread lib/events/athena/consumer.go Outdated
Contributor

@hugoShaka hugoShaka left a comment

I've not much to say about the code. My biggest concern is how this thing will fail and how we will know and react when it does. I don't think the RFD discussed backpressure and failure modes; if it did, please point me to the discussion. You can disregard or postpone addressing this comment; I'll approve the PR tomorrow anyway.

We have no guarantee that the auth can consume items faster than they pile up in the queue. That's not an issue per se, but when it happens we need to know how the system is doing: whether it is consuming faster or slower than the event input, is stopped, cannot acquire the lock, and so on. I think the following metrics would be a solid starting point:

  • batch processing duration (histogram)
  • batch size (histogram)
  • batch count (histogram) (size vs. count because we can have a lot of small events or a few XXL ones; large batches can lead to memory pressure)
  • batches processed (counter)
  • last event seen (gauge/timestamp)

This would also allow tuning the batch size and flush interval, and measuring how the system behaves under load.

Comment thread lib/events/athena/consumer.go Outdated
maxWaitTimeOnReceiveMessageFromSQS = 5 * time.Second
// maxNumberOfWorkers defines how many workers are processing messages
// from queue or writing parquet files to s3.
maxNumberOfWorkers = 5
Contributor

Why was the number 5 chosen?

Contributor Author

I hardcoded 5 for now based on gut feeling. In the future this number should probably depend on how many items are in the queue. Five workers on my dev machine were enough to handle the max load defined in the cloud RFD (250 events/s, if I remember correctly).
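Making the worker count depend on queue depth, as suggested above, could look roughly like this. The thresholds are invented for illustration; `workersFor` is a hypothetical helper, not part of the PR.

```go
package main

import "fmt"

// workersFor scales the number of workers with the approximate SQS
// queue depth: roughly one worker per 1000 queued messages, clamped
// to [1, max] so the pool never disappears or grows unbounded.
func workersFor(queueDepth, max int) int {
	n := queueDepth/1000 + 1
	if n > max {
		return max
	}
	return n
}

func main() {
	// Idle queue, moderate backlog, large backlog.
	fmt.Println(workersFor(0, 5), workersFor(2500, 5), workersFor(50000, 5))
}
```

The queue depth itself could come from the `ApproximateNumberOfMessages` queue attribute that SQS exposes.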

Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
@tobiaszheller
Contributor Author

> My biggest concern is how this thing will fail and how to know/react. […] I think the following metrics would be a solid starting point […]

@hugoShaka thanks for raising it, and sorry for not making it clear in the description. I am planning to come back in follow-up PRs to add multiple metrics and replace the debug messages. For the metrics PRs I was planning to involve the Cloud team and get their insights, because they will be monitoring those. Some metrics are available from AWS out of the box, but I am not sure whether the Cloud team prefers to use the AWS ones or whether we should publish our own. We will also utilize a dead-letter queue for messages that cannot be processed.

This PR is already complex, so I think deferring metrics to a separate PR is reasonable.

Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer_test.go
Comment thread lib/events/athena/consumer_test.go
@tobiaszheller
Contributor Author

@russjones @rosstimothy I have decided to move the "locking" part into another PR and keep only the reading from SQS here.

@tobiaszheller
Contributor Author

@rosstimothy
I have added two more improvements to prevent putting too much pressure on the CPU when something goes wrong:

Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
Comment thread lib/events/athena/consumer_test.go Outdated
@tobiaszheller tobiaszheller force-pushed the tobiaszheller/auditevents-athena-sqs branch from 8b6e9e4 to 5d1dd3b April 24, 2023 16:35
@tobiaszheller tobiaszheller enabled auto-merge April 24, 2023 16:37
@tobiaszheller tobiaszheller added this pull request to the merge queue Apr 24, 2023
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 24, 2023
@tobiaszheller tobiaszheller added this pull request to the merge queue Apr 24, 2023
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 24, 2023
@tobiaszheller tobiaszheller force-pushed the tobiaszheller/auditevents-athena-sqs branch from 5d1dd3b to a7f01d4 April 25, 2023 09:44
@tobiaszheller tobiaszheller enabled auto-merge April 25, 2023 09:44
@tobiaszheller tobiaszheller force-pushed the tobiaszheller/auditevents-athena-sqs branch from a7f01d4 to 7eee9db April 25, 2023 09:56
@tobiaszheller tobiaszheller added this pull request to the merge queue Apr 25, 2023
Merged via the queue into master with commit e93c6a9 Apr 25, 2023
@tobiaszheller tobiaszheller deleted the tobiaszheller/auditevents-athena-sqs branch April 25, 2023 10:36
@public-teleport-github-review-bot

@tobiaszheller See the table below for backport results.

Branch Result
branch/v13 Failed

@public-teleport-github-review-bot

@tobiaszheller See the table below for backport results.

Branch Result
branch/v13 Create PR

Labels: audit-log (Issues related to Teleport's Audit Log), size/lg

4 participants