Avoid throttling on dynamodb DescribeStream #53790

Merged
espadolini merged 2 commits into master from espadolini/dynamodb-throttling-fix
Apr 8, 2025
espadolini (Contributor) commented Apr 8, 2025

The DynamoDB backend driver periodically refreshes its list of known DynamoDB stream shards (every PollStreamPeriod, defaulting to 1 second) by calling the dynamodb:DescribeStream API. That API is documented as rate limited to 10 calls per second for a given stream, but it is also paginated, and certain abnormal DynamoDB workloads (often the result of bugs, like the one fixed by #53298) can create so many shards that a full listing takes several pages of DescribeStream results. We currently don't limit how quickly we advance through pages, and Teleport deployments on DynamoDB usually run two Auth Service instances, so the combined call rate can cause enough throttling that the AWS SDK's default retry behavior gives up and surfaces the throttling error anyway.
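To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of the sustained call rate against a single stream. It is not code from this PR: the function names are made up for illustration, and the page size of 100 shards per DescribeStream call is an assumption. It also only estimates the average rate; unpaced paging issues the calls in a burst, which is worse than the average suggests.

```go
package main

import "fmt"

// pagesPerRefresh returns how many DescribeStream calls one full shard
// listing needs for the given shard count and page size (ceiling division).
// Hypothetical helper, for illustration only.
func pagesPerRefresh(shards, pageSize int) int {
	if shards <= 0 {
		return 1 // even a stream with no shards costs one call
	}
	return (shards + pageSize - 1) / pageSize
}

// callsPerSecond estimates the average DescribeStream call rate against one
// stream when each of `auths` Auth Service instances performs one full
// listing every periodSeconds.
func callsPerSecond(shards, pageSize, auths int, periodSeconds float64) float64 {
	return float64(pagesPerRefresh(shards, pageSize)*auths) / periodSeconds
}

func main() {
	const limit = 10.0   // documented per-stream limit, calls per second
	const pageSize = 100 // assumed number of shards returned per page
	for _, shards := range []int{50, 400, 900} {
		rate := callsPerSecond(shards, pageSize, 2, 1)
		fmt.Printf("shards=%d calls/s=%.0f overLimit=%t\n", shards, rate, rate > limit)
	}
}
```

Under these assumptions, two auths refreshing every second stay below the limit until a listing needs more than five pages each.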

Such throttling periodically breaks the backend event stream (as often as once every 2 or 3 minutes). Each break resets the auth cache, which in turn makes all instances reset their caches and results in generally poor UX: with a broken backend event stream, changes can take a while to take effect, which breaks web logins and the Terraform provider.

This PR adds a forced wait between calls to DescribeStream, and changes the shard refresh interval: instead of starting a refresh once every PollStreamPeriod, the driver now waits for a PollStreamPeriod between the end of one refresh and the beginning of the next.

This change has been tested with a dev build on a cloud tenant affected by the throttling, and it solved the issue. I have run the lib/backend/dynamo tests a few times locally (against AWS, since there is unfortunately still no good DynamoDB simulator) with no failures.

Example error in the Auth Service logs:

"message":"Poll streams returned with error","component":"dynamodb","error":"operation error DynamoDB Streams: DescribeStream, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: [cut], api error ThrottlingException: Rate exceeded"

changelog: fixed throttling in the DynamoDB backend event stream for tables with a large number of stream shards

public-teleport-github-review-bot (bot) removed the request for review from fspmarshall April 8, 2025 14:20
espadolini added this pull request to the merge queue Apr 8, 2025
github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Apr 8, 2025
espadolini added this pull request to the merge queue Apr 8, 2025
Merged via the queue into master with commit 190241b Apr 8, 2025
43 of 47 checks passed
espadolini deleted the espadolini/dynamodb-throttling-fix branch April 8, 2025 15:55
backport-bot-workflows (Contributor) commented

@espadolini See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v15 | Create PR |
| branch/v16 | Create PR |
| branch/v17 | Create PR |
