Avoid throttling on dynamodb DescribeStream #53790
Merged
espadolini merged 2 commits into master on Apr 8, 2025
rosstimothy
approved these changes
Apr 8, 2025
Tener
approved these changes
Apr 8, 2025
fspmarshall
approved these changes
Apr 8, 2025
@espadolini See the table below for backport results.
The DynamoDB backend driver periodically refreshes its known list of DynamoDB stream shards (every `PollStreamPeriod`, defaulting to 1 second) by calling the `dynamodb:DescribeStream` API. That API is documented to be rate limited to 10 calls per second for a given stream, but it is paginated, and certain abnormal DynamoDB workloads (often the result of bugs, like the one fixed by #53298) can create so many shards that it takes several pages of `DescribeStream` results to get the full list. We currently don't limit how quickly we advance through pages, and Teleport deployments on DynamoDB usually run two auths, which can cause so much throttling that the default retry behavior of the AWS SDK ends up surfacing the throttling error anyway.

Such throttling periodically trips up the backend event stream (as often as once every 2 or 3 minutes). That resets the auth cache, which in turn makes all instances reset their caches and leads to generally poor UX (a broken backend event stream can mean that changes only take effect after a while, breaking web logins and the Terraform provider).
This PR adds a forced wait between calls to `DescribeStream`, and changes the shard refresh interval from once every `PollStreamPeriod` to one `PollStreamPeriod` between the end of the previous refresh and the beginning of the next.

This change has been tested with a dev build for a cloud tenant affected by the throttling, and it solved the issue. I have run the `lib/backend/dynamo` tests a few times locally (against AWS, since there is unfortunately still no good DynamoDB simulator) with no failures.

Example error in the Auth Service logs:
changelog: fixed throttling in the DynamoDB backend event stream for tables with a large number of stream shards