Improve role fetching and unmarshaling#42007

Merged
espadolini merged 2 commits into master from espadolini/improve-role-data-handling on May 24, 2024
Conversation

@espadolini
Contributor

@espadolini espadolini commented May 24, 2024

This PR changes the default page size for ListRoles from 1000 to 100 (the value used by the API client by default, and by extension by the cache initialization process). Roles can require a lot of effort during unmarshaling, since we validate expressions, which requires parsing them. After a cluster event such as an auth server restart, a lot of nodes will all connect at the same time and fetch roles 1000 at a time, which can spike the memory used by the auth server. The default iteration page size of 1000 was picked somewhat arbitrarily, and it works well for resources that are just data with very little effort required during unmarshaling; roles, however, can be pretty big and hang around for a while, so tuning the page size down helps mitigate these memory bursts.

In addition, this PR changes the version check for roles to use jsoniter.Get to read the one field we need, rather than running encoding/json's unmarshaler over a whole resource header. A simple benchmark that unmarshals a reasonably "average" role containing one where expression shows a 28% improvement in time spent in services.UnmarshalRoleV6, with a small reduction in allocations as well.
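The PR itself calls jsoniter.Get(data, "version") to pull out the single field lazily. As a standard-library illustration of the same idea (a sketch, not the PR's code), one can walk the JSON token stream and stop as soon as the top-level key is found, never materializing the rest of the header:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// topLevelString returns the string value of a top-level key,
// stopping as soon as it is found. This mimics what
// jsoniter.Get(data, "version") does for the version check,
// using only the standard library.
func topLevelString(data []byte, key string) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	// Expect the opening '{' of the top-level object.
	if tok, err := dec.Token(); err != nil || tok != json.Delim('{') {
		return "", fmt.Errorf("not a JSON object")
	}
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return "", err
		}
		k, _ := keyTok.(string)
		if k == key {
			var v string
			if err := dec.Decode(&v); err != nil {
				return "", err
			}
			return v, nil
		}
		// Skip this key's value without building a Go value for it.
		var skip json.RawMessage
		if err := dec.Decode(&skip); err != nil {
			return "", err
		}
	}
	return "", fmt.Errorf("key %q not found", key)
}

func main() {
	raw := []byte(`{"kind":"role","version":"v6","metadata":{"name":"access"}}`)
	v, err := topLevelString(raw, "version")
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // v6
}
```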

Benchmark result:
current
BenchmarkUnmarshalRole    210998             27651 ns/op           17377 B/op        331 allocs/op

new
BenchmarkUnmarshalRole    293476             19791 ns/op           16521 B/op        320 allocs/op
CPU profile flame graphs for a relative comparison (with FastUnmarshal and ValidateRole being the same in both): [two flame graph screenshots omitted]
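For reference, numbers of this shape come from Go's benchmark tooling. The sketch below uses testing.Benchmark so it runs as a plain program; the payload and unmarshalRole are made-up stand-ins for the real role data and services.UnmarshalRoleV6, just to show how ns/op, B/op, and allocs/op are produced.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// Hypothetical stand-in for a role payload; the PR benchmarks an
// "average" role with a where expression.
var raw = []byte(`{"kind":"role","version":"v6","metadata":{"name":"access"},"spec":{"allow":{"logins":["root"]}}}`)

// unmarshalRole is an illustrative stand-in for services.UnmarshalRoleV6.
func unmarshalRole(data []byte) (map[string]any, error) {
	var m map[string]any
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`;
	// the PR's numbers come from `go test -bench` on the real code.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := unmarshalRole(raw); err != nil {
				b.Fatal(err)
			}
		}
	})
	// Prints iterations and ns/op, e.g. "  500000	      2400 ns/op".
	fmt.Println(res.String(), res.MemString())
}
```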

changelog: reduced memory and cpu usage after control plane restarts in clusters with a high number of roles

@public-teleport-github-review-bot public-teleport-github-review-bot Bot removed the request for review from ibeckermayer May 24, 2024 10:57
@espadolini espadolini requested a review from rosstimothy May 24, 2024 11:35
@espadolini
Contributor Author

espadolini commented May 24, 2024

For the record: the existing implementation sits at 27651 ns/op, and this one with jsoniter.Get at 19791 ns/op; simply replacing json.Unmarshal with utils.FastUnmarshal (i.e. using jsoniter for a full ResourceHeader) lands at 22823 ns/op. It's well worth applying the same approach to resources that are still using json.Unmarshal, but maybe not so much if they're already using utils.FastUnmarshal.

@public-teleport-github-review-bot public-teleport-github-review-bot Bot removed the request for review from rosstimothy May 24, 2024 13:36
@espadolini espadolini added this pull request to the merge queue May 24, 2024
Merged via the queue into master with commit bfa3a0e May 24, 2024
@espadolini espadolini deleted the espadolini/improve-role-data-handling branch May 24, 2024 17:35
@public-teleport-github-review-bot

@espadolini See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v14 | Failed |
| branch/v15 | Failed |

espadolini added a commit that referenced this pull request May 27, 2024
* Reduce default ListRoles page size

* Avoid stdjson in UnmarshalRoleV6
espadolini added a commit that referenced this pull request May 27, 2024
* Reduce default ListRoles page size

* Avoid stdjson in UnmarshalRoleV6
github-merge-queue Bot pushed a commit that referenced this pull request May 28, 2024
* Reduce default ListRoles page size

* Avoid stdjson in UnmarshalRoleV6
github-merge-queue Bot pushed a commit that referenced this pull request May 28, 2024
* Reduce default ListRoles page size

* Avoid stdjson in UnmarshalRoleV6


4 participants