Skip to content

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Dec 14, 2020

What this PR does:
We've been reported an high memory utilisation in the store-gateway when shuffle-sharding is enabled and the cluster has a large number of tenants and replicas. This memory utilisation increase is caused by the in-memory subring cache used by the shuffle-sharding, which is used to reduce the CPU utilisation due to the expensive computation of the subring.

However, store-gateway, ruler and compactor use Ring.ShuffleShard() only to check whether a tenant/block/rulegroup belongs to their own shard and this operation is run, for each tenant, only at a regular interval (it doesn't depend on QPS).

Disabling subring cache is a possible solution (proposed in this PR), but then CPU utilisation is expected to increase. For this reason, in this PR I've also worked to optimise Ring.ShuffleShard() paying attention to not increase CPU time for Ring.Get() (problem is that it's easy to optimise ShuffleShard() but hard to do without impacting Get()).

BenchmarkRing_ShuffleShard_512Tokens (512 tokens per instance)

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_ShuffleShard_512Tokens-12     1396063       389304        -72.11%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_ShuffleShard_512Tokens-12     60             59             -1.67%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_ShuffleShard_512Tokens-12     681775        58226         -91.46%

BenchmarkRing_Get

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_Get-12                        853           764           -10.43%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_Get-12                        0              0              +0.00%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_Get-12                        0             0             +0.00%

BenchmarkRing_ShuffleShard (128 tokens per instance)

BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_3-12              100674        32002         -68.21%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_10-12             393978        152904        -61.19%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_30-12             1316209       639339        -51.43%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_3-12              117988        48928         -58.53%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_10-12             525024        183298        -65.09%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_30-12             1221208       558804        -54.24%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_3-12             112511        37656         -66.53%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_10-12            408735        153631        -62.41%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_30-12            1328229       655980        -50.61%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_3-12             121937        49367         -59.51%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_10-12            466278        176984        -62.04%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_30-12            1265773       562446        -55.57%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_3-12            95741         31072         -67.55%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_10-12           376408        143209        -61.95%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_30-12           1328692       616708        -53.59%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_3-12            114967        47096         -59.04%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_10-12           441547        176075        -60.12%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_30-12           1288778       551126        -57.24%

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pkg/ring/ring.go Outdated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config option is "disabled bool" instead of "enabled bool" so that the default will be "enabled" (because the zero value of a bool is false).

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@pracucci pracucci marked this pull request as draft December 15, 2020 15:05
@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from be49fee to 438fa45 Compare December 15, 2020 16:55
@pracucci pracucci requested a review from pstibrany December 15, 2020 16:56
pkg/ring/util.go Outdated
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ring lifecycler it's already guaranteed all tokens are sorted. I've introduced sorting here as well to have same guarantee in tests too.

@pracucci pracucci changed the title Disabled subring cache in store-gateway, ruler and compactor Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor Dec 15, 2020
@pracucci pracucci marked this pull request as ready for review December 15, 2020 18:49
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far. (I thought it's still a draft :))

@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from c0d0e97 to 9d2de1d Compare December 16, 2020 11:38
@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from 3e87a3d to 49e56e8 Compare January 7, 2021 10:11
@pracucci pracucci merged commit e4867d8 into cortexproject:master Jan 8, 2021
@pracucci pracucci deleted the do-not-cache-subring-if-not-required branch January 8, 2021 14:07
pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021
pracucci added a commit that referenced this pull request Feb 12, 2021
pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021
khaines pushed a commit that referenced this pull request Feb 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants