feat(slices): Make Subscription Scheduler filter by slice ID #3338

ayirr7 · 2022-11-03T15:33:34Z

For general context:
https://getsentry.atlassian.net/browse/SNS-1759

Up-to-date approach for scheduling Subscriptions in a sliced context:

1. Get the partition key from the corresponding EntitySubscription.
2. Map the partition key's value (an integer) to a slice ID using the partitioning functions.
3. Check whether this slice ID equals the slice ID passed into the SubscriptionScheduler instance.
4. If they are equal, keep the Subscription; otherwise, filter it out.
5. If a slice_id is passed in, ensure that tasks are scheduled on the sliced (physical) subscription scheduler topic.
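
The approach above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `Subscription`, `LOGICAL_PARTITIONS`, the modulo mapping, and `PARTITION_TO_SLICE` are hypothetical stand-ins for the real partitioning functions and settings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LOGICAL_PARTITIONS = 256  # assumed number of logical partitions

# Hypothetical layout: logical partitions 0-127 live on slice 0, 128-255 on slice 1.
PARTITION_TO_SLICE: Dict[int, int] = {
    p: 0 if p < 128 else 1 for p in range(LOGICAL_PARTITIONS)
}


@dataclass(frozen=True)
class Subscription:
    """Stand-in for a subscription; only the partition key matters here."""

    identifier: str
    partition_key: int  # e.g. an organization id


def slice_for(partition_key: int) -> int:
    """Map a partition key to a logical partition, then to a slice id."""
    logical_partition = partition_key % LOGICAL_PARTITIONS
    return PARTITION_TO_SLICE[logical_partition]


def keep_for_slice(
    subscriptions: List[Subscription], slice_id: Optional[int]
) -> List[Subscription]:
    """Keep only the subscriptions whose partition key maps to slice_id."""
    if slice_id is None:
        # Unsliced scheduler: nothing to filter.
        return subscriptions
    return [s for s in subscriptions if slice_for(s.partition_key) == slice_id]
```

With this layout, a scheduler started for slice 1 would only see subscriptions whose key falls in logical partitions 128-255.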

Update: There was discussion around changing this approach. However, we have decided to move forward with it for now and will most likely refactor once the changes to EntitySubscription are merged.

codecov-commenter · 2022-11-03T23:15:28Z

Codecov Report

Base: 92.25% // Head: 21.91% // Decreases project coverage by -70.33% ⚠️

Coverage data is based on head (64f18eb) compared to base (a6a41ef).
Patch coverage: 11.66% of modified lines in pull request are covered.

❗ Current head 64f18eb differs from pull request most recent head 61ddcb8. Consider uploading reports for the commit 61ddcb8 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #3338       +/-   ##
===========================================
- Coverage   92.25%   21.91%   -70.34%     
===========================================
  Files         725      684       -41     
  Lines       33828    32451     -1377     
===========================================
- Hits        31207     7112    -24095     
- Misses       2621    25339    +22718

Impacted Files	Coverage Δ
snuba/cli/subscriptions_scheduler.py	`0.00% <0.00%> (ø)`
snuba/clickhouse/formatter/expression.py	`33.33% <0.00%> (-61.49%)`	⬇️
snuba/clickhouse/formatter/query.py	`0.00% <0.00%> (-98.81%)`	⬇️
snuba/datasets/entities/factory.py	`0.00% <0.00%> (-92.11%)`	⬇️
snuba/migrations/groups.py	`95.61% <ø> (ø)`
snuba/query/__init__.py	`42.42% <ø> (-51.27%)`	⬇️
snuba/subscriptions/scheduler.py	`0.00% <0.00%> (-97.89%)`	⬇️
snuba/subscriptions/scheduler_consumer.py	`0.00% <0.00%> (-92.91%)`	⬇️
...uba/subscriptions/scheduler_processing_strategy.py	`0.00% <0.00%> (-90.59%)`	⬇️
snuba/web/db_query.py	`0.00% <ø> (-84.80%)`	⬇️
... and 656 more

evanh · 2022-11-07T16:57:08Z

@enochtangg You should look at this, we'll need to factor this into the EntitySubscription changes.

dbanda · 2022-11-09T20:46:46Z

snuba/subscriptions/scheduler.py

+                    partition_key_value = entity_sub.get_partitioning_key()
+
+                    # map the partition key's value to the slice ID
+                    logical_part = map_org_id_to_logical_partition(partition_key_value)


It seems like you are assuming here that the partition key for entity_sub is always org_id. Is that always the case? This looks tightly coupled to the implementation of entity_sub and would break if we change the partitioning key. SessionSubscription has an organization attribute; maybe we can use that to make it clear we are getting the org_id? Or maybe map_org_id_to_logical_partition can be refactored to be more general, e.g. map_partition_key_to_logical_partition.

That's a good point, and it makes sense. I think it would be good to change it to map_partition_key_to_logical_partition as you are suggesting.

nikhars

Please add slice_id tag to the MetricsWrapper when slice_id is passed via the CLI
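
A minimal sketch of what this request could look like, assuming a wrapper that attaches default tags to every metric it emits. `RecordingMetrics`, `build_metrics`, and the metric name prefix are hypothetical stand-ins, not the MetricsWrapper API itself.

```python
from typing import Dict, List, Optional, Tuple


class RecordingMetrics:
    """Hypothetical stand-in for a metrics wrapper with default tags."""

    def __init__(self, name: str, tags: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.tags: Dict[str, str] = dict(tags or {})
        self.calls: List[Tuple[str, int, Dict[str, str]]] = []

    def increment(self, key: str, value: int = 1) -> None:
        # Every emitted metric carries the default tags.
        self.calls.append((f"{self.name}.{key}", value, dict(self.tags)))


def build_metrics(entity_name: str, slice_id: Optional[int]) -> RecordingMetrics:
    """Add a slice_id tag only when the CLI was given a slice id."""
    tags = {"entity": entity_name}
    if slice_id is not None:
        tags["slice_id"] = str(slice_id)
    return RecordingMetrics("subscriptions.scheduler", tags)
```

Tagging at construction time means individual call sites do not need to know whether the scheduler is sliced.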

nikhars · 2022-11-22T18:21:01Z

snuba/subscriptions/scheduler.py

@@ -321,6 +329,36 @@ def __reset_builder(self) -> None:
            # We are transitioning between jittered and immediate mode. We must use the delegate builder.
            self.__builder = self.__delegate_builder

+    def __get_filtered_subscriptions(self) -> List[Subscription]:


I would recommend that you pass the original list of subscriptions to this method rather than relying on self.__subscriptions and rename the method to __filter_subscriptions. Looking at the usage of self.__subscriptions I don't think there is value in having the private instance variable. You could probably get rid of it completely.

Sure, this makes sense. Updated.

nikhars · 2022-11-22T18:23:30Z

snuba/subscriptions/scheduler.py

@@ -321,6 +329,36 @@ def __reset_builder(self) -> None:
            # We are transitioning between jittered and immediate mode. We must use the delegate builder.
            self.__builder = self.__delegate_builder

+    def __get_filtered_subscriptions(self) -> List[Subscription]:
+        if self.__slice_id is not None:


It would be better to move the slice_id checks into __get_subscriptions and call __get_filtered_subscriptions only when a slice_id is set.

nikhars · 2022-11-22T18:25:21Z

snuba/datasets/entity_subscriptions/entity_subscription.py

@@ -88,6 +92,9 @@ def get_entity_subscription_conditions_for_snql(
    def to_dict(self) -> Mapping[str, Any]:
        return {"organization": self.organization}

+    def get_partitioning_key(self) -> int:


Let's implement this method for GenericMetricsSetsSubscription and GenericMetricsDistributionsSubscription only for now since they are the only entities which need slicing.

Ok, this makes sense
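
One way to read this suggestion, sketched with simplified stand-in classes. The `...Sketch` class names and the choice to return `None` from the base class are assumptions for illustration; the real classes in entity_subscription.py carry more state, and the diff above declares the return type as `int`.

```python
from typing import Optional


class EntitySubscriptionSketch:
    """Simplified stand-in for a base entity subscription."""

    def get_partitioning_key(self) -> Optional[int]:
        # Assumption: entities that are not sliced expose no partitioning key.
        return None


class GenericMetricsSetsSubscriptionSketch(EntitySubscriptionSketch):
    """Stand-in for a sliced entity's subscription, keyed by organization."""

    def __init__(self, organization: int) -> None:
        self.organization = organization

    def get_partitioning_key(self) -> Optional[int]:
        # Only sliced entities override the method.
        return self.organization
```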

nikhars · 2022-12-08T17:10:32Z

snuba/subscriptions/scheduler.py

+            # get the metadata and org_id from the Subscription
+            sub_data = subscription.data
+            sub_metadata = sub_data.metadata
+            org_id = sub_metadata["organization"]


Is there a way to make access to the metadata fields be specific to an Entity? Not all subscriptions would have an organization field. So if tomorrow this method gets called for some other Entity, this would cause an Exception

Good point. I think we can add a condition for checking the EntityKey that corresponds with a Subscription. Where we add this condition depends:

if slice_id is not None, but the entity of the subscription is not within generic metrics, should we just return the regular list of subscriptions?

So, should we just restrict the filter step to Subscriptions that satisfy these conditions: (1) slice_id is not None (2) the EntityKey for the Subscription is within generic metrics

Let's allow filtering to be called whenever slice_id is set. Within the filter method you can restrict it to specific EntityKeys.

Ok, that makes sense to me. Just to be clear, does this mean that if we have a slice_id enabled, but a non-generic metrics entity, we return all of the subscriptions (no filtering needed) or something else?

Yes. The filter can just return the original subscriptions list for a non-generic-metrics entity.
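
The behavior agreed on in this thread can be sketched as below. The `EntityKey` enum here is a three-member stand-in, and `Sub`, `SLICED_ENTITIES`, and the `org_to_slice` modulo mapping are hypothetical; only the control flow is the point.

```python
from enum import Enum
from typing import List, NamedTuple, Optional


class EntityKey(Enum):
    """Stand-in with only the members needed for the example."""

    EVENTS = "events"
    GENERIC_METRICS_SETS = "generic_metrics_sets"
    GENERIC_METRICS_DISTRIBUTIONS = "generic_metrics_distributions"


SLICED_ENTITIES = frozenset(
    {EntityKey.GENERIC_METRICS_SETS, EntityKey.GENERIC_METRICS_DISTRIBUTIONS}
)


class Sub(NamedTuple):
    identifier: str
    org_id: int


def org_to_slice(org_id: int) -> int:
    return org_id % 2  # hypothetical two-slice mapping


def filter_subscriptions(
    subscriptions: List[Sub], entity_key: EntityKey, slice_id: Optional[int]
) -> List[Sub]:
    # No slice id, or an entity that is not sliced: return the list untouched.
    if slice_id is None or entity_key not in SLICED_ENTITIES:
        return subscriptions
    return [s for s in subscriptions if org_to_slice(s.org_id) == slice_id]
```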

onewland

This isn't currently reading from the correct (slice-specific) commit-log topic.

Either:
https://github.com/getsentry/snuba/blob/master/snuba/subscriptions/scheduler_consumer.py#L260-L266

or
https://github.com/getsentry/snuba/blob/master/snuba/subscriptions/scheduler_consumer.py#L235-L237

needs to take slice_id into account
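
A rough sketch of the kind of lookup this implies, assuming the physical topic for a slice is resolved from a (logical topic, slice id) mapping. `SLICED_TOPIC_MAP`, `physical_topic_name`, and the topic names are made up for illustration.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical mapping from (logical topic name, slice id) to physical topic name.
SLICED_TOPIC_MAP: Mapping[Tuple[str, int], str] = {
    ("example-commit-log", 0): "example-commit-log-0",
    ("example-commit-log", 1): "example-commit-log-1",
}


def physical_topic_name(logical_name: str, slice_id: Optional[int] = None) -> str:
    """Resolve the topic a consumer should read for the given slice."""
    if slice_id is None:
        # Unsliced: the logical and physical names coincide.
        return logical_name
    # A missing entry raises KeyError rather than silently reading the wrong topic.
    return SLICED_TOPIC_MAP[(logical_name, slice_id)]
```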

onewland

Mostly good with this approach (and tested it locally) but let's make sure there's a unit test on the filtering before we commit to master

onewland · 2022-12-14T16:37:13Z

snuba/subscriptions/scheduler.py

+            self.__entity_key == EntityKey.GENERIC_METRICS_SETS
+            or self.__entity_key == EntityKey.GENERIC_METRICS_DISTRIBUTIONS


can we check if the storage set for the given entity key is sliced using the global logic rather than hardcoding these here?

the entity key is constant after initialization so we should only have to test for that once

makes sense, changed
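
A sketch of the idea, assuming slicing is configured per storage set and that the check is evaluated once at construction. `StorageSetKey`, `SLICED_STORAGE_SETS`, `is_storage_set_sliced`, and `SchedulerSketch` are simplified stand-ins written for this example, not the project's definitions.

```python
from enum import Enum
from typing import Mapping, Optional


class StorageSetKey(Enum):
    EVENTS = "events"
    GENERIC_METRICS_SETS = "generic_metrics_sets"


# Hypothetical configuration: storage set name -> number of slices.
SLICED_STORAGE_SETS: Mapping[str, int] = {"generic_metrics_sets": 2}


def is_storage_set_sliced(key: StorageSetKey) -> bool:
    return key.value in SLICED_STORAGE_SETS


class SchedulerSketch:
    def __init__(self, storage_set: StorageSetKey, slice_id: Optional[int] = None) -> None:
        # The storage set never changes after construction, so decide once here
        # instead of re-checking on every scheduling pass.
        self._filtering_enabled = slice_id is not None and is_storage_set_sliced(
            storage_set
        )

    @property
    def filtering_enabled(self) -> bool:
        return self._filtering_enabled
```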

onewland · 2022-12-14T16:37:20Z

snuba/subscriptions/scheduler.py

@@ -321,6 +328,36 @@ def __reset_builder(self) -> None:
            # We are transitioning between jittered and immediate mode. We must use the delegate builder.
            self.__builder = self.__delegate_builder

+    def __filter_subscriptions(self) -> List[Subscription]:
+


nit: extra line

onewland · 2022-12-14T16:37:25Z

snuba/subscriptions/scheduler.py

+            self.__entity_key == EntityKey.GENERIC_METRICS_SETS
+            or self.__entity_key == EntityKey.GENERIC_METRICS_DISTRIBUTIONS
+        ):
+


nit: extra line

onewland · 2022-12-14T16:38:39Z

snuba/subscriptions/scheduler.py

+                # get the metadata and org_id from the Subscription
+                sub_data = subscription.data
+                sub_metadata = sub_data.metadata
+                org_id = sub_metadata["organization"]


should we do something if organization is None? maybe we can emit a metric and skip the current subscription?

good point, done
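
A minimal sketch of "emit a metric and skip", assuming subscription metadata is a mapping that may lack an organization. `CountingMetrics`, `select_for_slice`, and the modulo slice mapping are hypothetical; the metric name follows the naming suggested later in this review.

```python
from typing import Dict, List, Mapping, Optional


class CountingMetrics:
    """Tiny in-memory metrics stand-in that counts increments by name."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def increment(self, name: str, value: int = 1) -> None:
        self.counts[name] = self.counts.get(name, 0) + value


def select_for_slice(
    metadata_list: List[Mapping[str, Optional[int]]],
    slice_id: int,
    metrics: CountingMetrics,
    num_slices: int = 2,
) -> List[Mapping[str, Optional[int]]]:
    kept = []
    for metadata in metadata_list:
        org_id = metadata.get("organization")
        if org_id is None:
            # Cannot map to a slice: record it and skip this subscription.
            metrics.increment("subscription_with_empty_org_id")
            continue
        if org_id % num_slices == slice_id:
            kept.append(metadata)
    return kept
```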

onewland · 2022-12-14T16:39:43Z

snuba/subscriptions/scheduler.py

@@ -321,6 +328,36 @@ def __reset_builder(self) -> None:
            # We are transitioning between jittered and immediate mode. We must use the delegate builder.
            self.__builder = self.__delegate_builder

+    def __filter_subscriptions(self) -> List[Subscription]:


I wonder if this should be split out of SubscriptionScheduler so that we can write unit tests for it. Maybe we could do something like

    # def filter(subscriptions, slice_id):
    #     ...
    SubscriptionFilter = Callable[[Sequence[Subscription], int], Sequence[Subscription]]

and there could be a filter argument to the constructor.

Even if we don't do this, I think one unit test should exist to make sure that the filtering works properly
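
The constructor-injection idea could look roughly like this. It is a sketch only: `Subscription` is reduced to a string, and `keep_all` and `SchedulerSketch` are hypothetical names. The point is that the filter becomes a plain function that can be unit tested without building a scheduler.

```python
from typing import Callable, Sequence

Subscription = str  # stand-in: the real type carries much more than a name

SubscriptionFilter = Callable[[Sequence[Subscription], int], Sequence[Subscription]]


def keep_all(subscriptions: Sequence[Subscription], slice_id: int) -> Sequence[Subscription]:
    """Default filter: no slicing, keep everything."""
    return subscriptions


class SchedulerSketch:
    def __init__(self, slice_id: int, subscription_filter: SubscriptionFilter = keep_all) -> None:
        self._slice_id = slice_id
        self._filter = subscription_filter

    def subscriptions_to_schedule(
        self, subscriptions: Sequence[Subscription]
    ) -> Sequence[Subscription]:
        # Delegate the decision to the injected filter.
        return self._filter(subscriptions, self._slice_id)
```

A test can then pass a small fake filter, or exercise the real filter function directly, without touching Kafka or the subscription store.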

nikhars · 2023-01-06T22:00:20Z

snuba/subscriptions/scheduler.py

+    entity_key: EntityKey,
+    metrics: MetricsBackend,
+    slice_id: Optional[int] = None,
+) -> List[Subscription]:


Suggested change

-) -> List[Subscription]:
+) -> MutableSequence[Subscription]:

FWIW, I think this type is copied from self.__subscriptions so it should probably be updated in both places or neither

I can change it back!

nikhars · 2023-01-06T22:02:30Z

snuba/subscriptions/scheduler.py

+                    if part_slice_id == slice_id:
+                        filtered_subscriptions.append(subscription)
+                else:
+                    metrics.increment("queries_with_orgID=None")


I don't know if datadog supports using = in tag names. But in general, you should avoid using it. You can rename this to something like subscription_with_empty_org_id

agree with this

onewland

Overall looks good except for the little things pointed out by Nikhar and me

onewland · 2023-01-06T22:10:42Z

tests/subscriptions/test_filter_subscriptions.py

+    importlib.reload(scheduler)
+
+    filtered_subs = filter_subscriptions(
+        subs, EntityKey.EVENTS, DummyMetricsBackend(strict=True), 2


A bit of a nit, but can we use named rather than positional arguments here? It's just not super obvious that 2 is the org_id

I think it's generally good practice to use named arguments if the count is greater than 3

makes sense! fixed
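
For illustration, Python can enforce named arguments with a bare `*` in the signature. The function below is a toy written for this example (subscriptions are reduced to org ids, and `SLICED_ENTITY_NAMES` and the modulo mapping are assumptions); it is not the signature merged in this PR.

```python
from typing import List, Optional, Sequence

SLICED_ENTITY_NAMES = frozenset({"generic_metrics_sets"})  # hypothetical


def filter_subscriptions(
    org_ids: Sequence[int],
    *,  # everything after this marker must be passed by name
    entity_key: str,
    slice_id: Optional[int] = None,
    num_slices: int = 2,
) -> List[int]:
    if slice_id is None or entity_key not in SLICED_ENTITY_NAMES:
        return list(org_ids)
    return [org_id for org_id in org_ids if org_id % num_slices == slice_id]


# The call site now states what each value means.
filtered = filter_subscriptions(
    [1, 2, 3, 4], entity_key="generic_metrics_sets", slice_id=0
)
```

Passing `entity_key` or `slice_id` positionally raises a `TypeError`, so an ambiguous call such as the one quoted above cannot be written by accident.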

onewland · 2023-01-06T22:12:58Z

tests/subscriptions/test_filter_subscriptions.py

+# create a list of subscriptions
+expected_subs = [build_subscription(timedelta(minutes=1), 2) for count in range(20)]
+extra_subs = [build_subscription(timedelta(minutes=3), 1) for count in range(10)]
+subs = expected_subs + extra_subs


Let's put these in the test method, or create fixture methods (https://docs.pytest.org/en/6.2.x/fixture.html#) to avoid future coupling

onewland · 2023-01-06T22:14:28Z

snuba/datasets/slicing.py

    """
-    Maps an org_id to a logical partition. Since SENTRY_LOGICAL_PARTITIONS is
+    Maps a partition key to a logical partition. Since SENTRY_LOGICAL_PARTITIONS is
    fixed, an org id will always be mapped to the same logical partition.


this doc comment and the method title still reference org_id, IMO we might just want to remove the rename, but I don't feel strongly

this is true. i can revert it and we can revisit later if needed

Riya Chakraborty added 5 commits November 2, 2022 10:50

Add the cli flag for slice_id

546cbea

Initial parameter updates

a7ae9cd

Add sample code for getting storage from entity in SubscriptionScheduler

3f006b1

Method with getting partitioning key's value from EntitySubscription

dfae23c

Clean up general logic

71462f4

Clean up changes

bc787a4

ayirr7 marked this pull request as ready for review November 4, 2022 22:17

ayirr7 requested a review from a team as a code owner November 4, 2022 22:17

ayirr7 changed the title ~~[wip]: Make Subscription Scheduler filter by slice ID~~ feat(slices): Make Subscription Scheduler filter by slice ID Nov 7, 2022

dbanda reviewed Nov 9, 2022

Riya Chakraborty added 4 commits November 21, 2022 15:18

Merge branch 'master' of github.com:getsentry/snuba into sliced-sched…

6e116a2

…uler

add sliced scheduler topic

0709cc8

fix import issue

a62ae4e

Move around function for getting filtered sub

4fd3070

ayirr7 requested review from nikhars and onewland November 21, 2022 22:21

make org_id mapping more general to partition_key

c1f0ed8

nikhars reviewed Nov 22, 2022

Riya Chakraborty added 2 commits December 6, 2022 11:12

Address PR comments

17fab78

Merge branch 'master' of github.com:getsentry/snuba into sliced-sched…

611b205

…uler

ayirr7 requested a review from nikhars December 6, 2022 19:58

Riya Chakraborty and others added 4 commits December 6, 2022 12:03

add metrics loggign

74ed52e

Merge branch 'master' of github.com:getsentry/snuba into sliced-sched…

a352221

…uler

Use metadata in subscription data

48d6feb

Add check for generic metrics entity keys

ae4cc71

nikhars reviewed Dec 8, 2022

ayirr7 added 2 commits December 13, 2022 10:54

Add back self.__subscriptions

220c209

move entity condition to filter subscriptions

bb39876

ayirr7 added 2 commits December 13, 2022 11:00

Merge branch 'master' of github.com:getsentry/snuba into sliced-sched…

9ae6775

…uler

storage key to storage set key

5c7b14e

onewland reviewed Dec 13, 2022

ayirr7 added 4 commits December 13, 2022 16:29

fix missing mappings of logical to physical commit log, scheduled tas…

aa4bd5e

…k topics

fix mypy issue

5eca91e

add slice_id into kafka consumer configuration builder

787f67e

Pass slice id into commit log producer config

b457cdb

onewland reviewed Dec 14, 2022

ayirr7 added 5 commits December 15, 2022 12:56

refactor filter subscription logic

0eb68eb

Merge branch 'master' of github.com:getsentry/snuba into sliced-sched…

165c0db

…uler

Add check and action for query with org_id=None

907e8dc

Add unit test scaffold for filtering subscriptions

5927f9d

Add unit test for filter subscriptions, proof of concept

6b6072a

ayirr7 requested a review from onewland January 6, 2023 18:45

ayirr7 added 3 commits January 6, 2023 13:01

Fix mypy issue

bfaca08

fix other mypy issue

f0b6335

fix mypy again

8e6d69b

nikhars reviewed Jan 6, 2023

onewland approved these changes Jan 6, 2023

ayirr7 added 3 commits January 9, 2023 11:16

change list to mutable sequence type

83d9064

add pytest fixture

61ddcb8

rename slicing parameters

25821d3

nikhars approved these changes Jan 9, 2023

ayirr7 merged commit 15d29cd into master Jan 9, 2023

ayirr7 deleted the sliced-scheduler branch January 9, 2023 20:46
