Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(DC): Add Storage selector support for Entity configuration #3545

Merged
merged 11 commits into from
Jan 9, 2023

Conversation

enochtangg
Copy link
Member

Overview

This PR is responsible for adding capabilities for defining storage selectors in entity configuration. Now that query plan builders have been consolidated, we can now support custom or default storage selectors defined on an entity. For context, right now only Errors and Sessions have custom storage selectors based on a query. All other entities simply default to single storage regardless of the query.

What has changed (before and after)?

  • All pluggable entities now explicitly require a storage selector even if there is only one storage specified.
  • DefaultQueryStorageSelector has been created and added to all entities (python defined ones too). This is used to make storage selection explicit.
  • A new snuba/datasets/storages/selectors/ directory was created to house all storage selectors. The custom sessions and errors selectors were also moved here.

Blast Radius

This change is being applied to all entities. Entities must now specify a storage selector (default is provided).

Testing

  • Existing query plan builder tests were updated to use default storage selectors.
  • Entity tests are passing as expected with the selector.
  • A new isolated test for default storage selector was added.

@enochtangg enochtangg requested a review from a team as a code owner December 19, 2022 21:55
@enochtangg enochtangg force-pushed the enoch/add_storage_selector_to_config branch from 4e82b8e to 1aa4071 Compare December 19, 2022 21:59
@enochtangg enochtangg force-pushed the enoch/add_storage_selector_to_config branch from 1aa4071 to 9b7e9d9 Compare December 19, 2022 22:02
@codecov-commenter
Copy link

codecov-commenter commented Dec 19, 2022

Codecov Report

Base: 92.19% // Head: 92.47% // Increases project coverage by +0.27% 🎉

Coverage data is based on head (0e9c8d0) compared to base (c92c14b).
Patch coverage: 97.04% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3545      +/-   ##
==========================================
+ Coverage   92.19%   92.47%   +0.27%     
==========================================
  Files         726      732       +6     
  Lines       33790    33905     +115     
==========================================
+ Hits        31152    31352     +200     
+ Misses       2638     2553      -85     
Impacted Files Coverage Δ
snuba/datasets/configuration/json_schema.py 100.00% <ø> (ø)
snuba/datasets/storage.py 94.02% <ø> (+0.69%) ⬆️
snuba/datasets/pluggable_entity.py 93.93% <75.00%> (-1.23%) ⬇️
...ba/datasets/entities/storage_selectors/sessions.py 92.59% <92.59%> (ø)
...nuba/datasets/entities/storage_selectors/errors.py 93.33% <93.33%> (ø)
...ba/datasets/entities/storage_selectors/selector.py 96.15% <96.15%> (ø)
snuba/datasets/cdc/groupassignee_entity.py 100.00% <100.00%> (ø)
snuba/datasets/cdc/groupedmessage_entity.py 100.00% <100.00%> (ø)
snuba/datasets/configuration/entity_builder.py 100.00% <100.00%> (ø)
snuba/datasets/entities/discover.py 100.00% <100.00%> (ø)
... and 24 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

storage_and_mappers_list: Sequence[StorageAndMappers],
) -> StorageAndMappers:
for sm_tuple in storage_and_mappers_list:
if storage == sm_tuple.storage:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check feels a bit brittle. The problem is that we don't have a proper equality check for storages, so what you are actually checking here is that the instance of the storages is the same (Python compares the memory address of the variables and sees if they are the same).

In theory that should be fine, since we only instantiate each storage once, but it would be better to compare the storage key I think. Storages are aware of their own storage key. storage.storage_key == ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, updated to compare using storage key instead.

def get_storage_mapping_pair(
self,
storage: ReadableStorage,
storage_and_mappers_list: Sequence[StorageAndMappers],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
storage_and_mappers_list: Sequence[StorageAndMappers],
storage_and_mappers: Sequence[StorageAndMappers],

"""

def __init__(self, storage: str) -> None:
self.storage = get_storage(StorageKey(storage))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could probably just store the StorageKey here. Then in the select_storage function you could compare the storage key directly: if sm.storage.storage_key == self.storage_key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

from snuba.query.query_settings import HTTPQuerySettings
from snuba.query.snql.parser import parse_snql_query

TEST_CASES = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a couple test cases for the errors and sessions ones as well.

@volokluev
Copy link
Member

I have some questions about the design choices made here:

All pluggable entities now explicitly require a storage selector even if there is only one storage specified.

Why not have the default be implicit and have the entity fail to instantiate if the deafult is not sufficient? You can accomplish this with the __post_init__ method of the dataclass. Seems like one fewer config option for users to think about.

A new snuba/datasets/storages/selectors/ directory was created to house all storage selectors. The custom sessions and errors selectors were also moved here.

Aren't storage selectors part of entities? Wouldn't it make more sense to put it under there?

Copy link
Member

@volokluev volokluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the UX could be improved here but you've been in the guts of the code more so I look forward to your perspective on it

Comment on lines 72 to 77
for sm in storage_and_mappers_list:
if sm.storage == self.storage:
return sm
raise QueryStorageSelectorError(
"Unable to select storage. Storage not defined."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function, the storage_and_mappers_list is only supposed to have one element. I don't think you need to be doing any comparisons here just

assert len(storage_and_mappers_list) == 1
return storage_and_mappers_list[0]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the __init__ function either in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

Comment on lines 13 to 15
def __init__(self) -> None:
self.__errors_table = get_writable_storage(StorageKey.ERRORS)
self.__errors_ro_table = get_storage(StorageKey.ERRORS_RO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is duplicating knowledge that is already in the configuration and then using get_storage_mapping_pair to do this reverse lookup, it's not very intuitive. I think there are more egonomic ways to do this.

For example, we can provide the guarantee that storage_and_mappers_list is always going to be in the order that it is defined in configuration, that way you can do something like:

   def select_storage(
        self,
        query: Query,
        query_settings: QuerySettings,
        storage_and_mappers_list: List[StorageAndMappers],
    ) -> StorageAndMappers:
        use_readonly_storage = (
            state.get_config("enable_events_readonly_table", False)
            and not query_settings.get_consistent()
        )
        errors_ro, errors = storage_and_mappers_list
        if use_readonly_storage:
            return errors_ro
        return errors

That way you don't need to be worrying about implementing a good reverse mapping function and the coherence of the configuration is preserved

Copy link
Member Author

@enochtangg enochtangg Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is duplicating knowledge that is already in configuration, but it's also the way we do things today (selectors define storages in its constructor). However, I'm also not a fan of the reverse lookup.

Guaranteeing order is an interesting approach (e.g. sorting storages by some condition). Is what you're suggesting, read-only storages should always be defined first in the array, followed by writable storages? What if there are multiple readable storages? How should they be ordered?

To summarize the problem, it is the job of select_storage() to determine which storage (defined on entity) to use based on input parameters. If the selection choice is always binary (readable or writable), it could probably just check the instance type. However, what if the choice isn't always between readable and writable? Is this something we even need to design for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to remove the reverse lookup thing and to check instance type instead. This solution would not work for multiple readable storages though. But perhaps we don't need to design for that yet.

@enochtangg
Copy link
Member Author

enochtangg commented Jan 3, 2023

@volokluev

Why not have the default be implicit and have the entity fail to instantiate if the deafult is not sufficient? You can accomplish this with the __post_init__ method of the dataclass. Seems like one fewer config option for users to think about.

I'm okay with making this default storage selector implicit too. My initial thought was since storages defined in the entity will eventually be formatted as an array (in a subsequent PR), implicit selection from an array might not be too obvious. I'm fine with either though.

Aren't storage selectors part of entities? Wouldn't it make more sense to put it under there?

You're right, I will move this directory over.

@@ -310,7 +265,8 @@ def __init__(self) -> None:
storages=[storage],
query_pipeline_builder=SimplePipelineBuilder(
query_plan_builder=StorageQueryPlanBuilder(
storages=[StorageAndMappers(storage, TranslationMappers())]
storages=[StorageAndMappers(storage, TranslationMappers())],
selector=DefaultQueryStorageSelector(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible this is using the DefaultQueryStorageSelector? Shouldn't it be using the sessions selector?

Copy link
Member Author

@enochtangg enochtangg Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORG_SESSIONS only uses the org materialized sessions storage so the default is possible. SESSIONS on the other hand, requires the storage selector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it. OK cool.

@enochtangg enochtangg merged commit 1d701ee into master Jan 9, 2023
@enochtangg enochtangg deleted the enoch/add_storage_selector_to_config branch January 9, 2023 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants