
SNS-2737: Making changes for a simple readthrough cache without queuing #5992

Merged: 15 commits merged into master from SNS-2737, Jun 10, 2024

Conversation

@nachivrn (Contributor) commented Jun 1, 2024

The read-through cache has complex logic to manage different states and conditions, queuing incoming queries to prevent duplicate executions. However, internal analysis shows that this queuing logic provides minimal benefit: concurrent searches with the same query ID are relatively infrequent compared to the total query volume. The advantage of the queuing logic therefore does not justify the reduced maintainability of the code.
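For context, here is a minimal sketch of the simplified design described above (the class and method names are illustrative, not snuba's actual cache interface): on a hit the cached value is returned; on a miss every concurrent caller executes the query itself instead of queuing behind the first caller's execution.

```python
from typing import Callable, Optional


class SimpleReadthroughCache:
    def __init__(self, client, ttl: int = 60) -> None:
        self.client = client  # e.g. a redis.Redis instance
        self.ttl = ttl

    def get_readthrough(self, key: str, function: Callable[[], bytes]) -> bytes:
        cached: Optional[bytes] = self.client.get(key)
        if cached is not None:
            return cached  # hit: no queuing, no waiters
        # Miss: every concurrent caller runs the query itself rather than
        # waiting on a queue for the first caller's result.
        value = function()
        self.client.set(key, value, ex=self.ttl)
        return value
```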

@nachivrn nachivrn marked this pull request as ready for review June 3, 2024 18:13
@nachivrn nachivrn requested a review from a team as a code owner June 3, 2024 18:13
@evanh (Member) left a comment:

Please provide a description of your changes in the PR, along with links to any relevant docs and the context/impact of the changes.

Also, this isn't a sufficient change on its own: you will also need to ensure that the query_id is being set correctly. When we send a query to ClickHouse, we have to include a query_id, and ClickHouse will reject queries with identical query_ids. Normally the query_id is a hash of the SQL string, and the readthrough cache ensures that duplicate query_ids won't get sent to ClickHouse. https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L307

However, with this change we might see duplicate query_ids in ClickHouse. So you will need to ensure that the query_id is overridden while still maintaining the cache keys.
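As an illustration of the collision described here (the hashing scheme below is an assumption; see db_query.py for the real implementation), a deterministic query_id derived from the SQL means two concurrent identical queries present the same id to ClickHouse:

```python
from hashlib import md5


def get_query_cache_key(formatted_sql: str) -> str:
    # Deterministic: identical SQL always yields the same key/query_id.
    return md5(formatted_sql.encode("utf-8")).hexdigest()


# Without the queuing readthrough cache serializing them, two concurrent
# requests for this SQL both reach ClickHouse with the same query_id, and
# ClickHouse rejects the second as a duplicate.
query_id = get_query_cache_key("SELECT count() FROM errors_local")
```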

Inline review comment: snuba/state/cache/redis/backend.py (outdated, resolved)
@nachivrn (Contributor, Author) commented Jun 3, 2024

@evanh Regarding the comment about query_id: it is set in db_query:
https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L387

When concurrent queries arrive with the same query_id and the simple read-through cache misses, all of them are directed to ClickHouse. One is executed while the others are rejected by ClickHouse. This is handled in db_query, which assigns a random query_id to the rejected queries and retries them:
https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L337
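A rough sketch of that fallback (the error type, the error matching, and the reader.execute signature are assumptions for illustration, not the exact db_query.py code):

```python
import uuid


class ClickhouseError(Exception):
    """Stand-in for snuba's real ClickHouse error type."""


def execute_with_fallback(reader, sql: str, query_id: str):
    try:
        # First attempt uses the deterministic query_id.
        return reader.execute(sql, settings={"query_id": query_id})
    except ClickhouseError as e:
        if "already running" not in str(e).lower():
            raise
        # A query with the same id is already running: retry with a random
        # id. This retry is the second round-trip discussed below.
        return reader.execute(sql, settings={"query_id": uuid.uuid4().hex})
```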

@onewland (Contributor) commented Jun 4, 2024

> @evanh Regarding the comment about query_id: it is set in db_query: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L387
>
> When concurrent queries arrive with the same query_id and the simple read-through cache misses, all of them are directed to ClickHouse. One is executed while the others are rejected by ClickHouse. This is handled in db_query, which assigns a random query_id to the rejected queries and retries them: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L337

Isn't this worse than randomizing the query ID before sending it to ClickHouse, though? It seems like there's a performance (and simplicity) penalty in the two round-trips.

@evanh (Member) commented Jun 4, 2024

> Isn't this worse than randomizing the query ID before sending it to ClickHouse, though? It seems like there's a performance (and simplicity) penalty in the two round-trips.

I agree with Oliver here. The correct flow should be: check the cache, and on a miss, randomize the query ID and send the query to ClickHouse.
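A sketch of that flow (helper names and the reader interface are illustrative, not the merged code): the cache key stays deterministic so identical queries share one cache entry, while the query_id sent to ClickHouse is randomized up front, avoiding the rejection-and-retry round-trip.

```python
import uuid
from hashlib import md5


def cached_query(cache, reader, sql: str) -> bytes:
    # Deterministic cache key: identical SQL shares one cache entry.
    key = md5(sql.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    # On a miss, give ClickHouse a unique query_id so concurrent duplicates
    # are not rejected -- one round-trip instead of two.
    result = reader.execute(sql, settings={"query_id": uuid.uuid4().hex})
    cache.set(key, result)
    return result
```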

@nachivrn (Contributor, Author) commented Jun 5, 2024

@evanh @onewland Thanks for your review comments. The focus of the Jira task was on simplifying the read-through cache by removing the queuing mechanism and adopting a straightforward approach, acknowledging that this introduces a performance penalty due to additional round-trips. The performance issue is recognized; it only affects a small percentage of requests and can be addressed separately in a different PR. We will explore optimizations and monitor the system's performance after merging this PR.

@volokluev (Member) left a comment:

You should have a test that exercises this cache functionality at least one level higher, e.g. in test_db_query.

Inline review comments: snuba/state/cache/redis/backend.py (outdated, resolved); tests/state/test_cache.py (resolved)
@nachivrn nachivrn requested a review from a team June 6, 2024 22:20
@getsentry getsentry deleted a comment from codecov bot Jun 7, 2024
@nachivrn nachivrn requested review from volokluev and evanh June 7, 2024 00:29
@evanh (Member) left a comment:

Couple of comments, otherwise looks good.

Inline review comments (outdated, resolved): snuba/state/cache/redis/backend.py (×2), snuba/web/db_query.py, tests/web/test_db_query.py (×4)
Nachiappan Veerappan Nachiappan and others added 11 commits June 7, 2024 15:40. Excerpts from the commit messages (leading lines are truncated in this view):

- "… incremental rollout. This flag should be set to a value between 0 and 1, where 0 means use the read-through cache with the queuing design and 1 means use the simple read-through cache."
- "… one level higher, in test_db_query. Additionally adding tests for function errors in test_cache."
- "When a concurrent request for the same query id is received, the ClickHouse query settings query id will be randomized before directing the request to ClickHouse. In the read-through cache we will use the computed query id returned from get_query_cache_key as the cache key."
- "… a synchronous call. Additionally, also checking if sample_rate is not set to None."
- "… changing the stats with cache_hit_simple."
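The rollout flag from the first commit message could be sampled along these lines (a sketch; the function name and wiring are assumptions, not the merged code):

```python
import random


def use_simple_readthrough(rollout_fraction: float) -> bool:
    # 0.0 -> always the queuing design; 1.0 -> always the simple cache.
    return random.random() < rollout_fraction


# e.g. route 25% of requests through the simple read-through cache:
# cache = simple_cache if use_simple_readthrough(0.25) else queuing_cache
```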
codecov bot commented Jun 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@ada4804).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #5992   +/-   ##
=========================================
  Coverage          ?   92.08%           
=========================================
  Files             ?      896           
  Lines             ?    42854           
  Branches          ?        0           
=========================================
  Hits              ?    39460           
  Misses            ?     3394           
  Partials          ?        0           


@nachivrn nachivrn merged commit c173695 into master Jun 10, 2024
30 checks passed
@nachivrn nachivrn deleted the SNS-2737 branch June 10, 2024 18:43
sentry-io bot commented Jun 11, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ClickhouseError: DB::Exception: Missing columns: 'exception_main_thread' while processing query: 'SELECT identity(... snql_dataset_query_view__discover__api.auth-tok...
  • ‼️ ClickhouseError: DB::Exception: There's no column 'events._snuba_gen_8' in table 'events': While processing events... snql_dataset_query_view__events__api.organizati...
  • ‼️ ClickhouseError: DB::Exception: Not found column countIf(or(equals(op, 'cc.check'), equals(op, 'cc.load'))) in blo... snql_dataset_query_view__spans__api.trace-explo...
  • ‼️ ClickhouseError: DB::Exception: Received from snuba-transactions-tiger-mz-1-3:9000. DB::Exception: Too many simult... snql_dataset_query_view__discover__api.auth-tok...
  • ‼️ ClickhouseError: DB::Exception: Attempt to read after eof: while receiving packet from snuba-transactions-tiger-mz... snql_dataset_query_view__discover__api.trace-vi...

