Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce uniq in HAVING and SELECT #2339

Merged
merged 16 commits into from
Jan 14, 2022
Merged

Conversation

volokluev
Copy link
Member

Context

This is a hacky query processor put in for the purpose of
mitigating clickhouse crashes. When a query is sent to clickhouse
which has a uniq in the HAVING clause but not in the SELECT
clause, the query node will crash. This may be fixed with future versions
of clickhouse but we have not upgraded yet. This exists as a check so that we
return an exception to the user but don't crash the clickhouse query node

Solution

  • add query processor to catch this case, make it throw an exception
  • Add query processor to transaction storage

Feedback desired

  • General approach, is it acceptable?
  • Any other cases I should test for?
  • What's a good way to dry run it? My thought is to send the error to sentry when it happens, hide this behind a runtime config

@volokluev volokluev requested a review from a team as a code owner January 12, 2022 00:40
@codecov-commenter
Copy link

codecov-commenter commented Jan 12, 2022

Codecov Report

Merging #2339 (6629ab9) into master (6f92898) will decrease coverage by 0.00%.
The diff coverage is 92.10%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2339      +/-   ##
==========================================
- Coverage   92.97%   92.96%   -0.01%     
==========================================
  Files         563      565       +2     
  Lines       25932    26038     +106     
==========================================
+ Hits        24109    24206      +97     
- Misses       1823     1832       +9     
Impacted Files Coverage Δ
snuba/query/expressions.py 91.45% <68.42%> (-2.44%) ⬇️
...nuba/query/processors/uniq_in_select_and_having.py 95.31% <95.31%> (ø)
snuba/datasets/storages/transactions.py 100.00% <100.00%> (ø)
tests/datasets/test_context_promotion.py 100.00% <100.00%> (ø)
tests/datasets/test_tags_hashmap.py 100.00% <100.00%> (ø)
...query/processors/test_uniq_in_select_and_having.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f92898...6629ab9. Read the comment docs.

Copy link
Contributor

@fpacifici fpacifici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach is fine.
please see my comment inline around alias expansion.

for col in selected_columns:
col.expression.accept(matcher)
if not all(matcher.found_expressions):
should_throw = get_config("throw_on_uniq_select_and_having", False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the dry run, I would simply log a warning in Sentry. We should not have too many of these queries to begin with, but you can do a quick estimate of how many queries have a uniq in the HAVING clause from the querylog to ensure they are not hundreds per second.

for i, exp_to_match in enumerate(self.expressions_to_match):
if (
exp_to_match.function_name == exp.function_name
and _expressions_equal_ignore_aliases(exp_to_match, exp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the reason why you were looking for a way to generate a matcher from an expression ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have responded to your question in Slack, but you can also programmatically build matchers. I don't necessarily think that is the right solution here, but just FYI.

for exp in expressions_to_match:
    matcher = FunctionCallMatch(String(exp.function_name))

snuba/query/processors/uniq_in_select_and_having.py Outdated Show resolved Hide resolved
Copy link
Member

@evanh evanh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this was happening only on transactions, but presumably it could also happen on errors, right?

Also, can you clarify how this is intended to work with aliases? If there is an alias of the uniq expression in the select, is that sufficient?

snuba/query/processors/uniq_in_select_and_having.py Outdated Show resolved Hide resolved
for i, exp_to_match in enumerate(self.expressions_to_match):
if (
exp_to_match.function_name == exp.function_name
and _expressions_equal_ignore_aliases(exp_to_match, exp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have responded to your question in Slack, but you can also programmatically build matchers. I don't necessarily think that is the right solution here, but just FYI.

for exp in expressions_to_match:
    matcher = FunctionCallMatch(String(exp.function_name))

@volokluev
Copy link
Member Author

@evanh

I know that this was happening only on transactions, but presumably it could also happen on errors, right?

Good point will add it

Also, can you clarify how this is intended to work with aliases? If there is an alias of the uniq expression in the select, is that sufficient?

It is sufficient for clickhouse yes. This will not crash it for example:

SELECT 
    (project_id AS _snuba_project_id), transaction_name,
    _snuba_count_unique_user -- <----- only the alias will work.
FROM transactions_dist_PROD 
WHERE 
    greaterOrEquals((finish_ts AS _snuba_finish_ts), toDateTime('2021-12-18T19:20:14', 'Universal')) AND 
    less(_snuba_finish_ts, toDateTime('2021-12-29T21:55:02', 'Universal')) AND 
    in((project_id AS _snuba_project_id), [173225]) 
GROUP BY transaction_name, _snuba_project_id 
HAVING 
    less((divide(countIf(notEquals(transaction_status, 0) AND
    notEquals(transaction_status, 1) AND notEquals(transaction_status, 2)), count()) AS _snuba_failure_rate), 1.0) AND 
    greater((uniq(user) AS _snuba_count_unique_user), 1.0)
ORDER BY _snuba_count_unique_user DESC LIMIT 3 OFFSET 0

I originally even had this as a case that was handled but @fpacifici requested I remove it because the parser expands all aliases

#2339 (comment)

(
Literal(None, datetime(2020, 4, 20, 16, 20)),
"datetime(2020-04-20T16:20:00)",
TEST_CASES = [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cases have stayed the same, I just pulled them out into a list

@volokluev
Copy link
Member Author

@fpacifici @evanh I've added functional equivalency as a function on Expression. I think this is the easier thing to do than generating matchers from expressions. please take a look when you can

Copy link
Member

@evanh evanh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine, but I wonder if the logic of function_eq should be the default behaviour. I think generally speaking when we do equality the alias shouldn't be part of that.

return False
if len(self.parameters) != len(other.parameters):
return False
for i, param in enumerate(self.parameters):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to change this, but you could shave one line off this by doing

for other, param in zip(self.parameters, other.parameters):

@volokluev volokluev merged commit 7fe243a into master Jan 14, 2022
@volokluev volokluev deleted the vlad/uniq_in_select_and_having branch January 14, 2022 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants