
feat(mql): MQL JOIN subquery generator #6101

Merged 19 commits into master from enocht/basic-metrics-subquery-generator on Jul 15, 2024

Conversation

enochtangg
Member

This PR contains the last piece needed to run multi-entity formula metric queries in Snuba. This work was first specced in #5717. The main change is the metrics subquery generator, which now supports "MQL-shaped" JOINs (see snuba/query/joins/metrics_subquery_generator.py). For example, the following query:

SELECT g(f(d0.a), f(d1.b)), t() as d0.time, t() as d1.time, d0.gb, d1.gb
FROM d0 INNER JOIN d1 ON ...
WHERE d0.m = 1 AND d1.m = 2
GROUP BY d0.time, d1.time, d0.gb, d1.gb
ORDER BY ...

becomes

SELECT g(d0._snuba_a, d1._snuba_b), d0._snuba_time, d1._snuba_time, d0._snuba_gb, d1._snuba_gb
FROM (
    SELECT f(a) as _snuba_a, t() as _snuba_time, gb as _snuba_gb
    FROM ...
    WHERE m = 1
    GROUP BY _snuba_time, _snuba_gb
) d0 INNER JOIN (
    SELECT f(b) as _snuba_b, t() as _snuba_time, gb as _snuba_gb
    FROM ...
    WHERE m = 2
    GROUP BY _snuba_time, _snuba_gb
) d1 ON ...
ORDER BY ...

See the docstring of generate_metrics_subqueries() for the reasoning behind processing subqueries this way.
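
For intuition, here is a minimal, hypothetical Python sketch of the push-down idea; the Subquery type and push_down function are illustrative simplifications, not the actual snuba API (which lives in snuba/query/joins/metrics_subquery_generator.py):

from dataclasses import dataclass, field

# Hypothetical, simplified AST types for illustration only.
@dataclass
class Subquery:
    alias: str                                            # e.g. "d0"
    select: dict[str, str] = field(default_factory=dict)  # output alias -> expression
    where: list[str] = field(default_factory=list)
    groupby: list[str] = field(default_factory=list)

def push_down(entity_exprs: dict[str, dict[str, str]],
              entity_filters: dict[str, list[str]],
              groupby_cols: dict[str, list[str]]) -> dict[str, Subquery]:
    """Build one subquery per join alias: each aggregate, filter, and
    group-by column is pushed into the subquery for its own entity and
    exported under a mangled _snuba_* alias, so the outer query can
    combine results without re-aggregating."""
    subqueries: dict[str, Subquery] = {}
    for alias, exprs in entity_exprs.items():
        sq = Subquery(alias=alias)
        for col, expr in exprs.items():
            sq.select[f"_snuba_{col}"] = expr
        sq.where = entity_filters.get(alias, [])
        for col in groupby_cols.get(alias, []):
            # Export the group-by column unless an expression already claims it.
            sq.select.setdefault(f"_snuba_{col}", col)
            sq.groupby.append(f"_snuba_{col}")
        subqueries[alias] = sq
    return subqueries

# Mirrors the example above: f(a) and f(b) are aggregated inside d0/d1,
# and only g(...) over the exported aliases remains in the outer query.
subqueries = push_down(
    entity_exprs={"d0": {"a": "f(a)", "time": "t()"},
                  "d1": {"b": "f(b)", "time": "t()"}},
    entity_filters={"d0": ["m = 1"], "d1": ["m = 2"]},
    groupby_cols={"d0": ["time", "gb"], "d1": ["time", "gb"]},
)
print(subqueries["d0"].select)
# {'_snuba_a': 'f(a)', '_snuba_time': 't()', '_snuba_gb': 'gb'}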

In addition to the subquery generator change, the indexer_mapping resolver was updated to resolve columns in the ON clause. Because GROUP BY columns get pushed onto the join condition, tag columns that end up in the join condition must also be resolved.
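
As a hedged illustration of that ON-clause resolution: the mapping dict, tag IDs, and resolve_on_clause helper below are hypothetical stand-ins for the actual indexer_mapping resolver.

import re

# Hypothetical indexer mapping: raw tag key -> integer ID, as produced by
# the metrics indexer (illustrative values, not real IDs).
INDEXER_MAPPING = {"transaction": 111111, "environment": 222222}

def resolve_on_clause(condition: str) -> str:
    """Rewrite tags[<key>] references in a join condition to their
    indexed form, e.g. tags[transaction] -> tags_raw[111111]. A tag
    group-by pushed onto the join condition would otherwise reach the
    storage layer unresolved."""
    def repl(match: re.Match) -> str:
        key = match.group(1)
        return f"tags_raw[{INDEXER_MAPPING[key]}]"
    return re.sub(r"tags\[(\w+)\]", repl, condition)

print(resolve_on_clause("d0.tags[transaction] = d1.tags[transaction]"))
# d0.tags_raw[111111] = d1.tags_raw[111111]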

Testing

When this PR is merged, nothing changes immediately; the rollout is controlled through runtime config. First, the run_new_mql_parser option samples a percentage of MQL queries and runs them through the new parser, which now converts formula queries into JOINs. Then, in the entity processing stage of the pipeline, the metrics subquery generator executes when it sees such a query. The rollout plan:

1. SET run_new_mql_parser = 0.01  # 1% of queries try the new pipeline (a sketch of the sampling check follows these steps)
2. Monitor in Sentry and Datadog, and keep increasing run_new_mql_parser until it reaches 100%.
3. Once the rollout is complete, remove the old parser and other dead code.
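
For illustration, a minimal sketch of how such a sampled rollout flag might be consumed; get_config here is a hypothetical stand-in modeled on snuba's runtime-config helpers, and the in-memory dict replaces the real config store.

import random

# Stand-in for a runtime-config lookup; snuba reads flags like this from
# its runtime config, but this dict and function name are illustrative.
_RUNTIME_CONFIG = {"run_new_mql_parser": 0.01}

def get_config(key: str, default: float = 0.0) -> float:
    return _RUNTIME_CONFIG.get(key, default)

def should_use_new_parser() -> bool:
    """Sample queries into the new MQL parser at the configured rate:
    0.01 routes ~1% of queries, 1.0 routes all of them."""
    return random.random() < get_config("run_new_mql_parser", 0.0)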

@enochtangg enochtangg requested a review from a team as a code owner July 12, 2024 15:01

codecov bot commented Jul 12, 2024

Test Failures Detected: Due to failing tests, we cannot provide coverage reports at this time.

❌ Failed Test Results:

Completed 1702 tests with 1 failed, 1699 passed and 2 skipped.


pytest

  • Class name: tests.query.parser.test_formula_mql_query
    Test name: test_formula_with_nested_functions

    Traceback (most recent call last):
    File ".../query/parser/test_formula_mql_query.py", line 946, in test_formula_with_nested_functions
    assert eq, reason
    AssertionError: get_from_clause:
    JoinClause(
        left_node=IndividualNode(alias='d1', data_source=Entity(key=EntityKey.GENERIC_METRICS_DISTRIBUTIONS, schema=ColumnSet([...]), sample=None)),
        right_node=IndividualNode(alias='d0', data_source=Entity(key=EntityKey.GENERIC_METRICS_DISTRIBUTIONS, schema=ColumnSet([...]), sample=None)),
        keys=[JoinCondition(left=JoinConditionExpression(table_alias='d1', column='d1.time'), right=JoinConditionExpression(table_alias='d0', column='d0.time'))],
        join_type=<JoinType.INNER: 'INNER'>, join_modifier=None)
    !=
    JoinClause(
        ... identical nodes and schema ...,
        keys=[JoinCondition(left=JoinConditionExpression(table_alias='d1', column='time'), right=JoinConditionExpression(table_alias='d0', column='time'))],
        join_type=<JoinType.INNER: 'INNER'>, join_modifier=None)
    (Four identical GENERIC_METRICS_DISTRIBUTIONS ColumnSet schemas elided; the only difference between the two clauses is the JoinCondition column names: 'd1.time'/'d0.time' vs 'time'/'time'.)
    assert False

@enochtangg enochtangg merged commit b3e81c3 into master Jul 15, 2024
29 checks passed
@enochtangg enochtangg deleted the enocht/basic-metrics-subquery-generator branch July 15, 2024 14:00