@KKould KKould commented Mar 18, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This rule implements the subquery optimization described in the WeTune paper: https://ipads.se.sjtu.edu.cn:1312/opensource/wetune/-/blob/main/wtune_data/issues/issues?ref_type=heads#L8

For an IN subquery, it removes the redundant scan of the left (outer) table. In the example below, the plan scans topics only once (as t2) instead of scanning it again for the outer query.

explain
SELECT
  COUNT(topics.id)
FROM
  topics
WHERE
  id IN (
    SELECT
      topic_id
    FROM
      posts AS p
      INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE
      p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
  )

-[ EXPLAIN ]-----------------------------------
AggregateFinal
├── output columns: [COUNT(topics.id) (#45)]
├── group by: []
├── aggregate functions: [count()]
├── estimated rows: 1.00
└── AggregatePartial
    ├── group by: []
    ├── aggregate functions: [count()]
    ├── estimated rows: 1.00
    └── AggregateFinal
        ├── output columns: [p.topic_id (#48)]
        ├── group by: [topic_id]
        ├── aggregate functions: []
        ├── estimated rows: 0.00
        └── AggregatePartial
            ├── group by: [topic_id]
            ├── aggregate functions: []
            ├── estimated rows: 0.00
            └── HashJoin
                ├── output columns: [p.topic_id (#48)]
                ├── join type: INNER
                ├── build keys: [t2.id (#97)]
                ├── probe keys: [p.topic_id (#48)]
                ├── keys is null equal: [false]
                ├── filters: [t2.user_id (#104) <> p.user_id (#47)]
                ├── estimated rows: 0.00
                ├── TableScan(Build)
                │   ├── table: default.public.topics
                │   ├── output columns: [id (#97), user_id (#104)]
                │   ├── read rows: 0
                │   ├── read size: 0
                │   ├── partitions total: 0
                │   ├── partitions scanned: 0
                │   ├── push downs: [filters: [], limit: NONE]
                │   └── estimated rows: 0.00
                └── Filter(Probe)
                    ├── output columns: [p.user_id (#47), p.topic_id (#48)]
                    ├── filters: [is_true(p.user_id (#47) = 9627), NOT is_not_null(p.deleted_at (#57))]
                    ├── estimated rows: 0.00
                    └── TableScan
                        ├── table: default.public.posts
                        ├── output columns: [user_id (#47), topic_id (#48), deleted_at (#57)]
                        ├── read rows: 0
                        ├── read size: 0
                        ├── partitions total: 0
                        ├── partitions scanned: 0
                        ├── push downs: [filters: [and_filters(posts.user_id (#47) = 9627, NOT is_not_null(posts.deleted_at (#57)))], limit: NONE]
                        └── estimated rows: 0.00
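
The plan above corresponds roughly to counting the distinct topic_id values produced by the join, without scanning topics a second time. The following is a minimal sketch of that equivalence, not the optimizer code from this PR. It uses SQLite through Python's sqlite3 module as a stand-in for Databend; the schema, the sample rows, and the assumption that topics.id is a unique key are illustrative and not taken from the PR.

```python
import sqlite3

# SQLite stands in for Databend; schema and rows are hypothetical test data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        topic_id INTEGER,
        deleted_at TEXT
    );
    INSERT INTO topics VALUES (1, 100), (2, 9627), (3, 200), (4, NULL);
    INSERT INTO posts VALUES
        (1, 9627, 1, NULL),          -- qualifies (topic 1)
        (2, 9627, 1, NULL),          -- second qualifying post on topic 1
        (3, 9627, 2, NULL),          -- topic owner equals post author
        (4, 9627, 3, '2025-01-01'),  -- deleted post
        (5, 1,    3, NULL),          -- different post author
        (6, 9627, 4, NULL),          -- topic owner is NULL
        (7, 9627, 99, NULL),         -- no matching topic
        (8, 9627, 3, NULL);          -- qualifies (topic 3)
""")

# The query from the PR description (without EXPLAIN).
ORIGINAL_SQL = """
SELECT COUNT(topics.id)
FROM topics
WHERE id IN (
    SELECT topic_id
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
)
"""

# Hand-written equivalent of the optimized plan: group the join output by
# topic_id, then count the groups. Valid only if topics.id is unique.
REWRITTEN_SQL = """
SELECT COUNT(topic_id)
FROM (
    SELECT p.topic_id AS topic_id
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
    GROUP BY p.topic_id
)
"""

original = conn.execute(ORIGINAL_SQL).fetchone()[0]
rewritten = conn.execute(REWRITTEN_SQL).fetchone()[0]
print(original, rewritten)
```

The GROUP BY step matters: several posts can point at the same topic, so counting the raw join output would over-count compared with the IN semantics.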

Two issues with this optimization are still unresolved:

  1. When the column required by the downstream node is not the child expression of the subquery, a schema mapping error occurs. For example:
explain SELECT COUNT(topics.user_id)
FROM topics
WHERE id IN (SELECT topic_id
             FROM posts AS p
                      INNER JOIN topics AS t2 ON t2.id = p.topic_id
             WHERE p.deleted_at IS NULL
               AND t2.user_id <> p.user_id
               AND p.user_id = 9627);
  2. Cases where the left side contains multiple tables, or involves more complex constructs such as LIMIT, are not handled yet.
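
For the first issue, one possible way to think about a correct mapping is that the rewritten plan has to carry the needed outer column (here user_id) out of the inner topics scan, since the subquery itself only outputs topic_id. The sketch below shows that idea in plain SQL, run on SQLite via Python's sqlite3. It is an assumption about what an equivalent query looks like, not the fix for this PR; the schema and rows are hypothetical, and it relies on topics.id being unique so that user_id is determined by id.

```python
import sqlite3

# SQLite stands in for Databend; schema and rows are hypothetical test data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        topic_id INTEGER,
        deleted_at TEXT
    );
    INSERT INTO topics VALUES (1, 100), (2, 9627), (3, 200);
    INSERT INTO posts VALUES
        (1, 9627, 1, NULL),
        (2, 9627, 1, NULL),
        (3, 9627, 2, NULL),
        (4, 9627, 3, NULL);
""")

# The failing query from issue 1: the aggregate needs topics.user_id,
# which is not the column the subquery returns.
ORIGINAL_SQL = """
SELECT COUNT(topics.user_id)
FROM topics
WHERE id IN (
    SELECT topic_id
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
)
"""

# Hypothetical equivalent: project t2.user_id alongside the key and
# deduplicate on the key before aggregating.
REWRITTEN_SQL = """
SELECT COUNT(user_id)
FROM (
    SELECT t2.id AS id, t2.user_id AS user_id
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
    GROUP BY t2.id, t2.user_id
)
"""

original_count = conn.execute(ORIGINAL_SQL).fetchone()[0]
rewritten_count = conn.execute(REWRITTEN_SQL).fetchone()[0]
print(original_count, rewritten_count)
```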

Here are the table-creation statements and related SQL to help you test (please manually rename the .txt file to .sql):
subquery.txt

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@KKould KKould marked this pull request as draft March 18, 2025 02:40
github-actions bot commented Mar 18, 2025

At least one test kind must be checked in the PR description.
@KKould please update it 🙏.

@github-actions github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Mar 18, 2025
@KKould KKould closed this Apr 1, 2025