
Apply guarantee rewriter to sql workflow #10456

Open
jayzhan211 opened this issue May 11, 2024 · 17 comments
Labels
enhancement New feature or request

Comments

jayzhan211 (Contributor) commented May 11, 2024

Is your feature request related to a problem or challenge?

While deprecating Expr::GetIndexedField, I found there are many test cases that are not covered in sqllogictest, for example, test_inequalities_non_null_bounded. Since we hope to replace the field API with get_field, we could either move the tests to datafusion/core/tests or sqllogictest. I prefer the latter. Then I found that the guarantee rewriter is not applied to the SQL workflow.

statement ok
create table t (c int) as values (1), (3), (5);

query TT
explain select struct(c) from t where c between 3 and 1;
----
logical_plan
01)Projection: struct(t.c)
02)--Filter: t.c >= Int32(3) AND t.c <= Int32(1)
03)----TableScan: t projection=[c]
physical_plan
01)ProjectionExec: expr=[struct(c@0) as struct(t.c)]
02)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
03)----CoalesceBatchesExec: target_batch_size=8192
04)------FilterExec: c@0 >= 3 AND c@0 <= 1
05)--------MemoryExec: partitions=1, partition_sizes=[1]

statement ok
drop table t;

I expect that FilterExec should be removed or converted to something like False, since the condition here is always false.

Describe the solution you'd like

Apply guarantee_rewriter to the SQL workflow.
Including the simplification logic in the Simplifier would be a plus.
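
For reference, #7467 exposed the rewriter through ExprSimplifier::with_guarantees, so the missing piece is mainly that nothing in the SQL planning path supplies guarantees. Below is a minimal sketch of that API, assuming a hypothetical guarantee c ∈ [1, 5] derived by hand from the example table's values; module paths and exact signatures may differ between DataFusion versions.

// Sketch only: illustrates ExprSimplifier::with_guarantees from #7467.
// The [1, 5] guarantee is hypothetical (derived by hand from the example
// table), and module paths may differ between DataFusion versions.
use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::{Result, ScalarValue, ToDFSchema};
use datafusion_expr::execution_props::ExecutionProps;
use datafusion_expr::interval_arithmetic::{Interval, NullableInterval};
use datafusion_expr::simplify::SimplifyContext;
use datafusion_expr::{col, lit};
use datafusion_optimizer::simplify_expressions::ExprSimplifier;

fn main() -> Result<()> {
    // Schema of the reproducer table: create table t (c int)
    let schema = Schema::new(vec![Field::new("c", DataType::Int32, true)])
        .to_dfschema_ref()?;

    let props = ExecutionProps::new();
    let context = SimplifyContext::new(&props).with_schema(schema);

    // Guarantee: c is non-null and lies in [1, 5]
    // (the values (1), (3), (5) from the example).
    let guarantees = vec![(
        col("c"),
        NullableInterval::NotNull {
            values: Interval::try_new(
                ScalarValue::Int32(Some(1)),
                ScalarValue::Int32(Some(5)),
            )?,
        },
    )];

    let simplifier = ExprSimplifier::new(context).with_guarantees(guarantees);

    // A predicate the guarantee rewriter can fold: c > 5 can never be
    // true when c ∈ [1, 5], so it simplifies to a false literal.
    let simplified = simplifier.simplify(col("c").gt(lit(5i32)))?;
    assert_eq!(simplified, lit(false));
    Ok(())
}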

Describe alternatives you've considered

No response

Additional context

PR that introduced the guarantee rewriter: #7467

@jayzhan211 jayzhan211 added the enhancement New feature or request label May 11, 2024
yyy1000 (Contributor) commented May 12, 2024

Update: #10463 is not what this issue expects; I will see what to do next.
I can help with it. :) An experimental PR is #10463.

yyy1000 (Contributor) commented May 12, 2024

On second glance, I feel it's difficult. 😥
When simplifying a LogicalPlan, it seems impossible to get the underlying data that could provide guarantees.

alamb (Contributor) commented May 12, 2024

I think this may be another example of what @samuelcolvin was suggesting on #10400

I think we could use ExecutionPlan::statistics to get the guarantee information

dmitrybugakov (Contributor) commented

@jayzhan211
Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

jayzhan211 (Contributor, Author) commented

@jayzhan211 Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

I think so.

alamb (Contributor) commented May 14, 2024

old version of the guarantee?

What does "old version of the guarantee?" refer to?

dmitrybugakov (Contributor) commented

What does "old version of the guarantee?" refer to?

impl<'a> TreeNodeRewriter for GuaranteeRewriter<'a> {

alamb (Contributor) commented May 14, 2024

As I understand it, the use case for GuaranteeRewriter when @wjones127 (maybe?) added it was to provide external information (outside of information that came from SQL). I don't think we should just remove the ability to do so.

I would personally recommend adding code that translates Statistics into Guarantees to pass to GuaranteeRewriter.

We could discuss reworking how GuaranteeRewriter works as a follow-on PR.
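
A rough sketch of that translation, assuming the Precision-based Statistics/ColumnStatistics structs and a hypothetical helper name guarantees_from_statistics (not an existing DataFusion function); field names and module paths may differ between versions:

// Sketch only: turns per-column min/max statistics into the
// (Expr, NullableInterval) guarantees the simplifier accepts.
// guarantees_from_statistics is a hypothetical name, and the struct and
// field names follow the Precision-based Statistics API, which may differ
// between DataFusion versions.
use arrow::datatypes::Schema;
use datafusion_common::stats::Precision;
use datafusion_common::{Result, Statistics};
use datafusion_expr::interval_arithmetic::{Interval, NullableInterval};
use datafusion_expr::{col, Expr};

fn guarantees_from_statistics(
    schema: &Schema,
    stats: &Statistics,
) -> Result<Vec<(Expr, NullableInterval)>> {
    let mut guarantees = vec![];
    for (field, col_stats) in schema.fields().iter().zip(&stats.column_statistics) {
        // Only exact bounds can become a guarantee; inexact or absent
        // statistics are skipped to stay conservative.
        let (Precision::Exact(min), Precision::Exact(max)) =
            (col_stats.min_value.clone(), col_stats.max_value.clone())
        else {
            continue;
        };
        let values = Interval::try_new(min, max)?;
        let interval = match &col_stats.null_count {
            // No nulls at all: every value of the column lies in [min, max].
            Precision::Exact(0) => NullableInterval::NotNull { values },
            // Nulls may be present (or the null count is unknown).
            _ => NullableInterval::MaybeNull { values },
        };
        guarantees.push((col(field.name().as_str()), interval));
    }
    Ok(guarantees)
}

The resulting vector could then be handed to ExprSimplifier::with_guarantees; the open question in this thread is where to obtain the Statistics, since they are attached to the ExecutionPlan rather than the LogicalPlan.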

dmitrybugakov added a commit to dmitrybugakov/datafusion that referenced this issue May 14, 2024
dmitrybugakov added a commit to dmitrybugakov/datafusion that referenced this issue May 15, 2024
jayzhan211 (Contributor, Author) commented

@dmitrybugakov are you working on #10510?

jayzhan211 (Contributor, Author) commented

@alamb Is it reasonable to evaluate columns in ConstEvaluator and collect statistics for the guarantee rewriter, or should we avoid evaluation in the logical optimization step and compute it in the physical planner?

I'm thinking of passing the schema and a batch to ConstEvaluator to evaluate columns and updating the statistics on each pass for the guarantee rewriter.

alamb (Contributor) commented Jun 10, 2024

@alamb Is it reasonable to evaluate columns in ConstEvaluator and collect statistics for the guarantee rewriter, or should we avoid evaluation in the logical optimization step and compute it in the physical planner?

I don't quite follow what you are proposing here.

As I understand it, the idea on this ticket is to add a pass that knows how to use Statistics to simplify expressions by creating a Simplifier and passing in the min and max values via with_guarantee.

The challenges I see are:

  1. Statistics are not available in the LogicalPlan but only in the ExecutionPlan via ExecutionPlan::statistics
  2. The ExprSimplifier::simplify API is in terms of Exprs (not PhysicalExprs)

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove inputs can never be true. However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

jayzhan211 (Contributor, Author) commented Jun 11, 2024

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove inputs can never be true.

It seems quite similar to the comments in #10400.

However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

Maybe I should work on other issues 🤔

alamb (Contributor) commented Jun 11, 2024

Maybe I should work on other issues 🤔

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days 🏃

jayzhan211 (Contributor, Author) commented

Maybe I should work on other issues 🤔

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days 🏃

I think #8708 is about 80% complete. I'm exploring the next interesting topic.

alamb (Contributor) commented Jun 11, 2024

I think #8708 is about 80% complete. I'm exploring the next interesting topic.

Let me know if you would like help breaking down the work and filing some more follow-on tickets (to organize getting some additional community help).

Depending on the kind of project you are interested in, here are some ideas (unsolicited) that I would love to help review:

  1. API design / extraction: [Epic] Extract catalog functionality from the core to make it more modular #10782 with Catalog APIs
  2. Grouping performance -- either Improve performance for grouping by variable length columns (strings) #9403 or Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 (both are pretty tricky low level optimizations)

Rock on!

jayzhan211 (Contributor, Author) commented

Improving grouping performance seems interesting!

alamb (Contributor) commented Jun 13, 2024

Improving grouping performance seems interesting!

I think it would be awesome -- thank you. How would you like to proceed? I personally think either #9403 or #6937 are super valuable

For either, I think the key will be to do some sort of POC to make sure we can actually improve performance before polishing too much.

Looking forward to working with you more
