fix: Support Dict types in `in_list` physical plans #10031

advancedxy · 2024-04-10T15:12:45Z

Which issue does this PR close?

Closes #9530.

Rationale for this change

Dictionary types are already supported in `InListExpr`. However, the `in_list` function doesn't handle
dictionary types mixed with non-dictionary types.

What changes are included in this PR?

Relax the type check in the `in_list` function.

Are these changes tested?

Added a new test.

Are there any user-facing changes?

No

advancedxy

@alamb and @jayzhan211 PTAL when you have time.

I think the dict is already flattened, so we can just relax the type check and we are good to go.

advancedxy · 2024-04-10T15:16:46Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -540,7 +553,7 @@ mod tests {
            &schema
        );

-        // expression: "a not in ("a", "b")"
+        // expression: "a in ("a", "b")"


The old comment seems wrong, so I just went ahead and fixed it.

alamb · 2024-04-10T17:33:17Z

Thank you @advancedxy -- would it be possible to add some end to end tests as well in https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest?

Perhaps we could extend https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt

advancedxy · 2024-04-11T02:13:57Z

> Thank you @advancedxy -- would it be possible to add some end to end tests as well in https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest?
>
> Perhaps we could extend https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt

Of course, let me try to add some E2E tests.

advancedxy · 2024-04-11T02:59:36Z

> Of course, let me try to add some E2E tests.

Fixed, please let me know if you have any other comments.

comphead · 2024-04-11T15:06:43Z

datafusion/physical-expr/src/expressions/in_list.rs

@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr {
    }
 }

+/// Checks if two types are logically equal, dictionary types are compared by their value types.
+fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {


Hm, I think we have a similar method in https://github.com/apache/arrow-datafusion/blob/f55c1d90215614ce531a4103c7dbebf318de1cfd/datafusion/common/src/dfschema.rs#L643 -- basically, the schema struct has a lot of equality checks between data types. We may want to move them into a separate utils struct, as they mostly work with data types rather than the schema.

alamb · 2024-04-11T19:05:24Z

I agree we could use datatype_is_logically_equal instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow-on PR.

advancedxy · 2024-04-12

Let me promote and use datatype_is_logically_equal in a follow-up PR.

I noticed that method earlier, but thought it might be a bit overkill for the cases here in in_list.
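
For readers following the thread, the relaxed comparison can be illustrated in isolation. The sketch below is a minimal, self-contained approximation: the `DataType` enum is a simplified stand-in introduced here (not Arrow's real `DataType`), and the match arms follow the `is_logically_eq` helper from this PR's diff.

```rust
/// Simplified stand-in for Arrow's `DataType` (an assumption of this sketch;
/// only the variants needed for the illustration are modeled).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Checks if two types are logically equal: a dictionary type is compared
/// by its value type, and its key type is ignored.
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
    match (lhs, rhs) {
        // Both sides are dictionaries: compare their value types.
        (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => v1 == v2,
        // Only one side is a dictionary: compare its value type to the other side.
        (DataType::Dictionary(_, v), _) => v.as_ref() == rhs,
        (_, DataType::Dictionary(_, v)) => lhs == v.as_ref(),
        // No dictionary involved: fall back to plain equality.
        _ => lhs == rhs,
    }
}
```

With this definition, a `Dictionary(Int32, Utf8)` type compares as logically equal to plain `Utf8` in either argument order, while the strict `==` comparison of the two types is false.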

alamb

The tests don't seem to cover the new code.

When I reverted the code change locally like this:

diff --git a/datafusion/physical-expr/src/expressions/in_list.rs b/datafusion/physical-expr/src/expressions/in_list.rs
index ad2da57b5..5d9fb19af 100644
--- a/datafusion/physical-expr/src/expressions/in_list.rs
+++ b/datafusion/physical-expr/src/expressions/in_list.rs
@@ -415,18 +415,6 @@ impl PartialEq<dyn Any> for InListExpr {
     }
 }

-/// Checks if two types are logically equal, dictionary types are compared by their value types.
-fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
-    match (lhs, rhs) {
-        (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
-            v1.as_ref().eq(v2.as_ref())
-        }
-        (DataType::Dictionary(_, l), _) => l.as_ref().eq(rhs),
-        (_, DataType::Dictionary(_, r)) => lhs.eq(r.as_ref()),
-        _ => lhs.eq(rhs),
-    }
-}
-
 /// Creates a unary expression InList
 pub fn in_list(
     expr: Arc<dyn PhysicalExpr>,
@@ -438,7 +426,7 @@ pub fn in_list(
     let expr_data_type = expr.data_type(schema)?;
     for list_expr in list.iter() {
         let list_expr_data_type = list_expr.data_type(schema)?;
-        if !is_logically_eq(&expr_data_type, &list_expr_data_type) {
+        if !expr_data_type.eq(&list_expr_data_type) {
             return internal_err!(
                 "The data type inlist should be same, the value type is {expr_data_type}, one of list expr type is {list_expr_data_type}"
             );

The tests still pass 🤔

$ cargo test -p datafusion-physical-expr -- in_list_utf8_with_dict_types
    Finished test [unoptimized + debuginfo] target(s) in 0.09s
     Running unittests src/lib.rs (target/debug/deps/datafusion_physical_expr-671316f98d0b9705)

running 1 test
test expressions::in_list::tests::in_list_utf8_with_dict_types ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1574 filtered out; finished in 0.00s

I worry that this means we might break this feature in the future but not know...

advancedxy · 2024-04-12T02:39:43Z

> The tests still pass 🤔
>
> I worry that this means we might break this feature in the future but not know...

Nice catch, and thanks for your careful review.
I think the problem is that the test casts the expressions to match each other's types. I updated the related test code in 07a95b7. Does that look good to you?

alamb

Thanks @advancedxy -- I still think there is something not right about this PR

I don't think this PR will change any external behavior (see my comments)

Also the sqllogictests still pass without the code change

I am beginning to wonder if we really understand the problem or not.

I would suggest approaching this problem from the other end -- can we write a test that causes the error seen in Comet (i.e. do whatever Comet is doing)? Once we have that, we can change the code to fix it.

alamb · 2024-04-13T12:56:21Z

datafusion/physical-expr/src/expressions/in_list.rs

+            ],
+        ];
+        for list in lists.iter() {
+            in_list_raw!(


I don't understand this change -- the tests now call in_list_raw directly. But there is no way that in_list_raw will be called outside of this module 🤔

advancedxy · 2024-04-13

> But there is no way that in_list_raw will be called outside of this module

The function in_list is public, so it can be called directly without any type coercion, which is exactly what Comet tries to do in the first place to leverage the static filter optimization. Since there's no type coercion, the value type and the list types may mix dictionary and non-dictionary types. I think this test simulates exactly the issue I encountered in Comet.

For the sqllogictest part, @jayzhan211 has already pointed out in #9530 (comment) that type coercion happens in the optimization phase:

> In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.

I think that's why the E2E tests already pass without this PR's fix.

alamb · 2024-04-13

Got it -- what I don't understand is how these validate the Comet use case. I expected them to call in_list (instead they are calling the in_list_raw! macro).

What I expected to see was a test that mirrors what Comet does: call in_list with a Dictionary column but string literals (that haven't been type coerced).

Given that this case currently errors, we have no test coverage, even if the in_list implementation does actually support it.

Sorry to be so pedantic about this, but I think it is somewhat subtle, so making sure we get it right (and don't accidentally break it in the future) is important.

advancedxy · 2024-04-13

> Got it -- what I don't understand is how these validate the Comet use case. I expected them to call in_list (instead they are calling the in_list_raw! macro).

There might be some misunderstanding here. in_list and in_list_raw are both test macros; the main difference is that the former performs type coercion first and the latter does not. Both macros call the in_list method, though. The Comet case calls the in_list method (not the test macro) directly without type coercion, which is exactly what the in_list_utf8_with_dict_types test tries to simulate by calling the in_list_raw test macro.

> Sorry to be so pedantic about this, but I think it is somewhat subtle, so making sure we get it right (and don't accidentally break it in the future) is important.

No worries. I think it's exactly the spirit we need for better software engineering practice. And I totally agree that it's important to make sure we get it right and won't break it in the future.

alamb · 2024-04-13T20:00:01Z

> There might be some misunderstanding here. in_list and in_list_raw are both test macros; the main difference is that the former performs type coercion first and the latter does not.

Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.
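
The distinction between the two test macros can be sketched as follows. This is an illustration only, not the PR's test code: the `DataType` enum, `coerce_list`, and the `check` parameter are hypothetical stand-ins, the coercion is deliberately crude (every list item is simply given the expression's type), and `in_list` here models only the type validation step.

```rust
/// Simplified stand-in for Arrow's `DataType` (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Signature shared by the strict and the relaxed type checks.
type TypeCheck = fn(&DataType, &DataType) -> bool;

/// The check before the PR: the types must be identical.
fn strict_eq(lhs: &DataType, rhs: &DataType) -> bool {
    lhs == rhs
}

/// The relaxed check: unwrap one level of dictionary encoding on each side.
fn logical_eq(lhs: &DataType, rhs: &DataType) -> bool {
    fn value_type(t: &DataType) -> &DataType {
        match t {
            DataType::Dictionary(_, v) => v.as_ref(),
            other => other,
        }
    }
    value_type(lhs) == value_type(rhs)
}

/// Stand-in for the public `in_list` function; only its type validation is modeled.
fn in_list(expr: &DataType, list: &[DataType], check: TypeCheck) -> Result<(), String> {
    for item in list {
        if !check(expr, item) {
            return Err(format!(
                "The data type inlist should be same, the value type is {:?}, one of list expr type is {:?}",
                expr, item
            ));
        }
    }
    Ok(())
}

/// Hypothetical coercion: every list item is cast to the expression's type.
fn coerce_list(expr: &DataType, list: &[DataType]) -> Vec<DataType> {
    list.iter().map(|_| expr.clone()).collect()
}

/// Calls `in_list` directly, with no coercion (the Comet-like path).
macro_rules! in_list_raw {
    ($expr:expr, $list:expr, $check:expr) => {
        in_list($expr, $list, $check)
    };
}

/// Coerces the list first, then goes through the same `in_list` function.
macro_rules! in_list {
    ($expr:expr, $list:expr, $check:expr) => {{
        let coerced = coerce_list($expr, $list);
        in_list_raw!($expr, &coerced, $check)
    }};
}
```

In this sketch, with a `Dictionary(Int32, Utf8)` column and `Utf8` literals, `in_list!` returns `Ok` even under `strict_eq` because the coercion has already aligned the types, whereas `in_list_raw!` returns `Err` under `strict_eq` and `Ok` under `logical_eq`. Only the raw variant would notice if the relaxed check were reverted.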

advancedxy · 2024-04-13T13:33:32Z

> Thanks @advancedxy -- I still think there is something not right about this PR
>
> I don't think this PR will change any external behavior (see my comments)
>
> Also the sqllogictests still pass without the code change
>
> I am beginning to wonder if we really understand the problem or not.
>
> I would suggest approaching this problem from the other end -- can we write a test that causes the error seen in Comet (i.e. do whatever Comet is doing)? Once we have that, we can change the code to fix it.

Thanks for your feedback and suggestions. I replied to your comments, which I think should address your concerns; the current test in this PR should already do what Comet is trying to do. Please let me know if you have any further questions or concerns. Thanks again.

alamb

Thank you @advancedxy -- yes, I was confused about the macro. Thanks for bearing with me!

fix: Relax type check with dict types in in_list

f8e5cdf

github-actions bot added the physical-expr Physical Expressions label Apr 10, 2024

refine comments

70f5481

advancedxy commented Apr 10, 2024

fix style, refine comments and address reviewer's comments

9c279f7

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 11, 2024

refine comments

5e593bb

alamb mentioned this pull request Apr 11, 2024

DataFusion weekly project plan (Andrew Lamb) - April 8, 2024 #10002

Closed

9 tasks

comphead reviewed Apr 11, 2024

alamb reviewed Apr 11, 2024

address comments

07a95b7

advancedxy force-pushed the issue_9530 branch from 318723e to 07a95b7 Compare April 12, 2024 02:35

alamb reviewed Apr 13, 2024

alamb approved these changes Apr 13, 2024

alamb changed the title ~~fix: Relax type check with dict types in in_list~~ fix: Support Dict types in in_list physical plans Apr 13, 2024

alamb merged commit d698d9d into apache:main Apr 13, 2024
24 checks passed

advancedxy deleted the issue_9530 branch April 16, 2024 07:56

advancedxy mentioned this pull request Jun 11, 2024

chore: Make DFSchema::datatype_is_logically_equal function public #10867

Merged

This pull request was closed.
