Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Support Dict types in in_list physical plans #10031

Merged
merged 5 commits into from
Apr 13, 2024

Conversation

advancedxy
Copy link
Contributor

Which issue does this PR close?

Closes #9530.

Rationale for this change

Dictionary types are already supported in the InListExpr, however, the in_list func doesn't handle types being
mixed in with Dictionary types.

What changes are included in this PR?

Relax type check in the in_list function.

Are these changes tested?

Added new test

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Physical Expressions label Apr 10, 2024
Copy link
Contributor Author

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb and @jayzhan211 PTAL when you have time.

I think the dict is already flattened, we can just relax the type check and we are good to go.

@@ -540,7 +553,7 @@ mod tests {
&schema
);

// expression: "a not in ("a", "b")"
// expression: "a in ("a", "b")"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong, just go ahead and fix it.

@alamb
Copy link
Contributor

alamb commented Apr 10, 2024

Thank you @advancedxy -- would it be possible to add some end to end tests as well in https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest?

Perhaps we could extend https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt

@advancedxy
Copy link
Contributor Author

Thank you @advancedxy -- would it be possible to add some end to end tests as well in https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest?

Perhaps we could extend https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt

Of course, let me try to add some E2E tests.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 11, 2024
@advancedxy
Copy link
Contributor Author

Of course, let me try to add some E2E tests.

Fixed, please let me if you have any other comments.

@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr {
}
}

/// Checks if two types are logically equal, dictionary types are compared by their value types.
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I think we have a similar method in https://github.com/apache/arrow-datafusion/blob/f55c1d90215614ce531a4103c7dbebf318de1cfd/datafusion/common/src/dfschema.rs#L643 basically the schema struct has a lot of equal checks between datatypes. may want to move them into separate utils struct as they mostly working with datatypes rather than schema

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we could use datatype_is_logically_equal instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me promote and use datatype_is_logically_equal in a follow up PR.

I noticed that method earlier, but thought it might be a bit overkill for cases here in the in_list.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests don't seem to cover the new code.

When I reverted the code change locally like this:

diff --git a/datafusion/physical-expr/src/expressions/in_list.rs b/datafusion/physical-expr/src/expressions/in_list.rs
index ad2da57b5..5d9fb19af 100644
--- a/datafusion/physical-expr/src/expressions/in_list.rs
+++ b/datafusion/physical-expr/src/expressions/in_list.rs
@@ -415,18 +415,6 @@ impl PartialEq<dyn Any> for InListExpr {
     }
 }

-/// Checks if two types are logically equal, dictionary types are compared by their value types.
-fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
-    match (lhs, rhs) {
-        (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
-            v1.as_ref().eq(v2.as_ref())
-        }
-        (DataType::Dictionary(_, l), _) => l.as_ref().eq(rhs),
-        (_, DataType::Dictionary(_, r)) => lhs.eq(r.as_ref()),
-        _ => lhs.eq(rhs),
-    }
-}
-
 /// Creates a unary expression InList
 pub fn in_list(
     expr: Arc<dyn PhysicalExpr>,
@@ -438,7 +426,7 @@ pub fn in_list(
     let expr_data_type = expr.data_type(schema)?;
     for list_expr in list.iter() {
         let list_expr_data_type = list_expr.data_type(schema)?;
-        if !is_logically_eq(&expr_data_type, &list_expr_data_type) {
+        if !expr_data_type.eq(&list_expr_data_type) {
             return internal_err!(
                 "The data type inlist should be same, the value type is {expr_data_type}, one of list expr type is {list_expr_data_type}"
             );

The tests still pass 🤔

$ cargo test -p datafusion-physical-expr -- in_list_utf8_with_dict_types
    Finished test [unoptimized + debuginfo] target(s) in 0.09s
     Running unittests src/lib.rs (target/debug/deps/datafusion_physical_expr-671316f98d0b9705)

running 1 test
test expressions::in_list::tests::in_list_utf8_with_dict_types ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1574 filtered out; finished in 0.00s

I worry that this means we might break this feature in the future but not know...

@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr {
}
}

/// Checks if two types are logically equal, dictionary types are compared by their value types.
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we could use datatype_is_logically_equal instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PR

@advancedxy
Copy link
Contributor Author

advancedxy commented Apr 12, 2024

The tests still pass 🤔

I worry that this means we might break this feature in the future but not know...

Nice catch, and thanks for your careful review.
I think the problem is that we cast exprs to match each other's types in test. I update the related test code in 07a95b7. Does that look good to you?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @advancedxy -- I still think there is something not right about this PR

I don't think this PR will change any external behavior (see my comments)

Also the sqllogictests still pass without the code change

I am beginning to wonder if we really understand the problem or not.

I would suggest starting at this problem from the other end -- can we write a test that causes the error seen in comet (aka do whatever comet is doing). Once we have that then we can change to code to fix it

],
];
for list in lists.iter() {
in_list_raw!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change -- the tests now call in_list_raw directly. But there is no way that in_list_raw will be called outside of this module 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is no way that in_list_raw will be called outside of this module

The function in_list is public, so it's possible that in_list is called directly without any type coercion, which is exactly what Comet tries to do in the first place to leverage the static filter optimization. Since there's no type coercion, it's possible the value type and the list type are mixed with dictionary types. I think this test simulates exactly the issue I encountered in the Comet's case.

For the sqllogictest part, jayzhan11 has already pointed out in #9530 (comment), the type coercion is happened in the optimization phase.

In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.

I think that's why the E2E tests already pass without this PR's fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list (instead they calling the in_list_raw! macro)

What I was expected to see was a test that mirrors what comet does: call in_list with a Dictionary column but string literals (that haven't ben type cerced).

Given this case current errors, we have no test coverage, even if the in_list implementation does actually support it.

Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list (instead they calling the in_list_raw! macro)

There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not. These two macros both call the in_list method though. The comet case calls the in_list method(not the test macro) directly without type coercion, which is exactly the in_list_utf8_with_dict_types test trying to simulate by calling the in_list_raw test macro.

Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important

No worries. I think it's exactly the spirit we need towards a better software engineering practice. And totally agree that it's important to make sure we get it right and won't break it in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not.

Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.

@advancedxy
Copy link
Contributor Author

Thanks @advancedxy -- I still think there is something not right about this PR

I don't think this PR will change any external behavior (see my comments)

Also the sqllogictests still pass without the code change

I am beginning to wonder if we really understand the problem or not.

I would suggest starting at this problem from the other end -- can we write a test that causes the error seen in comet (aka do whatever comet is doing). Once we have that then we can change to code to fix it

Thanks for your feedback and suggestions. I replied to your comments, which I think should explain your concerns and the current test in this PR should already do what Comet is trying to do. Please let me know if you have any further questions ior concerns. Thanks again.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @advancedxy -- yes I was confused about the macro. Thanks for bearing with me !

],
];
for list in lists.iter() {
in_list_raw!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not.

Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.

@alamb alamb changed the title fix: Relax type check with dict types in in_list fix: Support Dict types in in_list physical plans Apr 13, 2024
@alamb alamb merged commit d698d9d into apache:main Apr 13, 2024
24 checks passed
@advancedxy advancedxy deleted the issue_9530 branch April 16, 2024 07:56
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

in_list doesn't handle value being a Dictionary type when check data type matches
3 participants