make_array doesn't support mixed types with dictionaries #841

Closed
Kimahriman opened this issue Aug 18, 2024 · 5 comments · Fixed by #865
Labels
bug Something isn't working

Comments

@Kimahriman
Contributor

Kimahriman commented Aug 18, 2024

Describe the bug

Discovered this while working on new array functions. DataFusion's make_array doesn't play nice with underlying Dictionary types. Arguably this is yet another issue related to apache/datafusion#11513.

Steps to reproduce

Two separate cases that are easy to recreate with existing unit test setup:

checkSparkAnswerAndOperator(df.select(array(col("_13"), col("_13"))))

produces

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0
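For context on the type mismatch in that error, here is a minimal, purely illustrative sketch (plain Python, not Arrow/Comet code) of what dictionary encoding means, and why Dictionary(Int32, Utf8) is a different physical type from plain Utf8 even though the logical values are the same:

```python
# Toy model of a dictionary-encoded string column: integer keys plus a
# values table. Its Arrow type is Dictionary(Int32, Utf8), not Utf8.
dictionary = ["red", "green", "blue"]  # distinct Utf8 values
keys = [0, 2, 2, 1, 0]                 # Int32 indices into the values

# "Unpacking" (flattening) materializes the equivalent plain Utf8 column,
# which is what the expected schema in the error above asks for.
unpacked = [dictionary[k] for k in keys]
print(unpacked)  # ['red', 'blue', 'blue', 'green', 'red']
```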

Letting DataFusion infer the return type instead of specifying it results in

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)" because "dictionaryProvider" is null

which seems like an internal Comet issue. I haven't dug into this, but it is presumably fixable.

And doing a mixed dictionary/non-dictionary call like

checkSparkAnswerAndOperator(df.select(array(col("_8"), col("_13"))))

produces

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (10.10.0.29 executor driver): org.apache.comet.CometNativeException: assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Utf8
 right: Dictionary(Int32, Utf8)

in datafusion/functions-nested/src/make_array.rs:231

Expected behavior

Not sure what the expected behavior or fix is. Either implement this function from scratch with better dictionary handling, or add some wrapper around invoking the UDF that flattens dictionary-encoded arrays.

Additional context

No response

@Kimahriman Kimahriman added the bug Something isn't working label Aug 18, 2024
@Kimahriman Kimahriman changed the title make_array doesn't support dictionary types make_array doesn't support mixed types with dictionaries Aug 18, 2024
@andygrove
Member

@Kimahriman I've also been running into many issues around dictionary-encoded types. We have logic in CopyExec for unpacking them.

@Kimahriman
Contributor Author

Does it make sense to make an Expression equivalent of CopyExec? Then for things like DataFusion scalar UDFs used directly, you could just wrap them in a CopyArrays expression that unpacks dictionaries if they exist. Then you wouldn't need custom Rust code for handling it; it can just be done in the Scala query serde. The downside is that if you have multiple expressions in a single project, you might end up unpacking the same thing multiple times.
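A rough sketch of that idea in plain Python (all names here are hypothetical, not actual Comet APIs): the wrapper produces a flattened copy only when the input is dictionary-encoded, and is a pass-through otherwise:

```python
# Hypothetical CopyArrays-style unwrap; DictColumn is a toy stand-in for
# an Arrow DictionaryArray (keys + values), not a real Comet type.
class DictColumn:
    def __init__(self, keys, dictionary):
        self.keys, self.dictionary = keys, dictionary

def copy_arrays(column):
    # Analogous to Arrow's take(values, keys): materialize plain values.
    if isinstance(column, DictColumn):
        return [column.dictionary[k] for k in column.keys]
    return column  # already plain; pass through untouched

print(copy_arrays(DictColumn([0, 1, 0], ["a", "b"])))  # ['a', 'b', 'a']
print(copy_arrays(["x", "y"]))                         # ['x', 'y']
```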

@eejbyfeldt
Contributor

Letting DataFusion infer the return type instead of specifying it results in

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)" because "dictionaryProvider" is null

I believe this would be resolved by forwarding the dictionaryProvider into the CometListVector, similar to what was done with the CometMapVector and CometStructVector in this PR: https://github.com/apache/datafusion-comet/pull/789/files

Not sure what the expected behavior or fix is. Either implement this function from scratch with better dictionary handling, or add some wrapper around invoking the UDF that flattens dictionary-encoded arrays.

To me this seems like a bug in the upstream DataFusion implementation. Would it not be better to address the bug there, e.g. make that implementation behave correctly with dictionary-encoded types?

Does it make sense to make an Expression equivalent of CopyExec? Then for things like DataFusion scalar UDFs used directly, you could just wrap them in a CopyArrays expression that unpacks dictionaries if they exist. Then you wouldn't need custom Rust code for handling it; it can just be done in the Scala query serde. The downside is that if you have multiple expressions in a single project, you might end up unpacking the same thing multiple times.

Seems like that would cause an unnecessary copy in the case of make_array. To me it seems like we should just be unpacking the dictionary as part of the data getting written into the new array data structure. And other expressions might be able to do other optimizations by having data in the dictionary format, for example #504.
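The difference between the two approaches can be sketched in plain Python (illustrative only, nothing here is the actual MutableArrayData code): instead of flattening each input up front, make_array could read dictionary values through the keys while writing the output rows, so no intermediate flattened copy exists:

```python
# Toy make_array: interleave input columns into per-row lists, reading
# dictionary-encoded columns (modeled as (keys, values) tuples) through
# their keys instead of materializing a flattened copy first.
def make_array_rows(columns, num_rows):
    rows = []
    for i in range(num_rows):
        row = []
        for col in columns:
            if isinstance(col, tuple):   # (keys, values): dictionary-encoded
                keys, values = col
                row.append(values[keys[i]])
            else:                        # plain column of values
                row.append(col[i])
        rows.append(row)
    return rows

# Mixed plain / dictionary-encoded inputs, as in the failing test above:
print(make_array_rows([["x", "y"], ([1, 0], ["a", "b"])], 2))
# [['x', 'b'], ['y', 'a']]
```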

@Kimahriman
Contributor Author

I believe this would be resolved by forwarding the dictionaryProvider into the CometListVector, similar to what was done with the CometMapVector and CometStructVector in this PR: https://github.com/apache/datafusion-comet/pull/789/files

I figured it would be simple, I can look into at least fixing that.

To me this seems like a bug in the upstream datafusion implementation. Would it not be better to address the bug there? e.g make that implementation have correct behavior around dictionary encoded types.

I agree this is mostly a DataFusion bug, but it's at least partially a Comet thing for choosing to create the dictionaries in the first place. Mostly I'm asking whether it's worth coming up with a workaround, for this specific issue or more generally for any dictionary-related issue, that will let things work until DataFusion handles them correctly, which might be non-trivial to fix.

Seems like that would cause an unnecessary copy in the case of make_array. To me it seems like we should just be unpacking the dictionary as part of the data getting written into the new array data structure. And other expressions might be able to do other optimizations by having data in the dictionary format, for example #504.

I agree that is the best-case scenario. Since doing the unpacking as part of creating the array data structure might be a non-trivial fix (at least for someone like me who's just learning about DataFusion), is it worth adding an opt-in workaround that pre-unravels a dictionary before sending it into a DataFusion function, which Comet expressions can use until a better, more permanent fix is figured out?

@Kimahriman
Contributor Author

Ok, I learned a little more about DataFusion, so I think I understand what the options are now. ScalarUDFs support type coercion, which automatically casts each expression to the right type, and that gets inserted during analysis. Because of this, there's nothing technically "wrong" or buggy in DataFusion; it's possibly just sub-optimal to do a cast instead of smartly handling mixed types. Obviously Comet isn't using the DataFusion analyzer, so that will never happen automatically here. And the type coercion theoretically handles the differences between dictionaries/non-dictionaries.

Based on this, it seems like there are two options:

  • Add some form of support for handling ScalarUDF type coercion automatically
  • Don't use ScalarUDFs when type coercion may be needed and only use custom implemented expressions
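As a sketch of what the coercion step in the first option would do (plain Python over type names as strings, purely illustrative, not DataFusion's actual coercion rules): each Dictionary(K, V) argument coerces to its value type V, and the call only type-checks if all arguments then agree on one type:

```python
def value_type(t):
    # Dictionary(Int32, Utf8) -> Utf8; plain types pass through unchanged.
    if t.startswith("Dictionary("):
        return t[t.index(",") + 1 : -1].strip()
    return t

def coerced_type(arg_types):
    kinds = {value_type(t) for t in arg_types}
    if len(kinds) != 1:
        raise TypeError(f"cannot coerce mixed argument types: {arg_types}")
    return kinds.pop()  # the type every argument would be cast to

# The mixed case from this issue coerces to the shared value type:
print(coerced_type(["Utf8", "Dictionary(Int32, Utf8)"]))  # Utf8
```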

This issue was closed.
3 participants