make_array doesn't support mixed types with dictionaries #841

Closed
Kimahriman opened this issue Aug 18, 2024 · 5 comments · Fixed by #865
Labels
bug Something isn't working

Comments

@Kimahriman
Contributor

Kimahriman commented Aug 18, 2024

Describe the bug

Discovered this while working on new array functions. DataFusion's make_array doesn't play nice with underlying Dictionary types. Arguably this is yet another issue related to apache/datafusion#11513.

Steps to reproduce

Two separate cases that are easy to recreate with existing unit test setup:

checkSparkAnswerAndOperator(df.select(array(col("_13"), col("_13"))))

produces

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0
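For context on the type mismatch in that error, here is a minimal, purely illustrative sketch (plain Python, not Arrow/Comet code) of what dictionary encoding means, and why Dictionary(Int32, Utf8) is a different physical type from plain Utf8 even though the logical values are the same:

```python
# Toy model of a dictionary-encoded string column: integer keys plus a
# values table. Its Arrow type is Dictionary(Int32, Utf8), not Utf8.
dictionary = ["red", "green", "blue"]  # distinct Utf8 values
keys = [0, 2, 2, 1, 0]                 # Int32 indices into the values

# "Unpacking" (flattening) materializes the equivalent plain Utf8 column,
# which is what the expected schema in the error above asks for.
unpacked = [dictionary[k] for k in keys]
print(unpacked)  # ['red', 'blue', 'blue', 'green', 'red']
```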

Letting DataFusion infer the return type instead of specifying it results in

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)" because "dictionaryProvider" is null

which seems like an internal Comet issue. I haven't dug into this, but it is presumably fixable.

And doing a mixed dictionary/non-dictionary call like

checkSparkAnswerAndOperator(df.select(array(col("_8"), col("_13"))))

produces

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (10.10.0.29 executor driver): org.apache.comet.CometNativeException: assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Utf8
 right: Dictionary(Int32, Utf8)

in datafusion/functions-nested/src/make_array.rs:231

Expected behavior

Not sure what the expected behavior or fix is. Either implement this function from scratch with better dictionary handling, or add some wrapper around invoking the UDF that flattens dictionary-encoded arrays.

Additional context

No response

@Kimahriman Kimahriman added the bug Something isn't working label Aug 18, 2024
@Kimahriman Kimahriman changed the title make_array doesn't support dictionary types make_array doesn't support mixed types with dictionaries Aug 18, 2024
@andygrove
Member

@Kimahriman I've also been running into many issues around dictionary-encoded types. We have logic in CopyExec for unpacking them.

@Kimahriman
Contributor Author

Does it make sense to make an Expression equivalent of CopyExec? Then for things like DataFusion scalar UDFs used directly, you could just wrap them in a CopyArrays expression that unpacks dictionaries if they exist. Then you wouldn't need custom Rust code for handling it; it can just be done in the Scala query serde. The downside is that if you have multiple expressions in a single project, you might end up unpacking the same thing multiple times.
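A rough sketch of that idea in plain Python (all names here are hypothetical, not actual Comet APIs): the wrapper produces a flattened copy only when the input is dictionary-encoded, and is a pass-through otherwise:

```python
# Hypothetical CopyArrays-style unwrap; DictColumn is a toy stand-in for
# an Arrow DictionaryArray (keys + values), not a real Comet type.
class DictColumn:
    def __init__(self, keys, dictionary):
        self.keys, self.dictionary = keys, dictionary

def copy_arrays(column):
    # Analogous to Arrow's take(values, keys): materialize plain values.
    if isinstance(column, DictColumn):
        return [column.dictionary[k] for k in column.keys]
    return column  # already plain; pass through untouched

print(copy_arrays(DictColumn([0, 1, 0], ["a", "b"])))  # ['a', 'b', 'a']
print(copy_arrays(["x", "y"]))                         # ['x', 'y']
```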

@eejbyfeldt
Contributor

Letting DataFusion infer the return type instead of specifying it results in

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)" because "dictionaryProvider" is null

I believe this would be resolved by forwarding the dictionaryProvider into the CometListVector, similar to what was done with the CometMapVector and CometStructVector in this PR: https://github.com/apache/datafusion-comet/pull/789/files

Not sure what the expected behavior or fix is. Either implement this function from scratch with better dictionary handling, or add some wrapper around invoking the UDF that flattens dictionary-encoded arrays.

To me this seems like a bug in the upstream DataFusion implementation. Would it not be better to address the bug there, e.g. make that implementation behave correctly with dictionary-encoded types?

Does it make sense to make an Expression equivalent of CopyExec? Then for things like DataFusion scalar UDFs used directly, you could just wrap them in a CopyArrays expression that unpacks dictionaries if they exist. Then you wouldn't need custom Rust code for handling it; it can just be done in the Scala query serde. The downside is that if you have multiple expressions in a single project, you might end up unpacking the same thing multiple times.

Seems like that would cause an unnecessary copy in the case of make_array. To me it seems like we should just be unpacking the dictionary as part of the data getting written into the new array data structure. And other expressions might be able to do other optimizations by having data in the dictionary format, for example #504.
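The difference between the two approaches can be sketched in plain Python (illustrative only, nothing here is the actual MutableArrayData code): instead of flattening each input up front, make_array could read dictionary values through the keys while writing the output rows, so no intermediate flattened copy exists:

```python
# Toy make_array: interleave input columns into per-row lists, reading
# dictionary-encoded columns (modeled as (keys, values) tuples) through
# their keys instead of materializing a flattened copy first.
def make_array_rows(columns, num_rows):
    rows = []
    for i in range(num_rows):
        row = []
        for col in columns:
            if isinstance(col, tuple):   # (keys, values): dictionary-encoded
                keys, values = col
                row.append(values[keys[i]])
            else:                        # plain column of values
                row.append(col[i])
        rows.append(row)
    return rows

# Mixed plain / dictionary-encoded inputs, as in the failing test above:
print(make_array_rows([["x", "y"], ([1, 0], ["a", "b"])], 2))
# [['x', 'b'], ['y', 'a']]
```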

@Kimahriman
Contributor Author

I believe this would be resolved by forwarding the dictionaryProvider into the CometListVector, similar to what was done with the CometMapVector and CometStructVector in this PR: https://github.com/apache/datafusion-comet/pull/789/files

I figured it would be simple, I can look into at least fixing that.

To me this seems like a bug in the upstream datafusion implementation. Would it not be better to address the bug there? e.g make that implementation have correct behavior around dictionary encoded types.

I agree this is mostly a DataFusion bug, but it's at least partially a Comet thing for choosing to create the dictionaries in the first place. Mostly I'm asking whether it's worth coming up with a workaround, for this specific issue or more generally for any dictionary-related issue, that will let things work until DataFusion handles them correctly, which might be non-trivial to fix.

Seems like that would cause an unnecessary copy in the case of make_array. To me it seems like we should just be unpacking the dictionary as part of the data getting written into the new array data structure. And other expressions might be able to do other optimizations by having data in the dictionary format, for example #504.

I agree that is the best-case scenario. Since doing the unpacking as part of creating the array data structure might be a non-trivial fix (at least for someone like me who's just learning about DataFusion), is it worth adding an opt-in workaround that pre-unravels a dictionary before sending it into a DataFusion function, which Comet expressions can use until a better, more permanent fix is figured out?

@Kimahriman
Contributor Author

Ok, I learned a little more about DataFusion, so I think I understand what the options are now. ScalarUDFs support type coercion, which automatically casts each expression to the right type, and that gets inserted during analysis. Because of this, there's nothing technically "wrong" or buggy in DataFusion; it's possibly just sub-optimal to do a cast instead of smartly handling mixed types. Obviously Comet isn't using the DataFusion analyzer, so that will never happen automatically here. And the type coercion theoretically handles the differences between dictionaries/non-dictionaries.

Based on this, it seems like there are two options:

  • Add some form of support for handling ScalarUDF type coercion automatically
  • Don't use ScalarUDFs when type coercion may be needed and only use custom implemented expressions
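As a sketch of what the coercion step in the first option would do (plain Python over type names as strings, purely illustrative, not DataFusion's actual coercion rules): each Dictionary(K, V) argument coerces to its value type V, and the call only type-checks if all arguments then agree on one type:

```python
def value_type(t):
    # Dictionary(Int32, Utf8) -> Utf8; plain types pass through unchanged.
    if t.startswith("Dictionary("):
        return t[t.index(",") + 1 : -1].strip()
    return t

def coerced_type(arg_types):
    kinds = {value_type(t) for t in arg_types}
    if len(kinds) != 1:
        raise TypeError(f"cannot coerce mixed argument types: {arg_types}")
    return kinds.pop()  # the type every argument would be cast to

# The mixed case from this issue coerces to the shared value type:
print(coerced_type(["Utf8", "Dictionary(Int32, Utf8)"]))  # Utf8
```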

This issue was closed.
3 participants