chore: Bump arrow-rs to 53.1.0 and datafusion #1001

kazuyukitanimura · 2024-10-08T00:17:12Z

Which issue does this PR close?

Rationale for this change

Arrow-rs 53.1.0 includes performance improvements

What changes are included in this PR?

Bumping arrow-rs to 53.1.0 and datafusion to a revision

How are these changes tested?

existing tests

kazuyukitanimura · 2024-10-09T22:41:47Z

@alamb @jayzhan211
I am still investigating, but it looks like there is a regression in DataFusion related to apache/datafusion#12753

assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Struct([Field { name: "a", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
 right: Struct([Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
        at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
        at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036)
        at std::panicking::rust_panic_with_hook(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799)
        at std::panicking::begin_panic_handler::{{closure}}(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664)
        at std::sys_common::backtrace::__rust_end_short_backtrace(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171)
        at rust_begin_unwind(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652)
        at core::panicking::panic_fmt(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72)
        at core::panicking::assert_failed_inner(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:404)
        at core::panicking::assert_failed(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364)
        at arrow_data::transform::MutableArrayData::with_capacities(/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-53.1.0/src/transform/mod.rs:418)
        at datafusion_functions_nested::make_array::array_array(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:223)
        at datafusion_functions_nested::make_array::make_array_inner(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:153)
        at core::ops::function::Fn::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:79)
        at datafusion_functions_nested::utils::make_scalar_function::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/utils.rs:83)
        at <datafusion_functions_nested::make_array::MakeArray as datafusion_expr::udf::ScalarUDFImpl>::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:96)
        at datafusion_expr::udf::ScalarUDF::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/expr/src/udf.rs:197)
        at <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-expr/src/scalar_function.rs:145)
        at datafusion_physical_plan::projection::ProjectionStream::batch_project::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-plan/src/projection.rs:294)

In particular, at https://github.com/apache/datafusion/pull/12753/files#diff-f1e354d4fe26237064d8194e10a6008efa4f88e2b68b8a8352086a5d011180b8R108, type_union_resolution returns None for an array that contains both Utf8 and Dictionary(Int32, Utf8).

Previously, coerce_types was coercing such cases. Not sure if this is an intentional change...

kazuyukitanimura · 2024-10-09T23:36:39Z

Just realized that type_union_resolution_coercion is not handling struct...

jayzhan211 · 2024-10-09T23:44:35Z

We need to support Struct, but I'm not sure whether we should discard the dictionary type.

I assume that if you have a dictionary type in the first place, you expect some optimization based on it. So if we have both a primitive type and a dictionary type, I think it makes sense to keep the dictionary type and coerce the primitive type to it.

The previous code used comparison coercion. In that case we only care about comparison, so we don't need the dictionary type at all.
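To make the two policies discussed above concrete, here is a minimal sketch over a toy type model. The `DataType` enum and the functions `keep_dictionary` and `discard_dictionary` are hypothetical names invented for illustration; they are not the arrow-rs or DataFusion definitions and only model the decision being debated.

```rust
// Toy stand-in for Arrow's DataType; NOT the real arrow-rs enum.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum DataType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical policy 1: when a plain type meets a dictionary whose value
/// type is that plain type, keep the dictionary type.
#[allow(dead_code)]
fn keep_dictionary(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (a, b) if a == b => Some(a.clone()),
        (DataType::Dictionary(_, v), other) if **v == *other => Some(lhs.clone()),
        (other, DataType::Dictionary(_, v)) if **v == *other => Some(rhs.clone()),
        _ => None,
    }
}

/// Hypothetical policy 2 (comparison-style): drop the dictionary wrapper and
/// return the plain value type.
#[allow(dead_code)]
fn discard_dictionary(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (a, b) if a == b => Some(a.clone()),
        (DataType::Dictionary(_, v), other) if **v == *other => Some(other.clone()),
        (other, DataType::Dictionary(_, v)) if **v == *other => Some(other.clone()),
        _ => None,
    }
}
```

In this toy model, resolving `Utf8` against `Dictionary(Int32, Utf8)` yields the dictionary type under the first policy and plain `Utf8` under the second. Which one the real coercion code should choose is exactly the open question in this comment.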

kazuyukitanimura · 2024-10-10T06:49:46Z

Confirmed that the fix is to add struct support. Opened apache/datafusion#12843.
It would be ideal to fix it before the next release; otherwise it is a regression. @jayzhan211 @alamb
For this specific test case to pass, I just needed to add

.or_else(|| struct_coercion(lhs_type, rhs_type))

at the end of type_union_resolution_coercion. Dictionary is fine; it seems to be working.
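The effect of appending a struct arm through `.or_else` can be sketched with a toy model. Everything below is an assumption made for illustration: the `DataType` enum and the functions `toy_dictionary_coercion`, `toy_struct_coercion`, and `toy_union_resolution` are hypothetical and do not reproduce DataFusion's actual implementation. Only the idea of chaining a struct coercion at the end comes from the comment above.

```rust
// Toy stand-in for Arrow's DataType; NOT the real arrow-rs enum.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum DataType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
    // Struct as a list of (field name, field type)
    Struct(Vec<(String, DataType)>),
}

/// Hypothetical dictionary arm: keep the dictionary when the other side is
/// its value type. It knows nothing about structs.
#[allow(dead_code)]
fn toy_dictionary_coercion(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (DataType::Dictionary(_, v), other) if **v == *other => Some(lhs.clone()),
        (other, DataType::Dictionary(_, v)) if **v == *other => Some(rhs.clone()),
        _ => None,
    }
}

/// Hypothetical struct arm: resolve the fields pairwise and fail if any field
/// name differs or any pair of field types cannot be resolved.
#[allow(dead_code)]
fn toy_struct_coercion(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (DataType::Struct(l), DataType::Struct(r)) if l.len() == r.len() => {
            let fields = l
                .iter()
                .zip(r.iter())
                .map(|((ln, lt), (rn, rt))| {
                    if ln != rn {
                        return None;
                    }
                    toy_union_resolution(lt, rt).map(|t| (ln.clone(), t))
                })
                .collect::<Option<Vec<_>>>()?;
            Some(DataType::Struct(fields))
        }
        _ => None,
    }
}

/// Hypothetical resolution chain: the struct arm is only tried when the
/// earlier arm returns None.
#[allow(dead_code)]
fn toy_union_resolution(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    if lhs == rhs {
        return Some(lhs.clone());
    }
    toy_dictionary_coercion(lhs, rhs).or_else(|| toy_struct_coercion(lhs, rhs))
}
```

With the two struct types from the panic message, `Struct([a: Utf8])` and `Struct([a: Dictionary(Int32, Utf8)])`, the dictionary arm alone returns `None` in this model, while the full chain resolves to a struct whose field `a` has the dictionary type.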

kazuyukitanimura · 2024-10-11T17:54:10Z

@andygrove @comphead @huaxingao @mbutrovich @parthchandra @viirya

comphead

This looks good to me, thanks @kazuyukitanimura. I'm just wondering whether we should have main pointing to DF at a specific commit. Is there a reason for that, like making a Comet release on top of a specific DF version?

kazuyukitanimura · 2024-10-11T21:48:28Z

Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

andygrove · 2024-10-14T14:24:33Z

> Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

I think we can start the DF 43 release prep this week

parthchandra

lgtm

comphead

lgtm

kazuyukitanimura · 2024-10-14T22:44:55Z

Thanks, merged @jayzhan211 @comphead @andygrove @parthchandra @viirya

## Which issue does this PR close?

Closes #.

## Rationale for this change

## What changes are included in this PR?

```
cb3e977 perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin (apache#1007)
3df9d5c fix: Make comet-git-info.properties optional (apache#1027)
4033687 chore: Reserve memory for native shuffle writer per partition (apache#1022)
bd541d6 (public/main) remove hard-coded version number from Dockerfile (apache#1025)
e3ac6cf feat: Implement bloom_filter_agg (apache#987)
8d097d5 (origin/main) chore: Revert "chore: Reserve memory for native shuffle writer per partition (apache#988)" (apache#1020)
591f45a chore: Bump arrow-rs to 53.1.0 and datafusion (apache#1001)
e146cfa chore: Reserve memory for native shuffle writer per partition (apache#988)
abd9f85 fix: Fallback to Spark if named_struct contains duplicate field names (apache#1016)
22613e9 remove legacy comet-spark-shell (apache#1013)
d40c802 clarify that Maven central only has jars for Linux (apache#1009)
837c256 docs: Various documentation improvements (apache#1005)
0667c60 chore: Make parquet reader options Comet options instead of Hadoop options (apache#968)
0028f1e fix: Fallback to Spark if scan has meta columns (apache#997)
b131cc3 feat: Support `GetArrayStructFields` expression (apache#993)
3413397 docs: Update tuning guide (apache#995)
afd28b9 Quality of life fixes for easier hacking (apache#982)
18150fb chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled (apache#991)
a1599e2 chore: Update for 0.3.0 release, prepare for 0.4.0 development (apache#970)
```

## How are these changes tested?

kazuyukitanimura added 3 commits October 7, 2024 13:08

chore: Bump arrow-rs to 53.1.0 and datafusion

e6a3c45

fix

bdd46a5

Merge remote-tracking branch 'upstream/main' into bump-arrow-rs

7e19880

kazuyukitanimura mentioned this pull request Oct 10, 2024

Regression on coercing Array of Structs apache/datafusion#12843

Closed

chore: Bump arrow-rs and datafusion

a1a0294

kazuyukitanimura requested review from andygrove, comphead, huaxingao and viirya October 11, 2024 17:54

comphead reviewed Oct 11, 2024

parthchandra approved these changes Oct 14, 2024

comphead approved these changes Oct 14, 2024

viirya approved these changes Oct 14, 2024

kazuyukitanimura merged commit 591f45a into apache:main Oct 14, 2024
