Contributor

@LuciferYang LuciferYang commented Jan 19, 2024

What changes were proposed in this pull request?

This PR upgrades Arrow from 14.0.2 to 15.0.0. The new version fixes the compatibility issue with Netty 4.1.104.Final (GH-39265).

Additionally, since the arrow-vector module now uses eclipse-collections instead of netty-common as a compile-scope dependency, Apache Spark gains a dependency on eclipse-collections after upgrading to Arrow 15.0.0.

Why are the changes needed?

The new version brings the following major changes:

Bug Fixes
GH-34610 - [Java] Fix valueCount and field name when loading/transferring NullVector
GH-38242 - [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor
GH-38254 - [Java] Add reusable buffer getters to char/binary vectors
GH-38366 - [Java] Fix Murmur hash on buffers less than 4 bytes
GH-38387 - [Java] Fix JDK8 compilation issue with TestAllTypes
GH-38614 - [Java] Add VarBinary and VarCharWriter helper methods to more writers
GH-38725 - [Java] decompression in Lz4CompressionCodec.java does not set writer index

New Features and Improvements
GH-38511 - [Java] Add getTransferPair(Field, BufferAllocator, CallBack) for StructVector and MapVector
GH-14936 - [Java] Remove netty dependency from arrow-vector
GH-38990 - [Java] Upgrade to flatc version 23.5.26
GH-39265 - [Java] Make it run well with the netty newest version 4.1.104

The full release notes are as follows:

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the BUILD label Jan 19, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment

Nice!

val name = v.data.getName
name.startsWith("pmml-model-") || name.startsWith("scala-collection-compat_") ||
name.startsWith("jsr305-") || name.startsWith("netty-") || name == "unused-1.0.0.jar"
val validPrefixes = Set("spark-connect", "unused-", "guava-", "failureaccess-",
Contributor Author

Without this modification, the connect server assembly jar cannot start after the upgrade to Arrow 15: the assembly jar includes too many unnecessary jar files, and some of them conflict after the upgrade.

However, I think this change is quite general, so I will submit it in a separate PR first.
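
Editorial sketch of the idea behind the diff excerpt above, assuming the new `validPrefixes` set acts as an allow-list: a jar stays in the assembly only if its file name starts with one of the listed prefixes, instead of excluding a few known-bad names. `AssemblyJarFilter`, `isExcluded`, and `keep` are hypothetical names, not the actual SparkBuild.scala code, and the set holds only the four prefixes visible in the truncated excerpt.

```scala
// Hypothetical helper illustrating allow-list filtering of assembly jars.
// This is not the real sbt build definition.
object AssemblyJarFilter {
  // Only the four prefixes visible in the diff excerpt; the original line is truncated.
  val validPrefixes: Set[String] =
    Set("spark-connect", "unused-", "guava-", "failureaccess-")

  // A jar is excluded unless its file name starts with an allowed prefix.
  def isExcluded(jarName: String): Boolean =
    !validPrefixes.exists(prefix => jarName.startsWith(prefix))

  // Keep only the jars that the allow-list admits, preserving input order.
  def keep(jarNames: Seq[String]): Seq[String] =
    jarNames.filterNot(isExcluded)
}
```

Under this assumption, a newly introduced transitive jar such as `eclipse-collections-11.1.0.jar` is left out of the assembly by default, whereas a deny-list would have to be extended for every new dependency. The jar names used to try it out (for example `guava-example.jar`) are made up for illustration.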

Member

+1 for that.

@LuciferYang LuciferYang marked this pull request as draft January 19, 2024 09:58
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
eclipse-collections-api/11.1.0//eclipse-collections-api-11.1.0.jar
eclipse-collections/11.1.0//eclipse-collections-11.1.0.jar
Member

Just a question: are these new dependencies unavoidable?

Contributor Author

@LuciferYang LuciferYang Jan 22, 2024

In apache/arrow#38493, netty-common was replaced by eclipse-collections in the arrow-vector module. Let's test excluding it; I expect there will be test failures.

If there are no test failures, I will further confirm it from the code later.

Member

Thank you for trying.

Contributor Author

https://github.com/LuciferYang/spark/actions/runs/7608110702/job/20716682222

The test results show that we must add this dependency: it is used in the initialization of StructVector.

Member

Thank you for confirming.

@LuciferYang LuciferYang changed the title from "[SPARK-46718][BUILD] Test arrow 15" to "[SPARK-46718][BUILD] Upgrade Arrow to 15.0.0" Jan 23, 2024
@LuciferYang LuciferYang marked this pull request as ready for review January 23, 2024 03:09
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @LuciferYang .
Merged to master for Apache Spark 4.0.0.

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun ~

@LuciferYang LuciferYang deleted the SPARK-46718 branch May 1, 2025 13:40
2 participants