[DO NOT MERGE] [HUDI-6198] Run gh actions with Spark 3.4.0 #8682

mansipp · 2023-05-10T20:56:34Z

Change Logs

Changes to support Spark 3.4.0

Impact

Upgrade Spark to 3.4.0

Risk level (write none, low, medium or high below)

Low

Documentation Update

Need doc update

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2023-05-11T03:07:34Z

.github/workflows/bot.yml

-
          - scalaProfile: "scala-2.12"
-            sparkProfile: "spark3.3"
+            sparkProfile: "spark3.4"


Why do we have so many changes?

mansipp · 2023-05-15T16:58:24Z

@wzx140 @danny0405
Can you please provide more information on how record size is measured? There are some unit test failures.
Unit tests: https://github.com/apache/hudi/actions/runs/4962970537/jobs/8881734351?pr=8682

yihua · 2023-05-16T21:39:18Z

BTW, for bundle validation in Java CI, we need to create a new Docker image based on Spark 3.4; otherwise, bundle validation won't work there. I'll upload the image to Docker Hub.

hudi-bot · 2023-05-17T20:48:53Z

CI report:

c23f6ed UNKNOWN
d3756a6 Azure: FAILURE
fc1649b UNKNOWN
bf9a30c UNKNOWN
6d916f2 UNKNOWN
9bcf1ed UNKNOWN
28c025a UNKNOWN
7cc4d1b UNKNOWN
0846e31 UNKNOWN
373dd9f UNKNOWN
504d795 UNKNOWN
19a907a UNKNOWN

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

rahil-c · 2023-05-22T23:34:29Z

Hi @danny0405 @xiarixiaoyao, we are trying to upgrade Spark to 3.4.0 in Hudi. However, several functional tests are failing due to another casting exception. For example, when running TestAvroSchemaResolutionSupport#testArrayOfMapsChangeValueType, we hit the following issue:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
2023-05-16T01:46:35.0110639Z  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:600)
2023-05-16T01:46:35.0110882Z  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:589)
2023-05-16T01:46:35.0111237Z  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
2023-05-16T01:46:35.0111621Z  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
2023-05-16T01:46:35.0111933Z  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2023-05-16T01:46:35.0112206Z  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2023-05-16T01:46:35.0112432Z  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
2023-05-16T01:46:35.0112618Z  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
2023-05-16T01:46:35.0112814Z  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
2023-05-16T01:46:35.0113021Z  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2023-05-16T01:46:35.0113220Z  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
2023-05-16T01:46:35.0113374Z  at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
2023-05-16T01:46:35.0113580Z  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)

I think we can get past this by disabling both the vectorized reader and code generation, but I do not think these are acceptable workarounds. We would like to get your thoughts on this, and would be happy to sync offline at some point to share our findings.
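For reference, the two workarounds mentioned above correspond to standard Spark SQL settings. The following is a minimal sketch, assuming a local test session; the app name and master URL are placeholders, and whether these settings actually avoid the ClassCastException on Spark 3.4.0 has not been verified here.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a local session with both workarounds applied.
// "hudi-spark34-workaround" and "local[2]" are placeholder values.
val spark = SparkSession.builder()
  .appName("hudi-spark34-workaround")
  .master("local[2]")
  // Read Parquet row by row instead of producing ColumnarBatch instances.
  .config("spark.sql.parquet.enableVectorizedReader", "false")
  // Turn off whole-stage code generation.
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()
```

Both settings trade read performance for the row-based path, which is presumably why they are not considered acceptable as a permanent fix.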

yihua · 2023-06-09T18:33:18Z

Closing this. Hudi support for Spark 3.4.0 on the master branch is in #8885.

mansipp added 4 commits May 10, 2023 13:47

Spark 3.4.0 upgrade

5369ed0

spark 3.4 profile

0d87576

bot.yml changes

c23f6ed

bot.yml flink test change

d3756a6

danny0405 reviewed May 11, 2023

mansipp changed the title ~~[HUDI-6198] Spark 3.4.0 Upgrade~~ [DO NOT MERGE] [HUDI-6198] Run gh actions with Spark 3.4.0 May 11, 2023

rahil-c and others added 3 commits May 16, 2023 11:58

[HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 (apache#8082)

7db547f

* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

Temp change

81109c4

added functinal test logic to bot.yml

7cc4d1b

mansipp force-pushed the mansipp/spark340-upgarde-oss branch from 28c025a to 7cc4d1b Compare May 16, 2023 19:02

run functional tests

373dd9f

mansipp force-pushed the mansipp/spark340-upgarde-oss branch from 0846e31 to 373dd9f Compare May 17, 2023 17:09

Stop sparkSession after each test in TestHiveTableSchemaEvolution class

c910f83

mansipp force-pushed the mansipp/spark340-upgarde-oss branch 2 times, most recently from 504d795 to e53004e Compare May 17, 2023 20:03

Run both functional and unit tests

19a907a

mansipp force-pushed the mansipp/spark340-upgarde-oss branch from e53004e to 19a907a Compare May 17, 2023 20:04

yihua closed this Jun 9, 2023

hudi-bot mentioned this pull request Dec 9, 2025

Support Spark 3.4.0 #15948

Closed

5 participants
