Contributor

@mansipp mansipp commented May 10, 2023

Change Logs

Changes to support Spark 3.4.0

Impact

Upgrade Spark to 3.4.0

Risk level (write none, low, medium, or high below)

Low

Documentation Update

Need doc update

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed


  - scalaProfile: "scala-2.12"
-   sparkProfile: "spark3.3"
+   sparkProfile: "spark3.4"
Contributor

Why do we have so many changes?

@mansipp mansipp changed the title from "[HUDI-6198] Spark 3.4.0 Upgrade" to "[DO NOT MERGE] [HUDI-6198] Run gh actions with Spark 3.4.0" May 11, 2023
Contributor Author

mansipp commented May 15, 2023

@wzx140 @danny0405
Can you please provide more information on how record size is measured? There are some unit test failures.
Unit tests: https://github.com/apache/hudi/actions/runs/4962970537/jobs/8881734351?pr=8682

rahil-c and others added 3 commits May 16, 2023 11:58
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <rchertar@amazon.com>
@mansipp mansipp force-pushed the mansipp/spark340-upgarde-oss branch from 28c025a to 7cc4d1b Compare May 16, 2023 19:02
Contributor

yihua commented May 16, 2023

BTW, for bundle validation in Java CI, we need to create a new docker image based on Spark 3.4; otherwise, it won't work in Java CI. I'll upload the docker image to the Docker Hub.

@mansipp mansipp force-pushed the mansipp/spark340-upgarde-oss branch from 0846e31 to 373dd9f Compare May 17, 2023 17:09
@mansipp mansipp force-pushed the mansipp/spark340-upgarde-oss branch 2 times, most recently from 504d795 to e53004e Compare May 17, 2023 20:03
@mansipp mansipp force-pushed the mansipp/spark340-upgarde-oss branch from e53004e to 19a907a Compare May 17, 2023 20:04
@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Collaborator

rahil-c commented May 22, 2023

Hi @danny0405 @xiarixiaoyao, we are trying to upgrade Spark to 3.4.0 in Hudi. However, several functional tests are failing because of another casting exception. For example, when running the test
TestAvroSchemaResolutionSupport#testArrayOfMapsChangeValueType, we hit the following issue:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
2023-05-16T01:46:35.0110639Z  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:600)
2023-05-16T01:46:35.0110882Z  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:589)
2023-05-16T01:46:35.0111237Z  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
2023-05-16T01:46:35.0111621Z  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
2023-05-16T01:46:35.0111933Z  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2023-05-16T01:46:35.0112206Z  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2023-05-16T01:46:35.0112432Z  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
2023-05-16T01:46:35.0112618Z  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
2023-05-16T01:46:35.0112814Z  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
2023-05-16T01:46:35.0113021Z  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2023-05-16T01:46:35.0113220Z  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
2023-05-16T01:46:35.0113374Z  at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
2023-05-16T01:46:35.0113580Z  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
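
For readers unfamiliar with this kind of failure, here is a minimal, Spark-free sketch of how such an exception can arise. It assumes (this is not confirmed in the thread) that the scan treats its input iterator as an iterator of batches while the underlying reader actually yields rows. `Row`, `Batch`, and `ErasedCastDemo` are hypothetical stand-ins, not Spark or Hudi classes. Because generic type parameters are erased at runtime, the unchecked cast itself succeeds, and the failure only surfaces when an element is pulled with `next()`, which matches the top frames of the trace above.

```java
import java.util.Arrays;
import java.util.Iterator;

public class ErasedCastDemo {
    // Hypothetical stand-ins for UnsafeRow and ColumnarBatch.
    static final class Row { }
    static final class Batch {
        int numRows() { return 1; }
    }

    // The unchecked cast compiles and succeeds at runtime because the
    // element type of the iterator is erased.
    @SuppressWarnings("unchecked")
    static Iterator<Batch> asBatches(Iterator<?> source) {
        return (Iterator<Batch>) source;
    }

    static String describeFailure() {
        // The "reader" produces rows, but the consumer expects batches.
        Iterator<Batch> batches = asBatches(Arrays.asList(new Row()).iterator());
        try {
            // The compiler inserts a cast to Batch here; this is where it fails.
            Batch batch = batches.next();
            return "no failure, rows=" + batch.numRows();
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure());
    }
}
```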

I think we can get past this by disabling both the vectorized reader and code gen, but I do not think these are acceptable workarounds. I was wondering if we could get your thoughts; I would be happy to sync offline at some point to share our findings as well.
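
For reference, the workaround mentioned above would roughly correspond to the settings below. This is a sketch under assumptions: it reads "vectorized reader" as the Parquet vectorized reader (`spark.sql.parquet.enableVectorizedReader`) and "code gen" as whole-stage codegen (`spark.sql.codegen.wholeStage`), both of which are standard Spark SQL configuration keys. The class and method names are hypothetical helpers that only format the settings as `--conf` arguments; nothing here is taken from the PR itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class Spark34ScanWorkaround {
    // Settings assumed to match the workaround described in the comment.
    static Map<String, String> workaroundConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.parquet.enableVectorizedReader", "false");
        conf.put("spark.sql.codegen.wholeStage", "false");
        return conf;
    }

    // Render the settings as spark-submit style "--conf key=value" arguments.
    static String toSubmitArgs(Map<String, String> conf) {
        return conf.entrySet().stream()
                .map(e -> "--conf " + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toSubmitArgs(workaroundConf()));
    }
}
```

Turning both off forces the row-based, interpreted read path, which avoids the failing cast but gives up the performance benefits of those features, so it is only useful for narrowing down the cause.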

Contributor

yihua commented Jun 9, 2023

Closing this. Hudi support for Spark 3.4.0 on the master branch is in #8885.

@yihua yihua closed this Jun 9, 2023
@hudi-bot hudi-bot mentioned this pull request Dec 9, 2025