[HUDI-3088] Use Spark 3.2 as default Spark version #8445
Conversation
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
    </exclusion>
  </exclusions>
</dependency>
Removing these won't affect the cli/spark/utilities bundles, since Spark will be provided at runtime.
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<exclusions>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
  </exclusion>
  <exclusion>
ditto
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
  </exclusion>
</exclusions>
</dependency>
ditto
- script: |
    grep "testcase" */target/surefire-reports/*.xml */*/target/surefire-reports/*.xml | awk -F'"' ' { print $6,$4,$2 } ' | sort -nr | head -n 100
  displayName: Top 100 long-running testcases
- job: IT
IT will still run with Spark 2.4 as per the docker demo setup, and has been moved to GH Actions.
@Test
public void testWriteReadWithEvolvedSchema() throws Exception {
// Disable the test with evolved schema for HFile since it's not supported
public void testWriteReadWithEvolvedSchema(String evolvedSchemaPath) throws Exception {
Should we just remove it for now? There's already a tracking JIRA.
It's disabled with a message, which should be a good reminder there. A sketch of the pattern is below.
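For reference, a minimal JUnit 5 sketch of that pattern (hypothetical class name and schema path, not the actual Hudi test): skipping the unsupported HFile case with an explicit message keeps the reminder visible in the test report instead of silently dropping coverage.

import static org.junit.jupiter.api.Assumptions.assumeTrue;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class EvolvedSchemaSkipSketch {

  @ParameterizedTest
  @ValueSource(strings = {"/evolved.avsc"}) // hypothetical schema path
  void testWriteReadWithEvolvedSchema(String evolvedSchemaPath) {
    // HFile does not support evolved-schema reads yet (tracked in a JIRA), so the
    // case is skipped with a reason that shows up in the surefire report.
    boolean hfileUnderTest = true; // stand-in for the real file-format check
    assumeTrue(!hfileUnderTest, "Disabled for HFile: evolved schema is not supported");
    // ... write/read assertions for the supported formats would follow here ...
  }
}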
// Initialize HbaseMiniCluster
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
why do we need these configs in this PR?
This is caused by the Spark version upgrade, which also changed the ZooKeeper version.
Trying to run it in CI without these; locally it seems fine.
"zookeeper.4lw.commands.whitelist", "*" will be required to start the test service properly with new zookeeper version used. added it back
// set env and directly in order to handle static init/gc issues
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
Same question as above. If it's a test issue, then maybe add it in a separate PR?
hudi-integ-test/pom.xml (outdated)
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>${parquet.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>${parquet.version}</version>
  <scope>test</scope>
</dependency>
Why are these dependencies needed here? Also, if we run integ tests with hudi-spark3.x-bundle, wouldn't they already be present on the classpath?
Maybe parquet-hadoop is not required; I'll have to re-run to confirm.
This actually affects Flink's IT. Considering these are only test deps, I did not dig further.
2023-05-24T08:46:49.7944885Z [ERROR] Tests run: 128, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 1,026.49 s <<< FAILURE! - in org.apache.hudi.table.ITTestHoodieDataSource
2023-05-24T08:46:49.7945946Z [ERROR] testUpdateDelete{String, HoodieTableType}[1] Time elapsed: 2.196 s <<< ERROR!
2023-05-24T08:46:49.7947263Z org.apache.flink.table.api.TableException: Unsupported query: update t1 set age=18 where uuid in('id1', 'id2')
2023-05-24T08:46:49.7947914Z at org.apache.hudi.table.ITTestHoodieDataSource.execInsertSql(ITTestHoodieDataSource.java:2095)
2023-05-24T08:46:49.7948557Z at org.apache.hudi.table.ITTestHoodieDataSource.testUpdateDelete(ITTestHoodieDataSource.java:1971)
This actually affects Flink's IT. Considering these are only test deps, I did not dig further.
Scratch that: this is caused by a newly added test case that only runs with Flink 1.17.
@MethodSource("bulkInsertTypeParams")
public void testDataSourceWriter(boolean populateMetaFields) throws Exception {
testDataSourceWriterInternal(Collections.EMPTY_MAP, Collections.EMPTY_MAP, populateMetaFields);
testDataSourceWriterInternal(Collections.emptyMap(), Collections.emptyMap(), populateMetaFields);
Just asking, does it make any difference? I believe both are the same thing, right?
As per the API's javadoc, using the method should be preferred:
Using this method is likely to have comparable cost to using the like-named field. (Unlike this method, the field does not provide type safety.)
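A small illustration of the type-safety point from that javadoc (class and method names here are illustrative only): the method is generic and infers the map's type parameters, while the raw EMPTY_MAP field needs an unchecked conversion.

import java.util.Collections;
import java.util.Map;

class EmptyMapSketch {

  @SuppressWarnings("unchecked")
  static void demo() {
    Map<String, String> typed = Collections.emptyMap(); // type-safe: Map<String, String> is inferred
    Map<String, String> raw = Collections.EMPTY_MAP;    // raw field: unchecked conversion warning
    System.out.println(typed.isEmpty() + " " + raw.isEmpty());
  }
}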
int numRecords = 15;
int numPartitions = 3;
int numRecords = 30;
int numPartitions = 2;
why change these values?
This has something to do with org.apache.spark.streaming.kafka010.KafkaTestUtils, which only utilizes 2 partitions (probably limited by Spark partitions due to the 1:1 mapping). Hence using more than 2 will not give the expected even number of records per partition. This can be pushed as a follow-up to improve the test setup.
Filed the improvement ticket: https://issues.apache.org/jira/browse/HUDI-6266. A back-of-the-envelope check of the new values is sketched below.
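Plain Java, no Kafka calls; the 2-partition limit is taken from the comment above. With only 2 usable partitions, 15 records can never split evenly, whereas 30 gives exactly 15 per partition.

class PartitionMathSketch {

  public static void main(String[] args) {
    int usablePartitions = 2; // what the test utility effectively provides
    for (int records : new int[] {15, 30}) {
      boolean even = records % usablePartitions == 0;
      System.out.printf("records=%d perPartition=%d evenSplit=%b%n",
          records, records / usablePartitions, even);
    }
  }
}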
assertEquals(appendList, withKafkaOffsetColumns.subList(withKafkaOffsetColumns.size() - 3, withKafkaOffsetColumns.size()));

dfNoOffsetInfo.unpersist();
dfWithOffsetInfo.unpersist();
+1
int numPartitions = 3;
int numMessages = 15;
int numPartitions = 2;
int numMessages = 30;
Same here, why change it? Is it to reduce the test time (fewer partitions, less I/O)?
<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>prepare-agent</goal>
      </goals>
This step from the jacoco plugin under the integration-tests profile caused a mysterious class-loading issue in Flink 1.17. We need to be cautious with seemingly innocuous plugins! cc @danny0405
[ERROR] testMergeOnReadInputFormatLogFileOnlyIteratorGetUnMergedLogFileIterator Time elapsed: 0.006 s <<< ERROR!
java.lang.NoClassDefFoundError: Could not initialize class org.apache.calcite.rel.metadata.DefaultRelMetadataProvider
at org.apache.hudi.table.ITTestSchemaEvolution.setUp(ITTestSchemaEvolution.java:81)
Change Logs
Make spark3.2 the default Spark profile.
Impact
Default dev profile change. Local repo setup should be updated accordingly.
Risk level
Medium.
Use bundle validate to verify artifacts.
Documentation Update
Contributor's checklist