[WIP][HUDI-3088] Use Spark 3.2 as default Spark version #4752

XuQianJin-Stars · 2022-02-06T11:17:20Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

pom.xml

hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestKafkaOffsetGen.java

xushiyan · 2022-02-08T16:42:52Z

@hudi-bot run azure

xushiyan · 2022-02-09T08:40:59Z

hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/UtilitiesTestBase.java

      return records.stream().map(Helpers::toJsonString).toArray(String[]::new);
    }

+    public static Tuple2<String, String>[] jsonifyRecordsByPartitions(List<HoodieRecord> records, int partitions) {


java class org.apache.hudi.common.util.collection.Pair is preferred here over Tuple2

xushiyan · 2022-02-09T08:42:31Z

hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/UtilitiesTestBase.java

+    public static Tuple2<String, String>[] jsonifyRecordsByPartitions(List<HoodieRecord> records, int partitions) {
+      Tuple2<String, String>[] data = new Tuple2[records.size()];
+      for (int i = 0; i < records.size(); i++) {
+        int key = i % partitions;


Wouldn't it be better to use org.apache.hudi.common.model.HoodieKey#recordKey from each HoodieRecord?
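To make the two keying options in this comment concrete, here is a minimal sketch, not the PR's code. It assumes the first element of each pair is a per-record key, which the visible diff suggests but does not confirm. `FakeRecord` is a hypothetical stand-in for `HoodieRecord`, `java.util.Map.Entry` stands in for Hudi's `Pair`, and a `List` is returned instead of an array to avoid generic array creation.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class KeyedJsonSketch {

  /** Hypothetical stand-in for a HoodieRecord: a record key plus its JSON form. */
  static final class FakeRecord {
    final String recordKey;
    final String json;

    FakeRecord(String recordKey, String json) {
      this.recordKey = recordKey;
      this.json = json;
    }
  }

  /** Mirrors the diff: the key is the record's index modulo the partition count. */
  static List<Map.Entry<String, String>> keyByIndex(List<FakeRecord> records, int partitions) {
    List<Map.Entry<String, String>> out = new ArrayList<>(records.size());
    for (int i = 0; i < records.size(); i++) {
      String key = Integer.toString(i % partitions);
      out.add(new SimpleImmutableEntry<>(key, records.get(i).json));
    }
    return out;
  }

  /** Mirrors the review suggestion: the key is the record's own record key. */
  static List<Map.Entry<String, String>> keyByRecordKey(List<FakeRecord> records) {
    List<Map.Entry<String, String>> out = new ArrayList<>(records.size());
    for (FakeRecord record : records) {
      out.add(new SimpleImmutableEntry<>(record.recordKey, record.json));
    }
    return out;
  }

  public static void main(String[] args) {
    List<FakeRecord> records = Arrays.asList(
        new FakeRecord("key-a", "{\"id\":\"a\"}"),
        new FakeRecord("key-b", "{\"id\":\"b\"}"),
        new FakeRecord("key-c", "{\"id\":\"c\"}"));
    System.out.println(keyByIndex(records, 2));
    System.out.println(keyByRecordKey(records));
  }
}
```

The index-based variant spreads records evenly over a fixed number of keys regardless of content, while the record-key variant keeps all versions of one record under the same key. Which one the test needs depends on what it asserts about partition offsets.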

xushiyan · 2022-02-10T15:48:37Z

The first job is passing, nice progress. There was a recent fix (d971974) that might help some tests pass on Spark 3.2.

YannByron · 2022-02-16T13:43:16Z

hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java

      // So when returnNullIfNotFound is true, catch this exception.
      if (!returnNullIfNotFound) {
-        throw e;
+        LOG.warn("Failed to get nested field Value ", e);


why change this?

That should not be changed.

huberylee · 2022-03-01T05:57:06Z

@hudi-bot run azure

xushiyan · 2022-03-07T02:21:49Z

azure-pipelines.yml

  MAVEN_CACHE_FOLDER: $(Pipeline.Workspace)/.m2/repository
  MAVEN_OPTS: '-Dmaven.repo.local=$(MAVEN_CACHE_FOLDER) -Dcheckstyle.skip=true -Drat.skip=true -Djacoco.skip=true'
-  SPARK_VERSION: '2.4.4'
+  SPARK_VERSION: '3.2.0'


This needs to be upgraded to 3.2.1 to align with the version on master.

xushiyan · 2022-03-07T02:22:34Z

hudi-common/pom.xml

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
+      <version>${avro.version}</version>


do we need this? thought it inherits the same version from the root pom

xushiyan · 2022-03-07T02:23:01Z

hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java

 * Helper class to do common stuff across Avro.
 */
 public class HoodieAvroUtils {
+  private static final Logger LOG = LogManager.getLogger(HoodieAvroUtils.class);


LOG is not used. can we revert?

xushiyan · 2022-03-07T02:27:54Z

...-common/src/main/java/org/apache/hudi/common/model/debezium/AbstractDebeziumAvroPayload.java

-      Object value = record.get(DebeziumConstants.FLATTENED_OP_COL_NAME);
+      Object value;
+      if (record.getSchema().getField(DebeziumConstants.FLATTENED_OP_COL_NAME) == null) {
+        value = null;
+      } else {
+        value = record.get(DebeziumConstants.FLATTENED_OP_COL_NAME);
+      }


The new check does not look clean; it looks like whenever we try to get a column's value, we need to check its nullability? Are we able to retrieve the value without this kind of check?
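One way to address the cleanliness concern is to move the presence check into a single helper so call sites stay one-liners. The sketch below is illustrative only. `StrictRecord` is a hypothetical stand-in for a schema-backed record whose `get` rejects field names it does not know, which is an assumption about why the check in the diff was added; the thread does not state the reason. The field name `op` is also made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OptionalFieldSketch {

  /** Hypothetical stand-in for a schema-backed record that rejects unknown field names. */
  static final class StrictRecord {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    StrictRecord put(String name, Object value) {
      fields.put(name, value);
      return this;
    }

    boolean hasField(String name) {
      return fields.containsKey(name);
    }

    Object get(String name) {
      if (!fields.containsKey(name)) {
        throw new IllegalArgumentException("Not a valid field: " + name);
      }
      return fields.get(name);
    }
  }

  /** Returns the field's value, or null when the record has no such field. */
  static Object getOrNull(StrictRecord record, String fieldName) {
    return record.hasField(fieldName) ? record.get(fieldName) : null;
  }

  public static void main(String[] args) {
    StrictRecord withOp = new StrictRecord().put("op", "d");
    StrictRecord withoutOp = new StrictRecord().put("id", 1);

    // Call sites no longer repeat the if/else from the diff.
    System.out.println(getOrNull(withOp, "op"));
    System.out.println(getOrNull(withoutOp, "op"));
  }
}
```

With a helper like this, the payload code would keep a single assignment (`Object value = getOrNull(record, name)`) and the presence check would live in one place.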

xushiyan · 2022-03-07T02:30:53Z

...rc/test/java/org/apache/hudi/common/model/TestOverwriteNonDefaultsWithLatestAvroPayload.java

-            new Schema.Field("id", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-            new Schema.Field("name", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-            new Schema.Field("age", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
-            new Schema.Field("job", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE)
+            new Schema.Field("id", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+            new Schema.Field("name", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+            new Schema.Field("age", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
+            new Schema.Field("job", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null)


we should prefer to use org.apache.avro.Schema.Field#NULL_DEFAULT_VALUE whenever we want to set default as null

xushiyan · 2022-03-07T02:36:52Z

hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestOrcBootstrap.java

+  @Disabled
  @Test
  public void testMetadataBootstrapNonpartitionedCOW() throws Exception {
    testBootstrapCommon(false, false, EffectiveMode.METADATA_BOOTSTRAP_MODE);
  }


you can disable for the whole class instead of annotating each method, and add a note in the annotation like @Disabled("<describe the reason>")

xushiyan · 2022-03-07T02:39:24Z

...rk-datasource/hudi-spark3/src/test/java/org/apache/hudi/spark3/internal/TestReflectUtil.java

-        scala.collection.immutable.List.empty(),
+        JavaConverters.collectionAsScalaIterableConverter(new ArrayList<String>()).asScala().toSeq(),


So you need a Seq; why not scala.collection.immutable.List.empty().toSeq()?

xushiyan · 2022-03-07T02:40:21Z

hudi-utilities/pom.xml

+    <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro</artifactId>
+      <version>1.8.2</version>
+      <scope>provided</scope>
+    </dependency>


I have a concern about the impact here: why do we set an old version just for utilities?

xushiyan · 2022-03-07T02:41:07Z

hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java

    cleanupResources();
  }

+  @Disabled


It would be helpful to set a reason.

xushiyan · 2022-03-07T02:42:01Z

packaging/hudi-presto-bundle/pom.xml

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
+      <version>${avro.version}</version>


same avro version question

hudi-bot · 2022-03-16T08:47:09Z

CI report:

e04440e UNKNOWN
5562250 UNKNOWN
935d2ad UNKNOWN
6586874 UNKNOWN
8ff60fa UNKNOWN
e1ad8d9 UNKNOWN
7b9e92c UNKNOWN
2b87e93 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua · 2022-04-27T02:15:55Z

Closing this one, in favor of #5402 which is going to have all changes for Hadoop/Hive/Spark 3.x upgrades.

XuQianJin-Stars changed the title ~~[HUDI-3088] Use Spark 3.2 as default Spark version~~ [WIP][HUDI-3088] Use Spark 3.2 as default Spark version Feb 6, 2022

nsivabalan added the priority:critical Production degraded; pipelines stalled label Feb 8, 2022

nsivabalan assigned xushiyan Feb 8, 2022

nsivabalan mentioned this pull request Feb 8, 2022

[HUDI-3088] Use Spark 3.2 as default Spark version #4710

Closed

xushiyan reviewed Feb 8, 2022

pom.xml (outdated, resolved)

hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestKafkaOffsetGen.java (outdated, resolved)

nsivabalan added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Feb 8, 2022

apache deleted a comment from hudi-bot Feb 8, 2022

xushiyan force-pushed the HUDI-3088 branch from 29db570 to aec0210 Compare February 9, 2022 06:27

xushiyan reviewed Feb 9, 2022

XuQianJin-Stars force-pushed the HUDI-3088 branch from e25be08 to e7933bc Compare February 10, 2022 04:09

XuQianJin-Stars force-pushed the HUDI-3088 branch from 6677a97 to 097c4d8 Compare February 11, 2022 02:12

XuQianJin-Stars requested a review from xushiyan February 14, 2022 02:52

YannByron reviewed Feb 16, 2022

YannByron mentioned this pull request Feb 16, 2022

[HUDI-3423] upgrade spark to 3.2.1 #4815

Merged

5 tasks

XuQianJin-Stars force-pushed the HUDI-3088 branch from 631437f to a2f0214 Compare February 23, 2022 02:19

apache deleted a comment from hudi-bot Feb 24, 2022

apache deleted a comment from hudi-bot Mar 6, 2022

XuQianJin-Stars force-pushed the HUDI-3088 branch from a1cb5d3 to e0b8ffa Compare March 7, 2022 01:28

xushiyan reviewed Mar 7, 2022

XuQianJin-Stars force-pushed the HUDI-3088 branch from e0b8ffa to f89956b Compare March 8, 2022 23:13

XuQianJin-Stars mentioned this pull request Mar 9, 2022

[WIP][HUDI-3218] Upgrading avro to 1.10.2 #4566

Closed

5 tasks

XuQianJin-Stars force-pushed the HUDI-3088 branch 3 times, most recently from 7b9e92c to 76a681e Compare March 10, 2022 02:19

[HUDI-3088] Use Spark 3.2 as default Spark version

2b87e93

XuQianJin-Stars force-pushed the HUDI-3088 branch from 95832e1 to 2b87e93 Compare March 16, 2022 06:20

yihua added priority:critical Production degraded; pipelines stalled and removed priority:blocker Production down; release blocker labels Mar 31, 2022

yihua closed this Apr 27, 2022

rahil-c mentioned this pull request Jun 16, 2022

[HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase) #5786

Closed

5 tasks

rahil-c mentioned this pull request Jul 21, 2022

[HUDI-4429] Make Spark 3.1 the default profile #6151

Closed

5 tasks


7 participants
