
Conversation

rahil-c (Collaborator) commented Jul 20, 2022

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@@ -0,0 +1,19 @@
docker build base -t apachehudi/hudi-hadoop_2.8.4-base
rahil-c (Collaborator Author):

Have to upload the final Docker images to the apachehudi Docker Hub.

@rahil-c rahil-c changed the title Rahil c/spark3.1 profile clone [HUDI-4429] Make Spark3.1 the default profile clone Jul 20, 2022
@rahil-c rahil-c changed the title [HUDI-4429] Make Spark3.1 the default profile clone [HUDI-4429] Make Spark3.1 the default profile Jul 20, 2022
rahil-c (Collaborator Author) commented Jul 20, 2022

@yihua yihua self-assigned this Jul 20, 2022
@yihua yihua added the dependencies (Dependency updates), priority:blocker (Production down; release blocker), and engine:spark (Spark integration) labels Jul 20, 2022
@apache apache deleted a comment from hudi-bot Jul 21, 2022
  - job: UT_FT_1
    displayName: UT FT common & flink & UT client/spark-client
-   timeoutInMinutes: '120'
+   timeoutInMinutes: '150'
Contributor:

Could you revert the unnecessary timeout change?

rahil-c (Collaborator Author):

I think in general I've been seeing the Azure CI go over the 120-minute timeout (outside of this PR). I can revert these changes, but would it be safer to keep them? Or is this more of a concern about resource usage for the Azure CI in general?

Contributor:

I see the successful CI runs finished within 2 hours, so there is no need to increase the timeout. We can always retry failed jobs.

Comment on lines 97 to 98
continueOnError: true
retryCountOnTaskFailure: 1
Contributor:

Remove this and all similar changes?

rahil-c (Collaborator Author):

I still think that having continueOnError and retryCountOnTaskFailure is useful; otherwise people have to keep re-triggering the Azure CI to see the next set of failures, or, if there's some Azure agent connection issue, rerun the build, which also queues up many builds.

Contributor:

Understood. What you state only applies to your PR, which affects most tests. For other PRs, it's good to fail early on legitimate test errors so that the CI resources can be used for other PRs.

  - job: UT_FT_2
    displayName: FT client/spark-client
-   timeoutInMinutes: '120'
+   timeoutInMinutes: '150'
Contributor:

Similar here and below.

options: $(MVN_OPTS_INSTALL) -Pintegration-tests
publishJUnitResults: false
jdkVersionOption: '1.8'
- task: Maven@3
Contributor:

Instead of deleting this, could you add a property to disable this task? cc @xushiyan for help.

rahil-c (Collaborator Author):

I believe if we add a condition and set it to false, it should disable this section (https://docs.microsoft.com/en-us/azure/devops/pipelines/process/tasks?view=azure-devops&tabs=yaml). Will give it a try.
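A minimal sketch of that approach, assuming a hypothetical RUN_IT_TESTS pipeline variable (the variable name is illustrative, not from this PR; the task inputs mirror the snippet under discussion):

    # Gate the task on a pipeline variable; setting the variable to 'false'
    # (or using condition: false) disables the task without deleting it.
    - task: Maven@3
      displayName: IT modules
      condition: eq(variables['RUN_IT_TESTS'], 'true')
      inputs:
        mavenPomFile: 'pom.xml'
        options: $(MVN_OPTS_INSTALL) -Pintegration-tests
        publishJUnitResults: false
        jdkVersionOption: '1.8'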

Contributor:

Then let's comment these lines out without deleting them, as a reminder to re-enable them later.

- ${HUDI_WS}:/var/hoodie/ws

adhoc-2:
image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
Contributor:

If the images are finalized, let's upload them to the apachehudi Docker account and change the reference here.

rahil-c (Collaborator Author):

For the apachehudi account, do I need special access to upload images? Or is there a simple way to transfer the images from my account to the apachehudi account?

Contributor:

I have the permission and I'll upload the images myself.

* NOTE: This class is invariant of the underlying file-format of the files being read
*/
public class HoodieCopyOnWriteTableInputFormat extends HoodieTableInputFormat {
private static final Logger LOG = LogManager.getLogger(HoodieCopyOnWriteTableInputFormat.class);
Contributor:

Is this still needed?

rahil-c (Collaborator Author):

This actually got merged now in #6161, so if I rebase it will basically be the same; it is needed in general.

  scala=$scala
  fi
- echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:2.4.4 \
+ echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:3.1.3 \
Contributor:

--packages org.apache.spark:spark-avro_${scala}:3.1.3 \ is no longer needed. We should delete that.

rahil-c (Collaborator Author):

Will remove.

Comment on lines +61 to +66
<exclusions>
<exclusion>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
</exclusion>
</exclusions>
Contributor:

Will this break ORC support in Spark and Hudi?

rahil-c (Collaborator Author):

I'm not sure if this is more of a test issue or a production issue. For example, I've seen ORC-related tests fail with this dependency conflict:

java.lang.NoSuchMethodError: org.apache.orc.TypeDescription.createRowBatch(I)Lorg/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch;

My understanding is that it comes down to the Hive 2 ORC (https://github.com/apache/hive/blob/rel/release-2.3.1/pom.xml) and the Spark 3 ORC (https://github.com/apache/spark/blob/v3.1.3/pom.xml#L139) being different versions that don't work well together.

In the original Spark 3.2 PR #4752 the same ORC issues were present, and we made a call then to disable the ORC-related tests.

rahil-c (Collaborator Author):

cc @xushiyan are you familiar with the specifics of this conflict?

Contributor:

Got it, could you manually verify that the ORC format still works with the Spark bundle?

Comment on lines +248 to +257
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
Contributor:

No point adding this since all artifacts are excluded?

rahil-c (Collaborator Author):

I can try removing it again and testing without it, but I think for some reason this helped resolve some test failures in this module.

rahil-c (Collaborator Author), Jul 22, 2022:

Removing this dependency seems to cause failures:

[ERROR] testBuildHiveSyncConfig{boolean}[1]  Time elapsed: 0.017 s  <<< ERROR!
java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveExternalCatalog
	at org.apache.hudi.DataSourceUtils.buildHiveSyncConfig(DataSourceUtils.java:322)
	at org.apache.hudi.TestDataSourceUtils.testBuildHiveSyncConfig(TestDataSourceUtils.java:261)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)

For now opting to keep this dependency.

Comment on lines +98 to +102
<include>com.fasterxml.jackson.core:jackson-annotations</include>
<include>com.fasterxml.jackson.core:jackson-core</include>
<include>com.fasterxml.jackson.core:jackson-databind</include>
<include>com.fasterxml.jackson.dataformat:jackson-dataformat-yaml</include>
<include>com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}</include>
Contributor:

wondering why we add this?

rahil-c (Collaborator Author):

When running the IT tests with Spark 3, I was running into the dependency conflict below:

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.JsonMappingException.<init>(Ljava/io/Closeable;Ljava/lang/String;)V
	at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
	at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
	at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:17)
	at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:718)
	at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
	at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
	at org.apache.spark.SparkContext.withScope(SparkContext.scala:792)
	at org.apache.spark.SparkContext.parallelize(SparkContext.scala:809)
	at org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:136)
	at HoodieJavaApp.run(HoodieJavaApp.java:141)
	at HoodieJavaApp.main(HoodieJavaApp.java:111)

In the hudi-spark pom we define the following dependency, which should ideally provide the class and avoid this error:

    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
      <version>${fasterxml.jackson.module.scala.version}</version>
    </dependency>

This jackson-module-scala artifact pulls in several Jackson dependencies such as jackson-databind. From the Maven logs, however, it seems it was not getting included in the bundle in several places and was being excluded. So, in order to get past this conflict, I added it to the bundle.

[INFO] Excluding com.fasterxml.jackson.module:jackson-module-scala_2.12:jar:2.10.0 from the shaded jar.

Contributor:

Is this for fixing testing only? We should avoid introducing new changes for production code and bundling. If really necessary, could you add these to test scope only or integ-test-bundle?
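A minimal sketch of the test-scope alternative being suggested (illustrative only; it reuses the coordinates and version property from the snippet above):

    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
      <version>${fasterxml.jackson.module.scala.version}</version>
      <!-- test scope keeps this visible to tests without changing the bundles -->
      <scope>test</scope>
    </dependency>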

@yihua yihua changed the title [HUDI-4429] Make Spark3.1 the default profile [HUDI-4429] Make Spark 3.1 the default profile Jul 22, 2022
@rahil-c rahil-c force-pushed the rahil-c/spark3.1-profile-clone branch from e90c823 to c25f0b8 Compare July 22, 2022 22:01
rahil-c (Collaborator Author) commented Jul 22, 2022

Not sure why the Java CI is complaining about the logger, since this got merged: https://github.com/apache/hudi/pull/6161/checks

@rahil-c rahil-c force-pushed the rahil-c/spark3.1-profile-clone branch from a4c2f0b to 0d73313 Compare July 23, 2022 01:09
Comment on lines +335 to +340
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
Contributor:

Check whether this affects Spark bundle.

return AvroOrcUtils.createAvroSchemaWithDefaultValue(orcSchema, "test_orc_record", null, true);
}

@Disabled("Disable due to hive's orc conflict.")
Contributor:

Could we re-enable these tests under Spark 2.4 in GitHub CI?

Contributor:

Maybe we can add a @Tag like Spark2_4only for this class.
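A minimal sketch of that idea (JUnit 5's @Tag combined with Surefire's groups property; the tag value and class name are illustrative, not from this PR):

    import org.junit.jupiter.api.Tag;
    import org.junit.jupiter.api.Test;

    // Hypothetical: tag the ORC tests so a Spark 2.4-only GitHub CI job can
    // select them with -Dgroups=Spark2_4only, while the default profile
    // excludes them with -DexcludedGroups=Spark2_4only.
    @Tag("Spark2_4only")
    public class TestOrcReadWrite {
      @Test
      void readOrcSchema() {
        // ... existing ORC test body ...
      }
    }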

Comment on lines +230 to +233
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
Contributor:

Could you clarify if this is needed? Any implication on Spark bundle (e.g., missing Hadoop-related classes)? Is this for test only?

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

@Disabled
Contributor:

If this is due to the multiple-SparkContext exception, we should rewrite this test with SparkClientFunctionalTestHarness to avoid initializing the Spark context again, which should fix the tests.
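A rough sketch of that suggestion, assuming the harness exposes the shared session/context via spark() and jsc() (the test class and method names are illustrative):

    import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
    import org.junit.jupiter.api.Test;

    // Extending the harness reuses one shared Spark context across tests,
    // avoiding the "multiple SparkContexts" error from re-initialization.
    public class TestWithSharedSparkContext extends SparkClientFunctionalTestHarness {
      @Test
      void exampleTest() {
        // use spark() / jsc() here instead of building a new SparkSession
      }
    }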


@xushiyan xushiyan added the priority:critical (Production degraded; pipelines stalled) label and removed the priority:blocker (Production down; release blocker) label Jul 23, 2022
hudi-bot (Collaborator):

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

yihua (Contributor) commented Jul 23, 2022

As we discussed, there is a risk of landing this if there are any changes on the bundles at this point. Before landing the PR:
(1) we should try to avoid any dependency change for production code and bundling. Adjusting dependencies for tests is OK and should be limited to tests only. We shouldn't change the compile pom merely to fix tests.
(2) for any disabled tests in Azure CI, try to find a way to run them in GitHub CI to maintain the coverage.
(3) make sure root pom changes for switching profiles do not change any behavior for building all bundles.

xushiyan (Member):

@rahil-c close this? Are we going to use Spark 3.2 or 3.3 as the default?

bvaradar (Contributor):

@yihua: should this PR be closed in light of #6117?

yihua (Contributor) commented Mar 30, 2023

> @yihua: should this PR be closed in light of #6117?

There were blockers to making Spark 3.2 the default profile, while making Spark 3.1 the default was more tangible. @rahil-c is that still true? If so, we can close this.

xushiyan (Member):

The last status of this work is captured in #7327.

I'll close this one in favor of that.

@xushiyan xushiyan closed this Mar 30, 2023

Labels

  • big-needle-movers
  • dependencies (Dependency updates)
  • engine:spark (Spark integration)
  • priority:critical (Production degraded; pipelines stalled)

Projects

Status: 🏁 Triaged
Status: ✅ Done
