
Conversation

rahil-c (Collaborator) commented Jul 20, 2022

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@@ -0,0 +1,19 @@
docker build base -t apachehudi/hudi-hadoop_2.8.4-base
rahil-c (Collaborator Author):

Have to upload the final Docker images to the apachehudi Docker Hub.

@rahil-c rahil-c changed the title Rahil c/spark3.1 profile clone [HUDI-4429] Make Spark3.1 the default profile clone Jul 20, 2022
@rahil-c rahil-c changed the title [HUDI-4429] Make Spark3.1 the default profile clone [HUDI-4429] Make Spark3.1 the default profile Jul 20, 2022
rahil-c (Collaborator Author) commented Jul 20, 2022

@yihua yihua self-assigned this Jul 20, 2022
@yihua yihua added the dependencies (Dependency updates), priority:blocker (Production down; release blocker), and engine:spark (Spark integration) labels Jul 20, 2022
@apache apache deleted a comment from hudi-bot Jul 21, 2022
  - job: UT_FT_1
    displayName: UT FT common & flink & UT client/spark-client
-   timeoutInMinutes: '120'
+   timeoutInMinutes: '150'
Contributor:

Could you revert the unnecessary timeout change?

rahil-c (Collaborator Author):

I think in general I've been seeing the Azure CI go over the 120-minute timeout (outside of this PR). I can revert these changes, but would it be safer to keep them? Or is this more of a concern about resource usage for the Azure CI in general?

Contributor:

I see the successful CI runs finished within 2 hours, so there is no need to increase the timeout. We can always retry failed jobs.

Comment on lines 97 to 98
continueOnError: true
retryCountOnTaskFailure: 1
Contributor:

Remove this and all similar changes?

rahil-c (Collaborator Author):

I still think that having continueOnError and retryCountOnTaskFailure is useful; otherwise people have to keep re-triggering the Azure CI to see the next set of failures, or, if there's some Azure agent connection issue, rerun the build, which also queues up many builds.

Contributor:

Understood. What you state only applies to your PR, which affects most tests. For other PRs, it's good to fail early on legitimate test errors so that the CI resources can be used for other PRs.

  - job: UT_FT_2
    displayName: FT client/spark-client
-   timeoutInMinutes: '120'
+   timeoutInMinutes: '150'
Contributor:

Similar here and below.

options: $(MVN_OPTS_INSTALL) -Pintegration-tests
publishJUnitResults: false
jdkVersionOption: '1.8'
- task: Maven@3
Contributor:

Instead of deleting this, could you add a property to disable this task? cc @xushiyan for help.

rahil-c (Collaborator Author):

I believe if we add a condition and set it to false, it should disable this section (https://docs.microsoft.com/en-us/azure/devops/pipelines/process/tasks?view=azure-devops&tabs=yaml). Will give it a try.
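A minimal sketch of that approach, assuming a hypothetical RUN_IT_TESTS pipeline variable (the variable name is illustrative, not from this PR; the task inputs mirror the snippet under discussion):

    # Gate the task on a pipeline variable; setting the variable to 'false'
    # (or using condition: false) disables the task without deleting it.
    - task: Maven@3
      displayName: IT modules
      condition: eq(variables['RUN_IT_TESTS'], 'true')
      inputs:
        mavenPomFile: 'pom.xml'
        options: $(MVN_OPTS_INSTALL) -Pintegration-tests
        publishJUnitResults: false
        jdkVersionOption: '1.8'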

Contributor:

Then let's comment these lines out without deleting them, as a reminder to re-enable them later.

- ${HUDI_WS}:/var/hoodie/ws

adhoc-2:
image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
Contributor:

If the images are finalized, let's upload them to the apachehudi Docker account and change the reference here.

rahil-c (Collaborator Author):

For the apachehudi account, do I need special access to upload images? Or is there a simple way to transfer the images from my account to the apachehudi account?

Contributor:

I have the permission and I'll upload the images myself.

* NOTE: This class is invariant of the underlying file-format of the files being read
*/
public class HoodieCopyOnWriteTableInputFormat extends HoodieTableInputFormat {
private static final Logger LOG = LogManager.getLogger(HoodieCopyOnWriteTableInputFormat.class);
Contributor:

Is this still needed?

rahil-c (Collaborator Author):

This actually got merged now in #6161, so if I rebase it will basically be the same; it is needed in general.

  scala=$scala
  fi
- echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:2.4.4 \
+ echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:3.1.3 \
Contributor:

--packages org.apache.spark:spark-avro_${scala}:3.1.3 \ is no longer needed. We should delete that.

rahil-c (Collaborator Author):

Will remove.

Comment on lines +61 to +66
<exclusions>
<exclusion>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
</exclusion>
</exclusions>
Contributor:

Will this break ORC support in Spark and Hudi?

rahil-c (Collaborator Author):

I'm not sure if this is more of a test issue or a production issue. For example, I've seen ORC-related tests fail with this dependency conflict:

java.lang.NoSuchMethodError: org.apache.orc.TypeDescription.createRowBatch(I)Lorg/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch;

My understanding is that it comes down to the Hive 2 ORC (https://github.com/apache/hive/blob/rel/release-2.3.1/pom.xml) and the Spark 3 ORC (https://github.com/apache/spark/blob/v3.1.3/pom.xml#L139) being different versions that don't work well together.

In the original Spark 3.2 PR #4752 the same ORC issues were present, and we made a call then to disable the ORC-related tests.

rahil-c (Collaborator Author):

cc @xushiyan are you familiar with the specifics of this conflict?

Contributor:

Got it, could you manually verify that the ORC format still works with the Spark bundle?

Comment on lines +248 to +257
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
Contributor:

No point adding this since all artifacts are excluded?

rahil-c (Collaborator Author):

I can try removing it again and testing without it, but I think for some reason this helped resolve some test failures in this module.

rahil-c (Collaborator Author), Jul 22, 2022:

Removing this dependency seems to cause failures:

[ERROR] testBuildHiveSyncConfig{boolean}[1]  Time elapsed: 0.017 s  <<< ERROR!
java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveExternalCatalog
	at org.apache.hudi.DataSourceUtils.buildHiveSyncConfig(DataSourceUtils.java:322)
	at org.apache.hudi.TestDataSourceUtils.testBuildHiveSyncConfig(TestDataSourceUtils.java:261)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)

For now opting to keep this dependency.

Comment on lines +98 to +102
<include>com.fasterxml.jackson.core:jackson-annotations</include>
<include>com.fasterxml.jackson.core:jackson-core</include>
<include>com.fasterxml.jackson.core:jackson-databind</include>
<include>com.fasterxml.jackson.dataformat:jackson-dataformat-yaml</include>
<include>com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}</include>
Contributor:

wondering why we add this?

rahil-c (Collaborator Author):

When running the IT tests with Spark 3, I was running into the dependency conflict below:

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.JsonMappingException.<init>(Ljava/io/Closeable;Ljava/lang/String;)V
	at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
	at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
	at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:17)
	at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:718)
	at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
	at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
	at org.apache.spark.SparkContext.withScope(SparkContext.scala:792)
	at org.apache.spark.SparkContext.parallelize(SparkContext.scala:809)
	at org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:136)
	at HoodieJavaApp.run(HoodieJavaApp.java:141)
	at HoodieJavaApp.main(HoodieJavaApp.java:111)

In the hudi-spark pom we define the following dependency, which should ideally provide the class and avoid this error:

    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
      <version>${fasterxml.jackson.module.scala.version}</version>
    </dependency>

This jackson-module-scala artifact pulls in several Jackson dependencies such as jackson-databind. From the Maven logs, however, it seems it was not getting included in the bundle in several places and was being excluded. So, in order to get past this conflict, I added it to the bundle.

[INFO] Excluding com.fasterxml.jackson.module:jackson-module-scala_2.12:jar:2.10.0 from the shaded jar.

Contributor:

Is this for fixing testing only? We should avoid introducing new changes for production code and bundling. If really necessary, could you add these to test scope only or integ-test-bundle?
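A minimal sketch of the test-scope alternative being suggested (illustrative only; it reuses the coordinates and version property from the snippet above):

    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
      <version>${fasterxml.jackson.module.scala.version}</version>
      <!-- test scope keeps this visible to tests without changing the bundles -->
      <scope>test</scope>
    </dependency>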

@yihua yihua changed the title [HUDI-4429] Make Spark3.1 the default profile [HUDI-4429] Make Spark 3.1 the default profile Jul 22, 2022
@rahil-c rahil-c force-pushed the rahil-c/spark3.1-profile-clone branch from e90c823 to c25f0b8 Compare July 22, 2022 22:01
rahil-c (Collaborator Author) commented Jul 22, 2022

Not sure why the Java CI is complaining about the logger, since this got merged: https://github.com/apache/hudi/pull/6161/checks

@rahil-c rahil-c force-pushed the rahil-c/spark3.1-profile-clone branch from a4c2f0b to 0d73313 Compare July 23, 2022 01:09
Comment on lines +335 to +340
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
Contributor:

Check whether this affects Spark bundle.

return AvroOrcUtils.createAvroSchemaWithDefaultValue(orcSchema, "test_orc_record", null, true);
}

@Disabled("Disable due to hive's orc conflict.")
Contributor:

Could we re-enable these tests under Spark 2.4 in GitHub CI?

Contributor:

Maybe we can add a @Tag like Spark2_4only for this class.
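A minimal sketch of that idea (JUnit 5's @Tag combined with Surefire's groups property; the tag value and class name are illustrative, not from this PR):

    import org.junit.jupiter.api.Tag;
    import org.junit.jupiter.api.Test;

    // Hypothetical: tag the ORC tests so a Spark 2.4-only GitHub CI job can
    // select them with -Dgroups=Spark2_4only, while the default profile
    // excludes them with -DexcludedGroups=Spark2_4only.
    @Tag("Spark2_4only")
    public class TestOrcReadWrite {
      @Test
      void readOrcSchema() {
        // ... existing ORC test body ...
      }
    }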

Comment on lines +230 to +233
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
Contributor:

Could you clarify if this is needed? Any implication on Spark bundle (e.g., missing Hadoop-related classes)? Is this for test only?

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

@Disabled
Contributor:

If this is due to the multiple-SparkContext exception, we should rewrite this test with SparkClientFunctionalTestHarness to avoid initializing the Spark context again, which should fix the tests.
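A rough sketch of that suggestion, assuming the harness exposes the shared session/context via spark() and jsc() (the test class and method names are illustrative):

    import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
    import org.junit.jupiter.api.Test;

    // Extending the harness reuses one shared Spark context across tests,
    // avoiding the "multiple SparkContexts" error from re-initialization.
    public class TestWithSharedSparkContext extends SparkClientFunctionalTestHarness {
      @Test
      void exampleTest() {
        // use spark() / jsc() here instead of building a new SparkSession
      }
    }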


@xushiyan xushiyan added the priority:critical (Production degraded; pipelines stalled) label and removed the priority:blocker (Production down; release blocker) label Jul 23, 2022
hudi-bot (Collaborator):

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

yihua (Contributor) commented Jul 23, 2022

As we discussed, there is a risk of landing this if there are any changes on the bundles at this point. Before landing the PR:
(1) we should try to avoid any dependency change for production code and bundling. Adjusting dependencies for tests is OK and should be limited to tests only. We shouldn't change the compile pom merely to fix tests.
(2) for any disabled tests in Azure CI, try to find a way to run them in GitHub CI to maintain the coverage.
(3) make sure root pom changes for switching profiles do not change any behavior for building all bundles.

xushiyan (Member):

@rahil-c close this? Are we going to use Spark 3.2 or 3.3 as the default?

bvaradar (Contributor):

@yihua: should this PR be closed in light of #6117?

yihua (Contributor) commented Mar 30, 2023

> @yihua: should this PR be closed in light of #6117?

There were blockers to making Spark 3.2 the default profile, while making Spark 3.1 the default was more tangible. @rahil-c is that still true? If so, we can close this.

xushiyan (Member):

The last status of this work is captured in #7327.

I'll close this one in favor of that.

@xushiyan xushiyan closed this Mar 30, 2023

Labels

  • big-needle-movers
  • dependencies (Dependency updates)
  • engine:spark (Spark integration)
  • priority:critical (Production degraded; pipelines stalled)

Projects

Status: 🏁 Triaged
Status: ✅ Done
