[HUDI-4429] Make Spark 3.1 the default profile #6151
Conversation
@@ -0,0 +1,19 @@
docker build base -t apachehudi/hudi-hadoop_2.8.4-base
Have to upload the final Docker images to the apachehudi Docker Hub.
Refer to this green Azure CI run (all sections passed): https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10078&view=logs&j=47cf0f2a-901e-5ca1-f652-e53b6abbf660&t=35a68570-76b2-5f68-d601-1bf50f7fbd97
azure-pipelines.yml (outdated)
- job: UT_FT_1
  displayName: UT FT common & flink & UT client/spark-client
-  timeoutInMinutes: '120'
+  timeoutInMinutes: '150'
Could you revert the unnecessary timeout change?
In general I've been seeing the Azure CI go over the 120-minute timeout (outside of this PR). I can revert these changes, but would it be safer to keep them? Or is this more of a concern about resource usage for the Azure CI in general?
I see the successful CI runs finished within 2 hours so there is no need to increase the timeout. We can always retry failed jobs.
azure-pipelines.yml (outdated)
continueOnError: true
retryCountOnTaskFailure: 1
Could you remove this and all similar changes?
I still think that having continueOnError and retryCountOnTaskFailure is useful. Otherwise people have to keep re-triggering the Azure CI to see the next set of failures, and if there's an Azure agent connection issue they have to rerun as well, which queues up many builds.
Understood. What you describe only applies to your PR, which affects most tests. For other PRs it's better to fail early on legitimate test errors so that CI resources can be used for other PRs.
azure-pipelines.yml (outdated)
- job: UT_FT_2
  displayName: FT client/spark-client
-  timeoutInMinutes: '120'
+  timeoutInMinutes: '150'
Similar here and below.
  options: $(MVN_OPTS_INSTALL) -Pintegration-tests
  publishJUnitResults: false
  jdkVersionOption: '1.8'
- task: Maven@3
Instead of deleting this, could you add a property to disable this task? cc @xushiyan for help.
I believe if we add a condition and set it to false, it should disable this task: https://docs.microsoft.com/en-us/azure/devops/pipelines/process/tasks?view=azure-devops&tabs=yaml. Will give it a try; a rough sketch is below.
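A minimal sketch of what I mean, assuming a hypothetical RUN_IT pipeline variable (the variable name is made up and the inputs just mirror the existing task, not the actual definition):

```yaml
# Sketch: gate the IT task behind a variable instead of deleting it.
# 'RUN_IT' is a hypothetical variable name.
- task: Maven@3
  displayName: IT modules
  condition: eq(variables['RUN_IT'], 'true')  # skipped unless RUN_IT is set to 'true'
  inputs:
    mavenPomFile: 'pom.xml'
    options: $(MVN_OPTS_INSTALL) -Pintegration-tests
    publishJUnitResults: false
    jdkVersionOption: '1.8'
```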
Then let's comment these lines out instead of deleting them, as a reminder to re-enable them later.
    - ${HUDI_WS}:/var/hoodie/ws

  adhoc-2:
    image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
If the images are finalized, let's upload them to the apachehudi Docker account and change the reference here.
For the apachehudi account, do I need special access to upload images? Or is there a simple way to transfer the images from my account to the apachehudi account?
I have the permission and I'll upload the images myself.
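For reference, transferring an image is just a pull, retag, and push (a sketch; the target tag is illustrative, and the push needs write access to the apachehudi org):

```sh
# Sketch: move one of the adhoc images from the personal account to apachehudi.
docker pull rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
docker tag rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image \
  apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:latest
docker push apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:latest
```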
 * NOTE: This class is invariant of the underlying file-format of the files being read
 */
public class HoodieCopyOnWriteTableInputFormat extends HoodieTableInputFormat {
  private static final Logger LOG = LogManager.getLogger(HoodieCopyOnWriteTableInputFormat.class);
Is this still needed?
This actually got merged now in #6161, so if I rebase it will basically be the same; it's needed in general.
  scala=$scala
fi
-echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:2.4.4 \
+echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:3.1.3 \
--packages org.apache.spark:spark-avro_${scala}:3.1.3 \ is no longer needed; we should delete that line.
will remove
<exclusions>
  <exclusion>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-core</artifactId>
  </exclusion>
</exclusions>
Will this break ORC support in Spark and Hudi?
I'm not sure if this is more of a test issue or a production issue. For example, I've seen ORC-related tests fail with this dependency conflict:
java.lang.NoSuchMethodError: org.apache.orc.TypeDescription.createRowBatch(I)Lorg/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch;
My understanding is that the Hive 2 ORC (https://github.com/apache/hive/blob/rel/release-2.3.1/pom.xml) and the Spark 3 ORC (https://github.com/apache/spark/blob/v3.1.3/pom.xml#L139) are different versions that don't work well together.
In the original Spark 3.2 PR #4752 the same ORC issues were present, and we made the call then to disable the ORC-related tests.
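One way to surface the competing versions (a sketch; the module path is illustrative, and -Dverbose needs a dependency-plugin version that supports it):

```sh
# Sketch: show which orc-core versions the hive and spark dependencies each pull in.
mvn dependency:tree -pl hudi-spark-datasource/hudi-spark -am \
  -Dincludes=org.apache.orc:orc-core -Dverbose
```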
cc @xushiyan are you familiar with the specifics of this conflict?
Got it, could you manually verify if ORC format still works with Spark bundle?
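e.g. something like this in spark-shell with the bundle on the classpath (a sketch; the path and table name are made up, and I'm assuming hoodie.table.base.file.format is the switch for the base file format):

```scala
// Sketch: round-trip a tiny Hudi table with ORC base files.
import org.apache.spark.sql.SaveMode

val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as string) as name", "current_timestamp() as ts")

df.write.format("hudi")
  .option("hoodie.table.name", "orc_smoke_test")            // illustrative table name
  .option("hoodie.table.base.file.format", "ORC")           // request ORC base files
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite)
  .save("/tmp/orc_smoke_test")                              // illustrative path

spark.read.format("hudi").load("/tmp/orc_smoke_test").show()  // should round-trip if ORC works
```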
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <exclusions>
    <exclusion>
      <groupId>*</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
No point adding this since all artifacts are excluded?
I can try removing it again and testing without it, but I think for some reason this helped resolve some test failures in this module.
Removing this dependency seems to cause failures:
[ERROR] testBuildHiveSyncConfig{boolean}[1] Time elapsed: 0.017 s <<< ERROR!
java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveExternalCatalog
at org.apache.hudi.DataSourceUtils.buildHiveSyncConfig(DataSourceUtils.java:322)
at org.apache.hudi.TestDataSourceUtils.testBuildHiveSyncConfig(TestDataSourceUtils.java:261)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
For now, I'm opting to keep this dependency.
<include>com.fasterxml.jackson.core:jackson-annotations</include>
<include>com.fasterxml.jackson.core:jackson-core</include>
<include>com.fasterxml.jackson.core:jackson-databind</include>
<include>com.fasterxml.jackson.dataformat:jackson-dataformat-yaml</include>
<include>com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}</include>
Wondering why we add this?
When running the IT tests with Spark 3, I was running into the dependency conflict below:
Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.JsonMappingException.<init>(Ljava/io/Closeable;Ljava/lang/String;)V
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:17)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:718)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:792)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:809)
at org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:136)
at HoodieJavaApp.run(HoodieJavaApp.java:141)
at HoodieJavaApp.main(HoodieJavaApp.java:111)
So in the hudi-spark pom we define the following dependency, which should ideally provide the class and not result in this class-not-found:
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
  <version>${fasterxml.jackson.module.scala.version}</version>
</dependency>
This jackson-module-scala artifact pulls in several Jackson dependencies like jackson-databind, etc. From the Maven logs, however, it seems it was not getting included in the bundle and was being excluded, so in order to get past this conflict I added it to the bundle:
[INFO] Excluding com.fasterxml.jackson.module:jackson-module-scala_2.12:jar:2.10.0 from the shaded jar.
Is this for fixing testing only? We should avoid introducing new changes for production code and bundling. If really necessary, could you add these to test scope only or integ-test-bundle?
Not sure why the Java CI is complaining about the logger, since this got merged: https://github.com/apache/hudi/pull/6161/checks
<exclusions>
  <exclusion>
    <groupId>*</groupId>
    <artifactId>*</artifactId>
  </exclusion>
</exclusions>
Check whether this affects the Spark bundle.
  return AvroOrcUtils.createAvroSchemaWithDefaultValue(orcSchema, "test_orc_record", null, true);
}

@Disabled("Disable due to hive's orc conflict.")
Could we re-enable these tests for Spark 2.4 in GitHub CI?
Maybe we can add a @Tag like Spark2_4only for this class.
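Something along these lines (a sketch; the tag name is made up, and TestAvroOrcUtils just stands in for the disabled class):

```java
// Sketch: tag the ORC tests so they can be filtered out per Spark profile.
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

@Tag("spark2_4only")  // hypothetical tag name
public class TestAvroOrcUtils {
  @Test
  public void testCreateAvroSchemaWithDefaultValue() {
    // existing test body unchanged
  }
}
```

The Spark 3 profiles could then exclude the tag via surefire's excludedGroups, while the Spark 2.4 GitHub CI job leaves it enabled.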
<exclusion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
</exclusion>
Could you clarify if this is needed? Any implication on Spark bundle (e.g., missing Hadoop-related classes)? Is this for test only?
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

@Disabled
If this is due to the multiple Spark context exception, we should rewrite this test with SparkClientFunctionalTestHarness so it avoids initializing another Spark context; that should fix the tests.
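i.e. roughly (a sketch; the test class name and body are illustrative, and I'm assuming the harness exposes the shared context via jsc()):

```java
// Sketch: extend the harness so the test reuses its shared Spark context
// instead of creating a second one.
import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.jupiter.api.Test;

public class TestFoo extends SparkClientFunctionalTestHarness {  // hypothetical test class
  @Test
  public void testWithSharedSparkContext() {
    JavaSparkContext jsc = jsc();  // shared context provided by the harness
    // ... run the existing assertions against jsc / spark() ...
  }
}
```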
As we discussed, there is a risk in landing this if there are any changes to the bundles at this point. Before landing the PR:
@rahil-c close this? Are we going to use Spark 3.2 or 3.3 as the default?
The latest status of this work is in #7327; I'll close this one in favor of that.