[HUDI-3088] Use Spark 3.2 as default Spark version #8445
Conversation
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
    </exclusion>
  </exclusions>
</dependency>
Removing these won't affect the cli/spark/utilities bundles, since Spark will be provided at runtime.
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<exclusions>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
  </exclusion>
  <exclusion>
ditto
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
  </exclusion>
  <exclusion>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
  </exclusion>
</exclusions>
</dependency>
ditto
- script: |
    grep "testcase" */target/surefire-reports/*.xml */*/target/surefire-reports/*.xml | awk -F'"' ' { print $6,$4,$2 } ' | sort -nr | head -n 100
  displayName: Top 100 long-running testcases
- job: IT
IT will still run with Spark 2.4 as per the docker demo setup, and has been moved to GH Actions.
@Test
public void testWriteReadWithEvolvedSchema() throws Exception {
// Disable the test with evolved schema for HFile since it's not supported
public void testWriteReadWithEvolvedSchema(String evolvedSchemaPath) throws Exception {
Should we just remove it for now? There's already a tracking JIRA.
It's disabled with a message, which should be a good reminder there. A sketch of the pattern is below.
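For reference, a minimal JUnit 5 sketch of that pattern (hypothetical class name and schema path, not the actual Hudi test): skipping the unsupported HFile case with an explicit message keeps the reminder visible in the test report instead of silently dropping coverage.

import static org.junit.jupiter.api.Assumptions.assumeTrue;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class EvolvedSchemaSkipSketch {

  @ParameterizedTest
  @ValueSource(strings = {"/evolved.avsc"}) // hypothetical schema path
  void testWriteReadWithEvolvedSchema(String evolvedSchemaPath) {
    // HFile does not support evolved-schema reads yet (tracked in a JIRA), so the
    // case is skipped with a reason that shows up in the surefire report.
    boolean hfileUnderTest = true; // stand-in for the real file-format check
    assumeTrue(!hfileUnderTest, "Disabled for HFile: evolved schema is not supported");
    // ... write/read assertions for the supported formats would follow here ...
  }
}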
// Initialize HbaseMiniCluster
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
why do we need these configs in this PR?
This is caused by the Spark version upgrade, which also changed the ZooKeeper version.
Trying to run it in CI without these; locally it seems fine.
"zookeeper.4lw.commands.whitelist", "*" will be required to start the test service properly with new zookeeper version used. added it back
// set env and directly in order to handle static init/gc issues
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
Same question as above. If it's a test issue, then maybe add it in a separate PR?
hudi-integ-test/pom.xml (outdated)
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>${parquet.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>${parquet.version}</version>
  <scope>test</scope>
</dependency>
Why are these dependencies needed here? Also, if we run integ tests with hudi-spark3.x-bundle, wouldn't they already be present on the classpath?
Maybe parquet-hadoop is not required; I'll have to re-run to confirm.
This actually affects Flink's IT. Considering these are only test deps, I did not dig further.
2023-05-24T08:46:49.7944885Z [ERROR] Tests run: 128, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 1,026.49 s <<< FAILURE! - in org.apache.hudi.table.ITTestHoodieDataSource
2023-05-24T08:46:49.7945946Z [ERROR] testUpdateDelete{String, HoodieTableType}[1] Time elapsed: 2.196 s <<< ERROR!
2023-05-24T08:46:49.7947263Z org.apache.flink.table.api.TableException: Unsupported query: update t1 set age=18 where uuid in('id1', 'id2')
2023-05-24T08:46:49.7947914Z at org.apache.hudi.table.ITTestHoodieDataSource.execInsertSql(ITTestHoodieDataSource.java:2095)
2023-05-24T08:46:49.7948557Z at org.apache.hudi.table.ITTestHoodieDataSource.testUpdateDelete(ITTestHoodieDataSource.java:1971)
This actually affects Flink's IT. Considering these are only test deps, I did not dig further.
Scratch that: this is caused by a newly added test case that only runs with Flink 1.17.
@MethodSource("bulkInsertTypeParams")
public void testDataSourceWriter(boolean populateMetaFields) throws Exception {
testDataSourceWriterInternal(Collections.EMPTY_MAP, Collections.EMPTY_MAP, populateMetaFields);
testDataSourceWriterInternal(Collections.emptyMap(), Collections.emptyMap(), populateMetaFields);
Just asking, does it make any difference? I believe both are the same thing, right?
As per the API's javadoc, using the method should be preferred:
Using this method is likely to have comparable cost to using the like-named field. (Unlike this method, the field does not provide type safety.)
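A small illustration of the type-safety point from that javadoc (class and method names here are illustrative only): the method is generic and infers the map's type parameters, while the raw EMPTY_MAP field needs an unchecked conversion.

import java.util.Collections;
import java.util.Map;

class EmptyMapSketch {

  @SuppressWarnings("unchecked")
  static void demo() {
    Map<String, String> typed = Collections.emptyMap(); // type-safe: Map<String, String> is inferred
    Map<String, String> raw = Collections.EMPTY_MAP;    // raw field: unchecked conversion warning
    System.out.println(typed.isEmpty() + " " + raw.isEmpty());
  }
}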
int numRecords = 15;
int numPartitions = 3;
int numRecords = 30;
int numPartitions = 2;
why change these values?
This has something to do with org.apache.spark.streaming.kafka010.KafkaTestUtils, which only utilizes 2 partitions (probably limited by Spark partitions due to the 1:1 mapping). Hence using more than 2 will not give the expected even number of records per partition. This can be pushed as a follow-up to improve the test setup.
Filed the improvement ticket: https://issues.apache.org/jira/browse/HUDI-6266. A back-of-the-envelope check of the new values is sketched below.
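Plain Java, no Kafka calls; the 2-partition limit is taken from the comment above. With only 2 usable partitions, 15 records can never split evenly, whereas 30 gives exactly 15 per partition.

class PartitionMathSketch {

  public static void main(String[] args) {
    int usablePartitions = 2; // what the test utility effectively provides
    for (int records : new int[] {15, 30}) {
      boolean even = records % usablePartitions == 0;
      System.out.printf("records=%d perPartition=%d evenSplit=%b%n",
          records, records / usablePartitions, even);
    }
  }
}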
assertEquals(appendList, withKafkaOffsetColumns.subList(withKafkaOffsetColumns.size() - 3, withKafkaOffsetColumns.size()));

dfNoOffsetInfo.unpersist();
dfWithOffsetInfo.unpersist();
+1
int numPartitions = 3;
int numMessages = 15;
int numPartitions = 2;
int numMessages = 30;
Same here, why change it? Is it to reduce the test time (fewer partitions, less I/O)?
<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>prepare-agent</goal>
      </goals>
This step from the jacoco plugin under the integration-tests profile caused a mysterious class-loading issue in Flink 1.17. We need to be cautious with seemingly innocuous plugins! cc @danny0405
[ERROR] testMergeOnReadInputFormatLogFileOnlyIteratorGetUnMergedLogFileIterator Time elapsed: 0.006 s <<< ERROR!
java.lang.NoClassDefFoundError: Could not initialize class org.apache.calcite.rel.metadata.DefaultRelMetadataProvider
at org.apache.hudi.table.ITTestSchemaEvolution.setUp(ITTestSchemaEvolution.java:81)
Change Logs
Make spark3.2 the default Spark profile.
Impact
Default dev profile change. Local repo setup should be updated accordingly.
Risk level
Medium.
Use bundle validate to verify artifacts.
Documentation Update
Contributor's checklist