Conversation

@xushiyan
Member

Change Logs

  • Update default dev profile to spark3.2
  • Update pom dependencies to make build pass
  • Fix failing test cases with the default profile
  • Add IT job in GH actions CI
  • Update test job setup in CI

Impact

The default dev profile changes to Spark 3.2; local repo setups should be updated accordingly.

Risk level

Medium.

Use bundle validate to verify artifacts.

Documentation Update

  • Update build instructions in docs where applicable

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from 2557433 to 07e889c on April 15, 2023 09:14
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from 07e889c to b13036a on May 14, 2023 14:38
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch 5 times, most recently from 2b4ad62 to 203caaf on May 16, 2023 12:42
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from 1f9f158 to 78d54f5 on May 22, 2023 08:30
@apache apache deleted a comment from hudi-bot May 23, 2023
Comment on lines 246 to 259
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-runtime</artifactId>
</exclusion>
</exclusions>
</dependency>
Member Author

Removing these won't affect the cli/spark/utilities bundles, since Spark will be provided there.

Comment on lines 232 to 243
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-runtime</artifactId>
</exclusion>
<exclusion>
Member Author

ditto

Comment on lines 275 to 288
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-runtime</artifactId>
</exclusion>
</exclusions>
</dependency>
Member Author

ditto

@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from fe494c5 to 5f5ce7f on May 23, 2023 09:40
- script: |
grep "testcase" */target/surefire-reports/*.xml */*/target/surefire-reports/*.xml | awk -F'"' ' { print $6,$4,$2 } ' | sort -nr | head -n 100
displayName: Top 100 long-running testcases
- job: IT
Member Author

The IT job will still run with Spark 2.4 as per the docker demo setup, and it has been moved to GitHub Actions.

@Test
public void testWriteReadWithEvolvedSchema() throws Exception {
// Disable the test with evolved schema for HFile since it's not supported
public void testWriteReadWithEvolvedSchema(String evolvedSchemaPath) throws Exception {
Member

should we just remove it for now? there's already a tracking jira.

Member Author

It's disabled with a message; that should be a good reminder there.
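
For reference, a minimal sketch of how a JUnit 5 test can be disabled with an explanatory message (the class and test names here are illustrative, not the actual code in this PR):

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

public class TestHFileEvolvedSchemaSketch {

  // The message is surfaced in test reports and serves as the reminder until
  // the tracking JIRA is resolved.
  @Disabled("Schema evolution is not supported for HFile yet; see the tracking JIRA")
  @Test
  public void testWriteReadWithEvolvedSchema() throws Exception {
    // test body omitted
  }
}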

// Initialize HbaseMiniCluster
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
Member

why do we need these configs in this PR?

Member Author

This is caused by the Spark version upgrade, which also changed the ZooKeeper version.

Member Author

Trying to run it in CI without these; locally it seems fine.

Member Author

"zookeeper.4lw.commands.whitelist", "*" will be required to start the test service properly with new zookeeper version used. added it back

// set env and directly in order to handle static init/gc issues
System.setProperty("zookeeper.preAllocSize", "100");
System.setProperty("zookeeper.maxCnxns", "60");
System.setProperty("zookeeper.4lw.commands.whitelist", "*");
Member

same question as above. if it's a test issue then maybe add it in a separate PR?

Comment on lines 102 to 114
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>${parquet.version}</version>
<scope>test</scope>
</dependency>

<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-hadoop</artifactId>
<version>${parquet.version}</version>
<scope>test</scope>
</dependency>
Member

why are these dependencies needed here? also, if we run integ tests with hudi-spark3.x-bundle, wouldn't they already be present in classpath?

Member Author

Maybe parquet-hadoop is not required; I'll have to re-run to confirm.

Member Author

This actually affects Flink's IT. Considering these are only test dependencies, I did not dig further.

2023-05-24T08:46:49.7944885Z [ERROR] Tests run: 128, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 1,026.49 s <<< FAILURE! - in org.apache.hudi.table.ITTestHoodieDataSource
2023-05-24T08:46:49.7945946Z [ERROR] testUpdateDelete{String, HoodieTableType}[1]  Time elapsed: 2.196 s  <<< ERROR!
2023-05-24T08:46:49.7947263Z org.apache.flink.table.api.TableException: Unsupported query: update t1 set age=18 where uuid in('id1', 'id2')
2023-05-24T08:46:49.7947914Z 	at org.apache.hudi.table.ITTestHoodieDataSource.execInsertSql(ITTestHoodieDataSource.java:2095)
2023-05-24T08:46:49.7948557Z 	at org.apache.hudi.table.ITTestHoodieDataSource.testUpdateDelete(ITTestHoodieDataSource.java:1971)

Member Author

This actually affects Flink's IT. Considering these are only test dependencies, I did not dig further.

Scratch that: this is caused by a newly added test case that only runs with Flink 1.17.

@MethodSource("bulkInsertTypeParams")
public void testDataSourceWriter(boolean populateMetaFields) throws Exception {
testDataSourceWriterInternal(Collections.EMPTY_MAP, Collections.EMPTY_MAP, populateMetaFields);
testDataSourceWriterInternal(Collections.emptyMap(), Collections.emptyMap(), populateMetaFields);
Member

Just asking, does it make any difference? I believe both are the same thing, right?

Member Author

As per the API's javadoc, using the method should be preferred:

Using this method is likely to have comparable cost to using the like-named field. (Unlike this method, the field does not provide type safety.)
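
A small sketch of the type-safety difference the javadoc is pointing at (illustrative only):

import java.util.Collections;
import java.util.Map;

public class EmptyMapSketch {
  public static void main(String[] args) {
    // The factory method is generic, so the result is typed with no casts and
    // no unchecked-conversion warning.
    Map<String, Integer> viaMethod = Collections.emptyMap();

    // The like-named field is a raw Map; assigning it to a parameterized type
    // compiles only with an unchecked conversion, which must be suppressed.
    @SuppressWarnings("unchecked")
    Map<String, Integer> viaField = Collections.EMPTY_MAP;

    System.out.println(viaMethod.equals(viaField)); // true: both are empty maps
  }
}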

int numRecords = 15;
int numPartitions = 3;
int numRecords = 30;
int numPartitions = 2;
Member

why change these values?

Member Author

This has something to do with org.apache.spark.streaming.kafka010.KafkaTestUtils, which only utilizes 2 partitions (probably limited by Spark partitions due to the 1:1 mapping); hence using more than 2 will not give the expected even number of records per partition. This can be pushed as a follow-up to improve the test setup.
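
To illustrate the even-distribution point above, a rough sketch of round-robin assignment of records to partitions (numbers and class name are illustrative, not the actual test utility):

import java.util.HashMap;
import java.util.Map;

public class PartitionDistributionSketch {
  public static void main(String[] args) {
    int numPartitions = 2;  // partitions actually usable in the test setup
    int numRecords = 30;

    // Round-robin placement yields an even 15/15 split for 30 records over 2
    // partitions; assuming 3 partitions while only 2 are used breaks the
    // "equal records per partition" expectation in the test.
    Map<Integer, Integer> counts = new HashMap<>();
    for (int i = 0; i < numRecords; i++) {
      counts.merge(i % numPartitions, 1, Integer::sum);
    }
    System.out.println(counts); // {0=15, 1=15}
  }
}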

Member Author

assertEquals(appendList, withKafkaOffsetColumns.subList(withKafkaOffsetColumns.size() - 3, withKafkaOffsetColumns.size()));

dfNoOffsetInfo.unpersist();
dfWithOffsetInfo.unpersist();
Member

+1

int numPartitions = 3;
int numMessages = 15;
int numPartitions = 2;
int numMessages = 30;
Member

Same here, why change it? Is it to reduce the test time (fewer partitions, less I/O)?

@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch 3 times, most recently from 53b3c28 to 5299e7b on May 24, 2023 17:08
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from 5299e7b to 70af3bf on May 25, 2023 04:19
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch 3 times, most recently from 8011027 to 4e93aac on May 25, 2023 05:01
@xushiyan added the priority:critical (Production degraded; pipelines stalled) label on May 25, 2023
@xushiyan force-pushed the HUDI-3088-default-spark32-2 branch from 540705c to 8bc8b9c on May 25, 2023 16:26
Comment on lines -1964 to -1971
<plugin>
<groupId>org.jacoco</groupId>
<artifactId>jacoco-maven-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>prepare-agent</goal>
</goals>
Member Author
@xushiyan, May 25, 2023

This step from the jacoco plugin under the integration-tests profile caused a mysterious class-loading issue in Flink 1.17. We need to be cautious with seemingly innocuous plugins! cc @danny0405

[ERROR] testMergeOnReadInputFormatLogFileOnlyIteratorGetUnMergedLogFileIterator  Time elapsed: 0.006 s  <<< ERROR!
java.lang.NoClassDefFoundError: Could not initialize class org.apache.calcite.rel.metadata.DefaultRelMetadataProvider
	at org.apache.hudi.table.ITTestSchemaEvolution.setUp(ITTestSchemaEvolution.java:81)

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

  • dependencies (Dependency updates)
  • priority:critical (Production degraded; pipelines stalled)
  • release-0.14.0
