Skip to content

Conversation

@rahil-c
Copy link
Collaborator

@rahil-c rahil-c commented Jun 7, 2022

Tips

What is the purpose of the pull request

Main Changes

  • Change Hudi Docker Setup to support these upgraded versions( renamed several files, added necessary dependencies and configurations for Hadoop Hive etc.)
  • Upgrade jetty.version to 9.4.43.v20210629 and javalin version to 3.13.12 ( since these are couple together) to stay in line with EMR versions of these deps, as well as minor api fixes to RequestHandler and TimelineServer
  • Resolve Jetty dep conflict by doing many exclusions of jetty dependency throughout the package since Hadoop and Hive bring in jetty 9.3.x which causes conflicts with the above jetty version.
  • Added jetty dependencies in timeline-service pom in order to avoid dependency conflicts (without this many tests fail)
  • Minor Api changes made in HoodieRealtimeRecordReaderUtils in order for both avro 1.8.2 and avro 1.10.2 versions to be used.
  • Define explicit parquet version in Hudi hadoop mr ( the reason for this is that hive 3.1.2 expects parquet 1.10.2, otherwise we get into a dep conflict since the parquet version defined in root pom is 1.12 which is what spark 3.2.1 uses.)
  • Make changes in azure-pipelines.yaml for spark 3.2.1
  • Fix tests to be compatible with spark 3.2.1
  • Resolve log4j dependency conflict (several exclusions done)
  • Resolve netty dependency conflict (several exclusions done)
  • Fix test failures reported since we have not been running azure ci with spark3 before this pr
  • Disabled tests which were flaky or have an issue (documented below)

List of tests disabled in order to have the azure ci green (reasons documented in each JIRA)

https://issues.apache.org/jira/browse/HUDI-4233 testMergeOnReadSnapshotRelationWithDeltaLogsFallback
https://issues.apache.org/jira/browse/HUDI-4239 'TestCOWDataSourceStorage.testCopyOnWriteStorage'
https://issues.apache.org/jira/browse/HUDI-4241 ITTestHoodieSanity. testRunHoodieJavaAppOnMultiPartitionKeysMORTable
https://issues.apache.org/jira/browse/HUDI-4234 ITTestHoodieDataSource (flink related)
https://issues.apache.org/jira/browse/HUDI-4231 TestHoodieFlinkQuickstart.testHoodieFlinkQuickstart
https://issues.apache.org/jira/browse/HUDI-4229 TestOrcBootstrap
https://issues.apache.org/jira/browse/HUDI-4230 TestHiveIncrementalPuller.testPuller
https://issues.apache.org/jira/browse/HUDI-4236 ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable
https://issues.apache.org/jira/browse/HUDI-4235 testSyncCOWTableWithProperties,testSyncMORTableWithProperties
https://issues.apache.org/jira/browse/HUDI-4232 below are test disabled for this

 TestHoodieDeltaStreamer,TestHoodieMultiTableDeltaStreamer,TestMultiFS#readLocalWriteHDFS, TestMergeOnReadRollbackActionExecutor#testRollbackForCanIndexLogFile,TestSparkHoodieHBaseIndex
TestCsvDFSSource
TestAvroDFSSource
TestJsonDFSSource
TestCsvDFSSource
TestParquetDFSSource
TestS3EventsSource
TestSqlSource
TestMysqlDebeziumSource
TestPostgresDebeziumSource
TestSqlFileBasedTransformer
TestHoodieLogFormat
TestHoodieLogFormatAppendFailure
TestDFSPropertiesConfiguration
TestHiveSyncGlobalCommitTool
TestInputPathHandler
TestHoodieCombineHiveInputFormat

https://issues.apache.org/jira/browse/HUDI-4264 ITTestCompactionCommand.testRepairCompaction
testValidateCompaction
testUnscheduleCompactFile
https://issues.apache.org/jira/browse/HUDI-4263 Disabled Azure IT unit test section
https://issues.apache.org/jira/browse/HUDI-4262 testRollbackWithDeltaAndCompactionCommit

Other notes

  • Revisit issue with not specifying a profile for spark and scala

Verify this pull request

(Please pick either of the following options)

This pull request is already covered by existing tests, such as (please describe tests). Azure CI

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 8, 2022

@hudi-bot run azure

@rahil-c rahil-c changed the title [Rebase][WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase) Jun 8, 2022
@yihua
Copy link
Contributor

yihua commented Jun 8, 2022

@hudi-bot run azure

@apache apache deleted a comment from rahil-c Jun 8, 2022
@nsivabalan
Copy link
Contributor

@hudi-bot run_azure

LOG.info(String.format("Waiting for all the containers and services finishes in %d ms",
System.currentTimeMillis() - currTs));
try {
Thread.sleep(30000);
Copy link
Collaborator Author

@rahil-c rahil-c Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to find a better way of checking to see if all services is up, ideally the method servicesUp should be functioning correctly but it seems that I have run into issues where certain containers were not fully setup causing issues with integ tests. For now i added a sleep for 30 sec to make sure things are created, but revisiting would be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be the reason why Azure CI tests take a longer time to finish. Would be good to figure out a better way to determine whether the environment is ready for tests.

/**
* Basic tests against {@link HoodieDeltaStreamer}, by issuing bulk_inserts, upserts, inserts. Check counts at the end.
*/
@Disabled("Disabled due to HDFS MiniCluster jetty conflict")
Copy link
Collaborator Author

@rahil-c rahil-c Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revisit tests affected by the mini cluster issue first https://issues.apache.org/jira/browse/HUDI-4232

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this

@rahil-c rahil-c changed the title [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase) Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase) Jun 9, 2022
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run_azure

1 similar comment
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run_azure

@rahil-c rahil-c force-pushed the hdp3-hive3-spark3-branch-rebase branch from 1530b65 to bca9e34 Compare June 9, 2022 15:34
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run azure

4 similar comments
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run azure

@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run azure

@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run azure

@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 9, 2022

@hudi-bot run azure

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Jun 9, 2022
@rahil-c rahil-c force-pushed the hdp3-hive3-spark3-branch-rebase branch from dd6f7c0 to 5e3ada3 Compare June 9, 2022 23:47
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 10, 2022

@hudi-bot run azure

1 similar comment
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jun 12, 2022

@hudi-bot run azure

Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a second pass of the POM changes.

options: $(MVN_OPTS_INSTALL) -Pintegration-tests
publishJUnitResults: false
jdkVersionOption: '1.8'
- task: Maven@3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still a problem to run the unit tests specified here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe when i tried readding the unit tests I ran into issues again with mini dfs cluster based on the JIRA https://issues.apache.org/jira/browse/HUDI-4263

- job: IT
displayName: IT modules
timeoutInMinutes: '120'
timeoutInMinutes: '180'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why it takes longer for the Spark 3 tests? Or this change is not necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change might not fully be necessary, but helps with the case for when i would see the tests slightly run over the limit and then auto terminate.

HIVE_SITE_CONF_hive_metastore_uris=thrift://hivemetastore:9083
HIVE_SITE_CONF_hive_metastore_uri_resolver=org.apache.hudi.hadoop.hive.NoOpMetastoreUriResolverHook
HIVE_SITE_CONF_hive_metastore_event_db_notification_api_auth=false
HIVE_SITE_CONF_hive_execution_engine=mr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapReduce is no longer supported in Hive 3. Should this be configured with tez?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is I think a cavaet to this, I believe that Mapreduce is supported in some areas of hive based on the hive code but in general the default execution engine for hive3 is tez. From integtest stand point it seems that its been working fine with this setting above but i can try changing to tez to see if it causes issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline sync with @yihua

For now keeping engine as mr since switching to tez causes errors in the IT tests still.

<artifactId>awaitility</artifactId>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used for? Does it have a compatible license?

Copy link
Collaborator Author

@rahil-c rahil-c Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is you are referring to this dependency https://github.com/paul-hammant/paranamer?

    <dependency>
      <groupId>com.thoughtworks.paranamer</groupId>
      <artifactId>paranamer</artifactId>
      <version>2.8</version>
      <scope>test</scope>
    </dependency>

this change was made by @xushiyan https://issues.apache.org/jira/browse/HUDI-3088 #4752 but i think we need this newer 2.8 for spark 3.2.1 if its the default profile. In master we have this https://github.com/apache/hudi/search?q=paranamer in the licenses we seem to be referring to paranamer 2.7

Comment on lines +127 to +131
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<version>${jetty.version}</version>
</dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +245 to +254
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question. And it looks like all artifacts are excluded?

Comment on lines -217 to -218
// Allow queries without partition predicate
executeStatement("set hive.strict.checks.large.query=false", stmt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to keep this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we should not keep this, as it caused errors in the actual test. The reason being that this param existed in Hive 2.3.x https://github.com/apache/hive/blob/release-2.3.8-rc3/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

whereas in Hive 3.1.2 it has been removed. The below properties for hive.strict.checks are as follows

    HIVE_STRICT_CHECKS_ORDERBY_NO_LIMIT("hive.strict.checks.orderby.no.limit", false,
        "Enabling strict large query checks disallows the following:\n" +
        "  Orderby without limit.\n" +
        "Note that this check currently does not consider data size, only the query pattern."),
    HIVE_STRICT_CHECKS_NO_PARTITION_FILTER("hive.strict.checks.no.partition.filter", false,
        "Enabling strict large query checks disallows the following:\n" +
        "  No partition being picked up for a query against partitioned table.\n" +
        "Note that this check currently does not consider data size, only the query pattern."),
    HIVE_STRICT_CHECKS_TYPE_SAFETY("hive.strict.checks.type.safety", true,
        "Enabling strict type safety checks disallows the following:\n" +
        "  Comparing bigints and strings.\n" +
        "  Comparing bigints and doubles."),
    HIVE_STRICT_CHECKS_CARTESIAN("hive.strict.checks.cartesian.product", false,
        "Enabling strict Cartesian join checks disallows the following:\n" +
        "  Cartesian product (cross join)."),
    HIVE_STRICT_CHECKS_BUCKETING("hive.strict.checks.bucketing", true,
        "Enabling strict bucketing checks disallows the following:\n" +
        "  Load into bucketed tables."),
    HIVE_LOAD_DATA_OWNER("hive.load.data.owner", "",
        "Set the owner of files loaded using load data in managed tables."),

    @Deprecated
    HIVEMAPREDMODE("hive.mapred.mode", null,
        "Deprecated; use hive.strict.checks.* settings instead."),

https://github.com/apache/hive/blob/rel/release-3.1.0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

Comment on lines 32 to 33
<!-- override parquet version to be same as Hive 3.1.2 -->
<parquet.version>1.10.1</parquet.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rahil-c adding to Udit's point, do we need to build the hudi-hadoop-mr-bundle for Hive 2.x and 3.x differently, based on the dependency version used?

Comment on lines +412 to +417
<exclusions>
<exclusion>
<groupId>org.eclipse.jetty</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @rahil-c let's try to minimize the exclusions.

@rahil-c
Copy link
Collaborator Author

rahil-c commented Jul 11, 2022

@hudi-bot run azure

1 similar comment
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jul 11, 2022

@hudi-bot run azure

@rahil-c rahil-c force-pushed the hdp3-hive3-spark3-branch-rebase branch from 85ded11 to 759bf0c Compare July 12, 2022 18:09
@rahil-c
Copy link
Collaborator Author

rahil-c commented Jul 12, 2022

@hudi-bot run azure

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@prasannarajaperumal prasannarajaperumal removed the priority:blocker Production down; release blocker label Jul 20, 2022
@yihua yihua added the priority:high Significant impact; potential bugs label Jul 20, 2022
@HEPBO3AH
Copy link

Any known blockers on this?

@xushiyan xushiyan added the status:in-progress Work in progress label Oct 31, 2022
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Feb 26, 2024
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark 3.5 is the default now, in Azure CI as well. We'll use the #11539 to follow up on the Hadoop 3 support. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency updates priority:high Significant impact; potential bugs size:XL PR with lines of changes > 1000 status:in-progress Work in progress

Projects

Status: 🏁 Triaged
Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

9 participants