Conversation

@alexeykudinkin (Contributor):

What is the purpose of the pull request

Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex

Brief change log

  • Moving AbstractHoodieTableFileIndex to "hudi-spark-common" (temporarily, will be migrated to "hudi-common")
  • Bootstrapping HiveHoodieTableFileIndex impl of AbstractHoodieTableFileIndex for Hive
  • Rebasing HiveFileInputFormatBase onto HiveHoodieTableFileIndex
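The layering described in the change log (an engine-agnostic file index with per-engine subclasses) can be sketched roughly as follows. All class and method names below are simplified stand-ins for illustration, not the actual AbstractHoodieTableFileIndex API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Engine-agnostic core: subclasses only supply engine-specific listing.
// This mirrors the AbstractHoodieTableFileIndex / HiveHoodieTableFileIndex
// split in this PR, but every member below is a simplified stand-in.
abstract class AbstractFileIndexSketch {

  // Engine-specific way of enumerating candidate partition paths.
  protected abstract List<String> listPartitionPaths();

  // Shared logic: filter partitions by a simple prefix "predicate".
  public List<String> filterPartitions(String prefix) {
    return listPartitionPaths().stream()
        .filter(p -> p.startsWith(prefix))
        .collect(Collectors.toList());
  }
}

// Hive-side impl: in the real PR this is backed by Hive's input paths.
class HiveFileIndexSketch extends AbstractFileIndexSketch {
  private final List<String> partitions;

  HiveFileIndexSketch(List<String> partitions) {
    this.partitions = partitions;
  }

  @Override
  protected List<String> listPartitionPaths() {
    return partitions;
  }
}

public class FileIndexDemo {
  public static void main(String[] args) {
    AbstractFileIndexSketch index =
        new HiveFileIndexSketch(Arrays.asList("2022/01/07", "2022/01/08", "2021/12/31"));
    System.out.println(index.filterPartitions("2022/"));
  }
}
```

The point of the split is that partition-pruning and file-slicing logic lives once in the abstract base, while each engine only adapts its own listing mechanics.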

Verify this pull request

This pull request is covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [WIP][HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 7, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-3 branch 4 times, most recently from 8848863 to 53b67bd Compare January 8, 2022 02:00
@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 8, 2022
@alexeykudinkin (Contributor Author):

@hudi-bot run azure

@alexeykudinkin (Contributor Author):

@hudi-bot run azure

Comment on lines +159 to +165
<!-- Scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>

Contributor:

I'm not sure if it's a good idea to have scala deps in hudi-common. Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

@alexeykudinkin (Contributor Author), Jan 14, 2022:

> I'm not sure if it's a good idea to have scala deps in hudi-common.

Yeah, this is a by-product of keeping this FileIndex in Scala for now. Unfortunately, there's also no other module shared by both Spark and Hive.

HUDI-3239 would actually resolve this by converting it to Java.

> Then there will be hudi-common_2.11 and hudi-common_2.12, which is not really necessary.

This is not necessary, since we don't depend on any diverging behavior here.

Contributor:

Cool

* path with the partition columns in this case.
*
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Contributor:

Maybe we should implement this in Java instead? Since it's no longer Spark-specific, it's not required to be in Scala.

Contributor Author:

Yeah, there's a task for it: HUDI-3239.

public static File prepareParquetTable(java.nio.file.Path basePath, Schema schema, int numberOfFiles,
int numberOfRecords, String commitNumber, HoodieTableType tableType) throws IOException {
HoodieTestUtils.init(HoodieTestUtils.getDefaultHadoopConf(), basePath.toString(), tableType, HoodieFileFormat.PARQUET);

Contributor:

nit: remove the added empty lines in several places?

Contributor Author:

The reason I'm adding those is to logically separate clusters of operations (like setting up partitions, inserting data, etc.), to make the code easier to digest at a glance. WDYT?

Contributor:

Got it, sounds good. I'll let you make the judgement call.

Comment on lines +202 to +203
// TODO cleanup
validate(targetFiles, listStatusForSnapshotModeLegacy(job, tableMetaClientMap, snapshotPaths));
Contributor:

Should we have a flag to turn validation off by default, for query performance?

Contributor Author:

This was actually just for CI validation. Will clean it up.

Contributor:

Got it. I'll take another look once done.

Contributor Author:

Sorry, I wasn't elaborate enough: I'm using this hook to validate that the behavior is the same as it was prior to the refactoring. I'm planning to clean this up in the topmost PR of the stack.

Member:

Let's file a JIRA for this for tracking.
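The flag suggested earlier in this thread (validation off by default, so production queries skip the expensive legacy comparison) could look roughly like this. The property name and method shapes are hypothetical, not from the PR:

```java
import java.util.List;

public class ValidationFlagDemo {
  // Off by default; enable via -Dhoodie.internal.validate.listing=true (made-up key).
  static final boolean VALIDATE =
      Boolean.parseBoolean(System.getProperty("hoodie.internal.validate.listing", "false"));

  // Fails loudly if the new file-index listing diverges from the legacy path.
  static void validate(List<String> fast, List<String> legacy) {
    if (!fast.equals(legacy)) {
      throw new IllegalStateException("File-index listing diverged from legacy listing");
    }
  }

  public static List<String> listFiles(List<String> fromIndex, List<String> fromLegacy) {
    if (VALIDATE) {
      validate(fromIndex, fromLegacy); // only pay this cost when the flag is set
    }
    return fromIndex;
  }

  public static void main(String[] args) {
    System.out.println(listFiles(List.of("a.parquet"), List.of("a.parquet")));
  }
}
```

This keeps the CI-only cross-check available without penalizing regular query paths.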

* @param shouldIncludePendingCommits flags whether file-index should exclude any pending operations
* @param fileStatusCache transient cache of fetched [[FileStatus]]es
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Contributor:

Let's rename the class with a Base prefix here.


@alexeykudinkin alexeykudinkin changed the title [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex Jan 18, 2022
@hudi-bot (Collaborator):

CI report:


@yihua (Contributor) left a comment:

LGTM. I'll merge this one to unblock the stacked PRs.

@yihua yihua merged commit 4bea758 into apache:master Jan 18, 2022
@vinothchandar (Member) left a comment:

I'm actually not quite sure this is delivering real code-duplication reduction here. Also, let's please not change bundles temporarily and leave things in an intermediate state.

* </ol>
*/
public enum HoodieTableQueryType {
QUERY_TYPE_SNAPSHOT,
Member:

Drop the QUERY_TYPE prefix?

configProperties: TypedProperties,
specifiedQueryInstant: Option[String] = None,
@transient fileStatusCache: FileStatusCacheTrait) {
abstract class HoodieTableFileIndexBase(engineContext: HoodieEngineContext,
Member:

BaseHoodieTableIndex, or even just HoodieTableFileIndex?

* are queried</li>
* </ol>
*/
public enum HoodieTableQueryType {
Member:

Also, just HoodieQueryType?

Contributor Author:

Fine either way, but in HoodieQueryType the Hoodie prefix seems affiliated with the query itself ("Hoodie's query"), which is a little confusing compared to "Hoodie's table query".
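To illustrate the naming point under discussion: since the enclosing enum name already carries the context, a QUERY_TYPE_ prefix on each constant reads as redundant at call sites. The enum body below is a sketch for comparison, not the merged definition:

```java
// With the prefix dropped, call sites read as "table query type: snapshot"
// instead of repeating "query type" twice. Constant names are illustrative.
enum HoodieTableQueryTypeSketch {
  SNAPSHOT,
  READ_OPTIMIZED,
  INCREMENTAL
}

public class EnumNamingDemo {
  public static void main(String[] args) {
    // Compare: HoodieTableQueryType.QUERY_TYPE_SNAPSHOT vs. the form below.
    HoodieTableQueryTypeSketch type = HoodieTableQueryTypeSketch.SNAPSHOT;
    System.out.println(type);
  }
}
```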

import org.apache.hudi.common.model.HoodieTableQueryType;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.util.Option;
import org.slf4j.Logger;
Member:

Let's stick to log4j?

Contributor Author:

slf4j is dictated by Spark (otherwise we'd need to provide a wrapper of log4j over the slf4j interface).

try {
return HoodieInputFormatUtils.getFileStatus(baseFileOpt.get());
} catch (IOException ioe) {
throw new RuntimeException(ioe);
Member:

Wrap into HoodieIOException?

Contributor Author:

Yeah, I missed those.

Contributor Author:

Already addressed in #4559
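The suggestion here is to replace the bare RuntimeException wrapper with Hudi's domain-specific unchecked exception, preserving the IO cause. A self-contained sketch (with a local stand-in class, since the real HoodieIOException lives in Hudi's exception package):

```java
import java.io.IOException;

// Local stand-in for org.apache.hudi.exception.HoodieIOException,
// so this snippet compiles on its own.
class DemoHoodieIOException extends RuntimeException {
  DemoHoodieIOException(String msg, IOException cause) {
    super(msg, cause);
  }
}

public class WrapDemo {
  // The 'fail' switch simulates the IO layer; the real code calls
  // HoodieInputFormatUtils.getFileStatus(...) here.
  static String readFileStatus(boolean fail) {
    try {
      if (fail) {
        throw new IOException("disk error");
      }
      return "status-ok";
    } catch (IOException ioe) {
      // Domain-specific unchecked wrapper instead of bare RuntimeException:
      // callers can catch it precisely, and the IO cause is preserved.
      throw new DemoHoodieIOException("Failed to fetch file status", ioe);
    }
  }

  public static void main(String[] args) {
    System.out.println(readFileStatus(false));
  }
}
```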

rtFileStatus.setDeltaLogFiles(sortedLogFiles);
return rtFileStatus;
} catch (IOException e) {
throw new RuntimeException(e);
Member:

Same here.


<!-- TODO(HUDI-3239) remove this -->
<include>org.scala-lang:scala-library</include>

<include>org.apache.parquet:parquet-avro</include>
Member:

All of this is problematic. cc @codope

@alexeykudinkin (Contributor Author):

@vinothchandar responding inline:

> I'm actually not quite sure this is delivering real code-duplication reduction here.

Can you elaborate?

> Also, let's please not change bundles temporarily and leave things in an intermediate state.

MT doesn't work in the Presto bundles on master. @codope is working to update Hudi's Presto Docker image. As soon as that is done, either Sagar or I will revert these changes.

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022