[HUDI-3280] Cleaning up Hive-related hierarchies after refactoring #4743
Conversation
Force-pushed 6104acf to 75000d7
nsivabalan left a comment:
If you don't mind, can you summarize what the different InputSplits are (after all the refactoring and fixes) and what each split is used for?
    public HoodieRealtimeFileSplit buildSplit(Path file, long start, long length, String[] hosts) {
      HoodieRealtimeFileSplit bs = new HoodieRealtimeFileSplit(file, start, length, hosts);
      bs.setBelongsToIncrementalQuery(belongsToIncrementalQuery);
Why is pathWithBootstrapFileStatus not considered in this buildSplit? Or do we need to have another builder?
Force-pushed 69659d7 to fdc9cd0
Force-pushed 61705e3 to 62ea4a5
@hudi-bot run azure
Moved `BaseFileWithLogsSplit` under `realtime` package
…ic is spilled into COW impl
Force-pushed 8187ca2 to a124630
yihua left a comment:
Overall LGTM. @nsivabalan could you take another look?
    Path fullPath = relativeFilePath != null ? FSUtils.getPartitionPath(basePath, relativeFilePath) : null;
    if (fullPath != null) {
      long blockSize = FSUtils.getFs(fullPath.toString(), hadoopConf).getDefaultBlockSize(fullPath);
      FileStatus fileStatus = new FileStatus(stat.getFileSizeInBytes(), false, 0, 0,
Should the block size be extracted outside the loop, based on the file system of the base path, since it's a file-system config?
It's based on the path, so I don't want to bake in the assumption that all paths are homogeneous (they might be pointing at different HDFS nodes, with different block settings).
Block size is a per-file thing in HDFS, but rarely ever set differently. For cloud storage it's all 0s. We need to see if getDefaultBlockSize() is an RPC call; if so, we need to avoid this, even if that assumes things.
Let me check. The reason why I'm fixing this is because with block size 0, Hive will slice the file into 1-byte blocks.
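To see why a zero block size is so damaging, here is a standalone sketch of the split-size arithmetic (the class below is purely illustrative, not Hudi code; `computeSplitSize` mirrors the `max(minSize, min(goalSize, blockSize))` formula used by Hadoop's mapred `FileInputFormat`):

```java
// Illustrative demo, not Hudi code: mirrors Hadoop's split-size formula,
// max(minSize, min(goalSize, blockSize)).
public class SplitSizeDemo {
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long goalSize = 128L * 1024 * 1024; // roughly total bytes / desired split count
    long minSize = 1;                   // default minimum split size

    // With a sane HDFS block size, splits track the block size:
    System.out.println(computeSplitSize(goalSize, minSize, 128L * 1024 * 1024)); // 134217728

    // With blockSize == 0 (as reported for some storage paths), the formula
    // collapses to minSize, i.e. 1-byte splits:
    System.out.println(computeSplitSize(goalSize, minSize, 0)); // 1
  }
}
```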
@vinothchandar there's no RPC in the path (the config is stored within DFSClient).
@alexeykudinkin I think I confused it with the getDefaultBlockSize() call, which is based on the file system (see below), not the file status, and it only fetches from the config. This is fine.
Even if the block size is 0, should Hive still honor the actual block size of the file? At least that's my understanding for the Trino Hive connector.
    /**
     * Return the number of bytes that large input files should be optimally
     * be split into to minimize i/o time.
     * @deprecated use {@link #getDefaultBlockSize(Path)} instead
     */
    @Deprecated
    public long getDefaultBlockSize() {
      // default to 32MB: large enough to minimize the impact of seeks
      return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
    }
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
      return super.isSplitable(fs, filename);
    }

    @Override
    protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
      return super.makeSplit(file, start, length, hosts);
    }

    @Override
    protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts) {
      return super.makeSplit(file, start, length, hosts, inMemoryHosts);
    }

    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
      return super.listStatus(job);
    }
These seem unnecessary?
It's subtle: the Parquet-based hierarchy isn't inheriting from this one, so those classes can't access these methods (which are protected in FileInputFormat). With the methods re-declared here, they now can, because this class is within the same package as the classes trying to access them.
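For readers puzzling over this, the trick relies on Java's access rules: a protected member is accessible both to subclasses and to classes in the same package as the class that declares it. A rough two-file sketch (the helper class name is illustrative, not actual Hudi code):

```java
// File 1: org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
// The subclass re-declares the protected method, so the override is now a
// member declared inside the org.apache.hudi.hadoop package.
public class HoodieCopyOnWriteTableInputFormat extends FileInputFormat {
  @Override
  protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
    return super.makeSplit(file, start, length, hosts);
  }
}

// File 2 (hypothetical): same package, but NOT a FileInputFormat subclass.
// It could not call FileInputFormat.makeSplit directly (protected, other
// package), but it CAN call the re-declared member above, since protected
// access includes same-package access.
class SomeParquetHelper {
  FileSplit split(HoodieCopyOnWriteTableInputFormat fmt, Path file, long len, String[] hosts) {
    return fmt.makeSplit(file, 0, len, hosts); // legal: same-package access
  }
}
```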
Got it. I see your point now. It would be good to add the reasoning as javadocs of this class.
     * Hence, this class tracks a log/base file status in Path.
     */
    public class RealtimeFileStatus extends FileStatus {
Do you want to rename this and the other classes with the Realtime word as well?
Kept the naming as Realtime as an homage to the previous state of things. I don't really want to rename pretty much all of the classes in the Hive hierarchy (and essentially cut off their history).
Realtime is a remnant of the design motivations for MOR snapshot queries. We might even revive it after the caching story.
vinothchandar left a comment:
Took a first pass. Can you confirm a few things?
    }

    @Override
    public final void setConf(Configuration conf) {
I am blanking now, but it's worth tracing why we needed these, i.e. any other Hive gotchas around combine input format, etc.?
They still stay, just moved into HoodieTableInputFormat
      return createRealtimeFileSplit(path, start, length, hosts);
    }

    private static boolean containsIncrementalQuerySplits(List<FileSplit> fileSplits) {
Are these methods just moved over/consolidated?
Mostly. createRealtimeFileStatusUnchecked was also refactored to accommodate the RealtimeFileStatus changes.
…(true for Hive, false for Spark)
What is the purpose of the pull request
This PR cleans up Hive-related hierarchies after refactoring
Brief change log
- `HoodieRealtimeFileSplit` w/ `BaseWithLogFilesSplit`
- `RealtimeBootstrapBaseFileSplit`
- `HoodieInputFormatUtils` closer to where they're used

List of commits in this patch:
https://github.com/apache/hudi/pull/4743/files/7d12ce874c3bfb9861b00449b90ebde0125388b7..9501b4ca02751b4b1223650fd85945691016f36d
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.