Conversation

Contributor

@alexeykudinkin alexeykudinkin commented Feb 3, 2022

What is the purpose of the pull request

This PR cleans up Hive-related hierarchies after refactoring

Brief change log

  • Merged HoodieRealtimeFileSplit w/ BaseWithLogFilesSplit
  • Fixed RealtimeBootstrapBaseFileSplit
  • Moved methods from HoodieInputFormatUtils closer to where they're used
  • Cleaned up unused methods

List of commits in this patch:
https://github.com/apache/hudi/pull/4743/files/7d12ce874c3bfb9861b00449b90ebde0125388b7..9501b4ca02751b4b1223650fd85945691016f36d

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3280] Cleaning up Hive-related hierarchies after refactoring [HUDI-3280][Stacked on 4669] Cleaning up Hive-related hierarchies after refactoring Feb 3, 2022
@yihua yihua self-assigned this Feb 3, 2022
Contributor

@nsivabalan nsivabalan left a comment


If you don't mind, can you summarize what the different InputSplits are (after all the refactoring and fixes) and what each split is used for?

public HoodieRealtimeFileSplit buildSplit(Path file, long start, long length, String[] hosts) {
  HoodieRealtimeFileSplit bs = new HoodieRealtimeFileSplit(file, start, length, hosts);
  bs.setBelongsToIncrementalQuery(belongsToIncrementalQuery);
Contributor

Why is pathWithBootstrapFileStatus not considered in this buildSplit? Or do we need to have another builder?

@alexeykudinkin alexeykudinkin force-pushed the ak/rpath-ref-8 branch 3 times, most recently from 69659d7 to fdc9cd0 Compare February 11, 2022 02:13
@alexeykudinkin alexeykudinkin changed the title [HUDI-3280][Stacked on 4669] Cleaning up Hive-related hierarchies after refactoring [HUDI-3280] Cleaning up Hive-related hierarchies after refactoring Feb 11, 2022
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

Contributor

@yihua yihua left a comment


Overall LGTM. @nsivabalan could you take another look?

Path fullPath = relativeFilePath != null ? FSUtils.getPartitionPath(basePath, relativeFilePath) : null;
if (fullPath != null) {
  long blockSize = FSUtils.getFs(fullPath.toString(), hadoopConf).getDefaultBlockSize(fullPath);
  FileStatus fileStatus = new FileStatus(stat.getFileSizeInBytes(), false, 0, 0,
Contributor

Should the block size be extracted outside the loop, based on the file system of the base path, since it's a file-system config?

Contributor Author

It's based on the path, so I don't want to bake in the assumption that all paths are homogeneous (they might be pointing at different HDFS nodes with different block settings).

Member

Block size is a per-file thing in HDFS, but it's rarely ever set differently. For cloud storage it's all 0s. We need to see if getDefaultBlockSize() is an RPC call; if so, we need to avoid this, even if that means assuming things.

Contributor Author

Let me check. The reason I'm fixing this is that with a block size of 0, Hive will slice the file into 1-byte splits.

Contributor Author

@vinothchandar there's no RPC in this path (the config is stored within DFSClient).

Contributor

@alexeykudinkin I think I confused it with the getDefaultBlockSize() call that is based on the file system (see below), not the file status, and it only fetches from the config. This is fine.

Even if the block size is 0, shouldn't Hive still honor the actual block size of the file? At least that's my understanding of the Trino Hive connector.

/**
 * Return the number of bytes that large input files should be optimally
 * be split into to minimize i/o time.
 * @deprecated use {@link #getDefaultBlockSize(Path)} instead
 */
@Deprecated
public long getDefaultBlockSize() {
  // default to 32MB: large enough to minimize the impact of seeks
  return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}

Comment on lines +35 to +53
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
  return super.isSplitable(fs, filename);
}

@Override
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
  return super.makeSplit(file, start, length, hosts);
}

@Override
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts) {
  return super.makeSplit(file, start, length, hosts, inMemoryHosts);
}

@Override
protected FileStatus[] listStatus(JobConf job) throws IOException {
  return super.listStatus(job);
}
Contributor

These seem unnecessary?

Contributor Author

It's subtle: the Parquet-based hierarchy doesn't inherit from this one, so those classes can't access these methods (which are protected in FileInputFormat). With the methods re-declared here, they now can, because this class is within the same package as the classes trying to access them.

Contributor

Got it, I see your point now. It would be good to add the reasoning as Javadoc on this class.
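The visibility trick described above can be sketched in one file. All names here are hypothetical stand-ins: Base plays the role of Hadoop's FileInputFormat, which in reality lives in a different package, so the cross-package boundary is only indicated in comments.

```java
// Base stands in for a class from ANOTHER package whose methods are protected;
// in that situation only subclasses (or that package's own classes) may call them.
class Base {
    protected String makeSplit() { return "split"; }
}

// Re-declaring the protected method in a subclass that lives alongside the
// callers makes it reachable to them: protected access in Java includes
// package access at the site of the declaring class.
class Widened extends Base {
    @Override
    protected String makeSplit() { return super.makeSplit(); }
}

public class Caller {
    public static void main(String[] args) {
        // Legal because Caller shares a package with Widened, which re-declares makeSplit().
        System.out.println(new Widened().makeSplit()); // prints "split"
    }
}
```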

* Hence, this class tracks a log/base file status
* in Path.
*/
public class RealtimeFileStatus extends FileStatus {
Contributor

Do you want to rename this and the other classes with the Realtime word as well?

Contributor Author

Kept the Realtime naming as an homage to the previous state of things. I don't really want to rename pretty much all of the classes in the Hive hierarchy (and essentially lose their history).

Member

Realtime is a remnant of the design motivations for MOR snapshot queries. We might even revive it after the caching story.

Member

@vinothchandar vinothchandar left a comment


Took a first pass. Can you confirm a few things?


}

@Override
public final void setConf(Configuration conf) {
Member

I am blanking now, but it's worth tracing why we needed these, i.e. whether there are any other Hive gotchas around combine input format etc.

Contributor Author

They still stay, just moved into HoodieTableInputFormat


return createRealtimeFileSplit(path, start, length, hosts);
}

private static boolean containsIncrementalQuerySplits(List<FileSplit> fileSplits) {
Member

Are these methods just moved over/consolidated?

Contributor Author

Mostly. createRealtimeFileStatusUnchecked was also refactored to accommodate the RealtimeFileStatus changes.
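For context, a predicate with the signature quoted above might look like the following sketch. Only the signature appears in the diff, so the body, the getter name, and the stand-in split types are assumptions (the real code uses Hadoop's FileSplit and Hudi's HoodieRealtimeFileSplit):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Hadoop's FileSplit and Hudi's realtime split type.
class FileSplit {}

class RealtimeSplit extends FileSplit {
    private boolean belongsToIncrementalQuery;

    void setBelongsToIncrementalQuery(boolean value) { this.belongsToIncrementalQuery = value; }
    boolean getBelongsToIncrementalQuery() { return belongsToIncrementalQuery; }
}

public class SplitScan {
    // Assumed body: true if any split in the list was produced for an incremental query.
    static boolean containsIncrementalQuerySplits(List<FileSplit> fileSplits) {
        return fileSplits.stream().anyMatch(split ->
            split instanceof RealtimeSplit
                && ((RealtimeSplit) split).getBelongsToIncrementalQuery());
    }

    public static void main(String[] args) {
        RealtimeSplit incremental = new RealtimeSplit();
        incremental.setBelongsToIncrementalQuery(true);

        System.out.println(containsIncrementalQuerySplits(Arrays.asList(new FileSplit(), incremental))); // true
        System.out.println(containsIncrementalQuerySplits(Arrays.asList(new FileSplit())));              // false
    }
}
```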

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@yihua yihua merged commit aaddaf5 into apache:master Feb 16, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022