[HUDI-3280] Cleaning up Hive-related hierarchies after refactoring #4743
Conversation
Force-pushed 6104acf to 75000d7
nsivabalan left a comment:
If you don't mind, can you summarize what the different InputSplits are (after all the refactoring and fixes) and what each split is used for?
    public HoodieRealtimeFileSplit buildSplit(Path file, long start, long length, String[] hosts) {
      HoodieRealtimeFileSplit bs = new HoodieRealtimeFileSplit(file, start, length, hosts);
      bs.setBelongsToIncrementalQuery(belongsToIncrementalQuery);
Why is pathWithBootstrapFileStatus not considered in this buildSplit? Or do we need to have another builder?
Force-pushed 69659d7 to fdc9cd0
Force-pushed 61705e3 to 62ea4a5
@hudi-bot run azure
Moved `BaseFileWithLogsSplit` under `realtime` package
…ic is spilled into COW impl
Force-pushed 8187ca2 to a124630
yihua left a comment:
Overall LGTM. @nsivabalan could you take another look?
    Path fullPath = relativeFilePath != null ? FSUtils.getPartitionPath(basePath, relativeFilePath) : null;
    if (fullPath != null) {
      long blockSize = FSUtils.getFs(fullPath.toString(), hadoopConf).getDefaultBlockSize(fullPath);
      FileStatus fileStatus = new FileStatus(stat.getFileSizeInBytes(), false, 0, 0,
Should the block size be extracted outside the loop, based on the file system of the base path, since it's a file-system config?
It's based on the path, so I don't want to bake in the assumption that all paths are homogeneous (they might be pointing at different HDFS nodes, with different block settings).
Block size is a per-file thing in HDFS, but rarely ever set differently. For cloud storage it's all 0s. We need to see if getDefaultBlockSize() is an RPC call; if so, we need to avoid this, even if that assumes things.
Let me check. The reason why I'm fixing this is because with block size 0, Hive will slice the file into 1-byte blocks.
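To see why a zero block size is so damaging, here is a standalone sketch of the split-size arithmetic (the class below is purely illustrative, not Hudi code; `computeSplitSize` mirrors the `max(minSize, min(goalSize, blockSize))` formula used by Hadoop's mapred `FileInputFormat`):

```java
// Illustrative demo, not Hudi code: mirrors Hadoop's split-size formula,
// max(minSize, min(goalSize, blockSize)).
public class SplitSizeDemo {
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long goalSize = 128L * 1024 * 1024; // roughly total bytes / desired split count
    long minSize = 1;                   // default minimum split size

    // With a sane HDFS block size, splits track the block size:
    System.out.println(computeSplitSize(goalSize, minSize, 128L * 1024 * 1024)); // 134217728

    // With blockSize == 0 (as reported for some storage paths), the formula
    // collapses to minSize, i.e. 1-byte splits:
    System.out.println(computeSplitSize(goalSize, minSize, 0)); // 1
  }
}
```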
@vinothchandar there's no RPC in the path (the config is stored within DFSClient).
@alexeykudinkin I think I confused it with the getDefaultBlockSize() call, which is based on the file system (see below), not the file status, and it only fetches from the config. This is fine.
Even if the block size is 0, should Hive still honor the actual block size of the file? At least that's my understanding for the Trino Hive connector.
    /**
     * Return the number of bytes that large input files should be optimally
     * be split into to minimize i/o time.
     * @deprecated use {@link #getDefaultBlockSize(Path)} instead
     */
    @Deprecated
    public long getDefaultBlockSize() {
      // default to 32MB: large enough to minimize the impact of seeks
      return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
    }
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
      return super.isSplitable(fs, filename);
    }

    @Override
    protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
      return super.makeSplit(file, start, length, hosts);
    }

    @Override
    protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts) {
      return super.makeSplit(file, start, length, hosts, inMemoryHosts);
    }

    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
      return super.listStatus(job);
    }
These seem unnecessary?
It's subtle: the Parquet-based hierarchy isn't inheriting from this one, so those classes can't access these methods (which are protected in FileInputFormat). With the methods re-declared here, they now can, because this class is within the same package as the classes trying to access them.
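For readers puzzling over this, the trick relies on Java's access rules: a protected member is accessible both to subclasses and to classes in the same package as the class that declares it. A rough two-file sketch (the helper class name is illustrative, not actual Hudi code):

```java
// File 1: org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
// The subclass re-declares the protected method, so the override is now a
// member declared inside the org.apache.hudi.hadoop package.
public class HoodieCopyOnWriteTableInputFormat extends FileInputFormat {
  @Override
  protected FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
    return super.makeSplit(file, start, length, hosts);
  }
}

// File 2 (hypothetical): same package, but NOT a FileInputFormat subclass.
// It could not call FileInputFormat.makeSplit directly (protected, other
// package), but it CAN call the re-declared member above, since protected
// access includes same-package access.
class SomeParquetHelper {
  FileSplit split(HoodieCopyOnWriteTableInputFormat fmt, Path file, long len, String[] hosts) {
    return fmt.makeSplit(file, 0, len, hosts); // legal: same-package access
  }
}
```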
Got it. I see your point now. It would be good to add the reasoning as javadocs of this class.
     * Hence, this class tracks a log/base file status in Path.
     */
    public class RealtimeFileStatus extends FileStatus {
Do you want to rename this and the other classes with the Realtime word as well?
Kept the naming as Realtime as an homage to the previous state of things. I don't really want to rename pretty much all of the classes in the Hive hierarchy (and essentially cut off their history).
Realtime is a remnant of the design motivations for MOR snapshot queries. We might even revive it after the caching story.
vinothchandar left a comment:
Took a first pass. Can you confirm a few things?
    }

    @Override
    public final void setConf(Configuration conf) {
I am blanking now, but it's worth tracing why we needed these, i.e. any other Hive gotchas around combine input format, etc.?
They still stay, just moved into HoodieTableInputFormat
      return createRealtimeFileSplit(path, start, length, hosts);
    }

    private static boolean containsIncrementalQuerySplits(List<FileSplit> fileSplits) {
Are these methods just moved over/consolidated?
Mostly. createRealtimeFileStatusUnchecked was also refactored to accommodate the RealtimeFileStatus changes.
…(true for Hive, false for Spark)
What is the purpose of the pull request
This PR cleans up Hive-related hierarchies after refactoring
Brief change log
- `HoodieRealtimeFileSplit` w/ `BaseWithLogFilesSplit`
- `RealtimeBootstrapBaseFileSplit`
- `HoodieInputFormatUtils` closer to where they're used

List of commits in this patch:
https://github.com/apache/hudi/pull/4743/files/7d12ce874c3bfb9861b00449b90ebde0125388b7..9501b4ca02751b4b1223650fd85945691016f36d
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.