-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-3206] Unify Hive's MOR implementations to avoid duplication #4559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a70ea22 to
7798caf
Compare
060a816 to
b557e6b
Compare
0aa3cea to
e68070c
Compare
e68070c to
8a5f12f
Compare
|
hi @alexeykudinkin , a high level question. Is this approach using the hudi metadata table to fetch latest file? That would be great if we can get rid of file listing while unifying the code path for reader. |
|
@garyli1019 Hive is using MT as of #4531 |
c18bd33 to
6d402f1
Compare
…e` (those are not file-format specific)
… MOR functionality
6d402f1 to
b2598cb
Compare
|
@hudi-bot run azure |
8849116 to
3a9315b
Compare
3a9315b to
0a7342e
Compare
| val (uuid, idx) = uuids.zipWithIndex.find { case (uuid, _) => fileName.contains(uuid) }.get | ||
| fileName.replace(uuid, idx.toString) | ||
| val uuid = uuids.find(uuid => fileName.contains(uuid)).get | ||
| fileName.replace(uuid, "xxx") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to fix test flakiness
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java
Show resolved
Hide resolved
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java
Show resolved
Hide resolved
| protected FileSplit makeSplit(Path file, long start, long length, | ||
| String[] hosts, String[] inMemoryHosts) { | ||
| FileSplit split = new FileSplit(file, start, length, hosts, inMemoryHosts); | ||
| if (file instanceof PathWithBootstrapFileStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you help me understand why don't I see diff FileStatuses here like RealtimeFileStatus for eg, but just PathWithBootstrapFileStatus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that these methods are moved from HoodieParquetInputFormat. @alexeykudinkin is the logic parquet agnostic and going to be reused by other file formats like HFile? I think if that's not the case, better to keep them in HoodieParquetInputFormat, since this base class is going to be general for different formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used only for COW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykudinkin what about my above question regarding whether the logic should reside in HoodieParquetInputFormat or this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yihua GH comments are weird, when i was responding to @nsivabalan there were no comments of yours, and then it sandwiched your comment in b/w of those.
My understanding is that these methods are moved from HoodieParquetInputFormat. @alexeykudinkin is the logic parquet agnostic and going to be reused by other file formats like HFile? I think if that's not the case, better to keep them in HoodieParquetInputFormat, since this base class is going to be general for different formats.
Your understanding is correct, all of this logic is file-format agnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykudinkin no worries. That sg.
| return timeline; | ||
| } | ||
| // NOTE: We're only using {@code HoodieHFileInputFormat} to compose {@code RecordReader} | ||
| private final HoodieHFileInputFormat hFileInputFormat = new HoodieHFileInputFormat(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quite a few things different in HFileRealtimeInputFormat compared to ParquetRealtimeInputFormat. for eg, getSplits just does HoodieRealtimeInputFormatUtils.getRealtimeSplits(job, fileSplits) in case of HFile, where as in case of Parquet, we do both realtimeSplits and incrementalSplits.
and then
@Override
protected boolean includeLogFilesForSnapshotView() {
return false;
}
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
// This file isn't splittable.
return false;
}
these are diff in parquet real time IF.
filterInstantsTimeline(HoodieDefaultTimeline timeline) is empty in case of HFileRTIF. after unification, I don't see these are overridden here in HoodieHFileRealtimeInputFormat.
can you help me understand please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, COW/MOR handling should be invariant of file-format used underneath.
Good catch w/ isSplitable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it. but in master, we disabled log files for HFile realtime IF and now we are enabling is it?
excerpt from HoodieHFileInputFormat in master.
@Override
protected boolean includeLogFilesForSnapshotView() {
return true;
}
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileInputFormatBase.java
Show resolved
Hide resolved
|
|
||
| protected abstract boolean includeLogFilesForSnapshotView(); | ||
| @Override | ||
| protected FileSplit makeSplit(Path file, long start, long length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put two makeSplit() methods together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will clean this up in the final PR #4743
| protected FileSplit makeSplit(Path file, long start, long length, | ||
| String[] hosts, String[] inMemoryHosts) { | ||
| FileSplit split = new FileSplit(file, start, length, hosts, inMemoryHosts); | ||
| if (file instanceof PathWithBootstrapFileStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that these methods are moved from HoodieParquetInputFormat. @alexeykudinkin is the logic parquet agnostic and going to be reused by other file formats like HFile? I think if that's not the case, better to keep them in HoodieParquetInputFormat, since this base class is going to be general for different formats.
| return targetFiles; | ||
| } | ||
|
|
||
| private void validate(List<FileStatus> targetFiles, List<FileStatus> legacyFileStatuses) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the methods below merely moved? Looks like the order changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, GJF rule is applied
(non-static methods)
public
protected
private
(static methods)
public
protected
private
| {"c1_maxValue":486,"c1_minValue":59,"c1_num_nulls":0,"c2_maxValue":" 79sdc","c2_minValue":" 111sdc","c2_num_nulls":0,"c3_maxValue":771.590,"c3_minValue":82.111,"c3_num_nulls":0,"c5_maxValue":50,"c5_minValue":7,"c5_num_nulls":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-22","c6_num_nulls":0,"c7_maxValue":"5g==","c7_minValue":"Ow==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00002-1-c000.snappy.parquet"} | ||
| {"c1_maxValue":959,"c1_minValue":0,"c1_num_nulls":0,"c2_maxValue":" 959sdc","c2_minValue":" 0sdc","c2_num_nulls":0,"c3_maxValue":916.697,"c3_minValue":19.000,"c3_num_nulls":0,"c5_maxValue":97,"c5_minValue":1,"c5_num_nulls":0,"c6_maxValue":"2020-11-22","c6_minValue":"2020-01-01","c6_num_nulls":0,"c7_maxValue":"yA==","c7_minValue":"AA==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-0-c000.snappy.parquet"} | ||
| {"c1_maxValue":272,"c1_minValue":8,"c1_num_nulls":0,"c2_maxValue":" 8sdc","c2_minValue":" 129sdc","c2_num_nulls":0,"c3_maxValue":979.272,"c3_minValue":430.129,"c3_num_nulls":0,"c5_maxValue":28,"c5_minValue":2,"c5_num_nulls":0,"c6_maxValue":"2020-11-20","c6_minValue":"2020-03-23","c6_num_nulls":0,"c7_maxValue":"8A==","c7_minValue":"Ag==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-1-c000.snappy.parquet"} | ||
| {"c1_maxValue":272,"c1_minValue":8,"c1_num_nulls":0,"c2_maxValue":" 8sdc","c2_minValue":" 129sdc","c2_num_nulls":0,"c3_maxValue":979.272,"c3_minValue":430.129,"c3_num_nulls":0,"c5_maxValue":28,"c5_minValue":2,"c5_num_nulls":0,"c6_maxValue":"2020-11-20","c6_minValue":"2020-03-23","c6_num_nulls":0,"c7_maxValue":"8A==","c7_minValue":"Ag==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-xxx-c000.snappy.parquet"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these two files changed? To fix test flakyness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
|
@hudi-bot run azure |
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Are there already unit/functional tests around HFile input format?
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. check if we have tests for HFileInputFormat (regular and realtime). If not, can you file a tracking jira and take it up later. lets proceed w/ landing all stacked PRs.
|
@nsivabalan we do have tests for all input formats (Parquet, HFile, Orc) |
…ache#4559) Unify Hive's MOR implementations to avoid duplication to avoid duplication across implementations for different file-formats (Parquet, HFile, etc) - Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base) - Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase - Tidying up
…ache#4559) Unify Hive's MOR implementations to avoid duplication to avoid duplication across implementations for different file-formats (Parquet, HFile, etc) - Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base) - Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase - Tidying up
Tips
What is the purpose of the pull request
Unify Hive's MOR implementations to avoid duplication to avoid duplication across implementations for different file-formats (Parquet, HFile, etc)
Brief change log
HoodieRealtimeFileInputFormatBase(extending COWHoodieFileInputFormatBasebase)HoodieRealtimeFileInputFormatBaseVerify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.