Conversation

@alexeykudinkin
Contributor

What is the purpose of the pull request

Unify Hive's MOR implementations to avoid duplication across implementations for different file formats (Parquet, HFile, etc.)

Brief change log

  • Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base)
  • Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase
  • Tidying up
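
The layering described in the change log can be pictured with the minimal sketch below. It is not the actual Hudi code: the class names ending in `Sketch`, the `listFiles` helper, the `baseFileExtension` hook, and the `.log.` naming check are all hypothetical simplifications. Only the hook name `includeLogFilesForSnapshotView` appears in this PR's diff, and having the shared MOR base return `true` for it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the COW base class (not the real Hudi API).
abstract class CowInputFormatSketch {
    // Hook name taken from the diff; its semantics here are assumed.
    protected abstract boolean includeLogFilesForSnapshotView();

    // Hypothetical format-specific hook: extension of base files.
    protected abstract String baseFileExtension();

    // Format-agnostic selection logic, written once in the base class.
    List<String> listFiles(List<String> allFiles) {
        List<String> selected = new ArrayList<>();
        for (String f : allFiles) {
            boolean isBase = f.endsWith(baseFileExtension());
            boolean isLog = f.contains(".log."); // assumed naming, for illustration only
            if (isBase || (isLog && includeLogFilesForSnapshotView())) {
                selected.add(f);
            }
        }
        return selected;
    }
}

// Hypothetical stand-in for the extracted MOR base: MOR behaviour lives here once.
abstract class RealtimeInputFormatSketch extends CowInputFormatSketch {
    @Override
    protected boolean includeLogFilesForSnapshotView() {
        return true; // assumption: MOR snapshot reads merge log files
    }
}

// Format-specific leaves only supply what differs per file format.
final class ParquetRealtimeSketch extends RealtimeInputFormatSketch {
    @Override
    protected String baseFileExtension() {
        return ".parquet";
    }
}

final class HFileRealtimeSketch extends RealtimeInputFormatSketch {
    @Override
    protected String baseFileExtension() {
        return ".hfile";
    }
}
```

In this sketch, adding another file format would mean adding one more small leaf class rather than duplicating the MOR logic.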

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

alexeykudinkin force-pushed the ak/rpath-ref-5 branch 7 times, most recently from a70ea22 to 7798caf, on January 14, 2022 23:16
alexeykudinkin force-pushed the ak/rpath-ref-5 branch 2 times, most recently from 060a816 to b557e6b, on January 19, 2022 21:35
alexeykudinkin force-pushed the ak/rpath-ref-5 branch 2 times, most recently from 0aa3cea to e68070c, on January 21, 2022 20:22
yihua self-assigned this on Jan 21, 2022
@garyli1019
Member

Hi @alexeykudinkin, a high-level question: is this approach using the Hudi metadata table to fetch the latest files? It would be great if we could get rid of file listing while unifying the reader code path.

@alexeykudinkin
Contributor Author

@garyli1019 Hive has been using the metadata table (MT) as of #4531

alexeykudinkin changed the title from "[HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication" to "[HUDI-3206] Unify Hive's MOR implementations to avoid duplication" on Feb 3, 2022
@alexeykudinkin
Contributor Author

@hudi-bot run azure

val (uuid, idx) = uuids.zipWithIndex.find { case (uuid, _) => fileName.contains(uuid) }.get
fileName.replace(uuid, idx.toString)
val uuid = uuids.find(uuid => fileName.contains(uuid)).get
fileName.replace(uuid, "xxx")
Contributor Author

This is to fix test flakiness
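
One plausible reading of the two variants in the snippet above: replacing the matched UUID with its index makes the normalized name depend on the order in which the UUIDs were collected, while replacing it with a fixed placeholder does not. The sketch below illustrates the placeholder variant in Java. `FileNameNormalizer`, `PLACEHOLDER`, and `normalize` are hypothetical names, not part of the Hudi test code, and unlike the Scala snippet (which calls `.get`) this version returns the name unchanged when no UUID matches.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper illustrating order-independent file-name normalization.
final class FileNameNormalizer {
    static final String PLACEHOLDER = "xxx";

    // Replaces the first UUID found in the file name with a constant placeholder,
    // so the result does not depend on the position of the UUID in the list.
    static String normalize(String fileName, List<String> uuids) {
        Optional<String> match = uuids.stream()
                .filter(fileName::contains)
                .findFirst();
        return match.map(uuid -> fileName.replace(uuid, PLACEHOLDER))
                .orElse(fileName);
    }
}
```

With this approach, the same file name normalizes to the same string regardless of how the UUID list is ordered, which is what a comparison against a checked-in reference file needs.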

protected FileSplit makeSplit(Path file, long start, long length,
String[] hosts, String[] inMemoryHosts) {
FileSplit split = new FileSplit(file, start, length, hosts, inMemoryHosts);
if (file instanceof PathWithBootstrapFileStatus) {
Contributor

Can you help me understand why I don't see different FileStatus types here, such as RealtimeFileStatus, but only PathWithBootstrapFileStatus?

Contributor

My understanding is that these methods are moved from HoodieParquetInputFormat. @alexeykudinkin is the logic Parquet-agnostic, and is it going to be reused by other file formats like HFile? If that's not the case, it would be better to keep them in HoodieParquetInputFormat, since this base class is going to be general for different formats.

Contributor Author

This is used only for COW

Contributor

yihua, Feb 5, 2022

@alexeykudinkin what about my question above regarding whether the logic should reside in HoodieParquetInputFormat or in this class?

Contributor Author

@yihua GitHub comments are weird: when I was responding to @nsivabalan, none of your comments were visible, and then it sandwiched your comment in between those.

My understanding is that these methods are moved from HoodieParquetInputFormat. @alexeykudinkin is the logic parquet agnostic and going to be reused by other file formats like HFile? I think if that's not the case, better to keep them in HoodieParquetInputFormat, since this base class is going to be general for different formats.

Your understanding is correct, all of this logic is file-format agnostic.

Contributor

sounds good!

Contributor

@alexeykudinkin no worries. That sounds good.

return timeline;
}
// NOTE: We're only using {@code HoodieHFileInputFormat} to compose {@code RecordReader}
private final HoodieHFileInputFormat hFileInputFormat = new HoodieHFileInputFormat();
Contributor

There are quite a few differences between HFileRealtimeInputFormat and ParquetRealtimeInputFormat. For example, in the HFile case getSplits just calls HoodieRealtimeInputFormatUtils.getRealtimeSplits(job, fileSplits), whereas in the Parquet case we do both realtimeSplits and incrementalSplits.
And then:

  @Override
  protected boolean includeLogFilesForSnapshotView() {
    return false;
  }
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    // This file isn't splittable.
    return false;
  }

These are different in the Parquet realtime input format.

filterInstantsTimeline(HoodieDefaultTimeline timeline) is empty in the HFile realtime input format. After unification, I don't see these overridden here in HoodieHFileRealtimeInputFormat.

Can you help me understand, please?

Contributor Author

In principle, COW/MOR handling should be independent of the underlying file format.

Good catch with isSplitable.

Contributor

I get it. But in master we disabled log files for the HFile realtime input format, and now we are enabling them, is that right?
Excerpt from HoodieHFileInputFormat in master:

  @Override
  protected boolean includeLogFilesForSnapshotView() {
    return true;
  }


protected abstract boolean includeLogFilesForSnapshotView();
@Override
protected FileSplit makeSplit(Path file, long start, long length,
Contributor

nit: put two makeSplit() methods together?

Contributor Author

Will clean this up in the final PR #4743

return targetFiles;
}

private void validate(List<FileStatus> targetFiles, List<FileStatus> legacyFileStatuses) {
Contributor

Are the methods below merely moved? Looks like the order changes.

Contributor Author

Yeah, GJF rule is applied

(non-static methods)
public
protected
private

(static methods)
public
protected
private

{"c1_maxValue":486,"c1_minValue":59,"c1_num_nulls":0,"c2_maxValue":" 79sdc","c2_minValue":" 111sdc","c2_num_nulls":0,"c3_maxValue":771.590,"c3_minValue":82.111,"c3_num_nulls":0,"c5_maxValue":50,"c5_minValue":7,"c5_num_nulls":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-22","c6_num_nulls":0,"c7_maxValue":"5g==","c7_minValue":"Ow==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00002-1-c000.snappy.parquet"}
{"c1_maxValue":959,"c1_minValue":0,"c1_num_nulls":0,"c2_maxValue":" 959sdc","c2_minValue":" 0sdc","c2_num_nulls":0,"c3_maxValue":916.697,"c3_minValue":19.000,"c3_num_nulls":0,"c5_maxValue":97,"c5_minValue":1,"c5_num_nulls":0,"c6_maxValue":"2020-11-22","c6_minValue":"2020-01-01","c6_num_nulls":0,"c7_maxValue":"yA==","c7_minValue":"AA==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-0-c000.snappy.parquet"}
{"c1_maxValue":272,"c1_minValue":8,"c1_num_nulls":0,"c2_maxValue":" 8sdc","c2_minValue":" 129sdc","c2_num_nulls":0,"c3_maxValue":979.272,"c3_minValue":430.129,"c3_num_nulls":0,"c5_maxValue":28,"c5_minValue":2,"c5_num_nulls":0,"c6_maxValue":"2020-11-20","c6_minValue":"2020-03-23","c6_num_nulls":0,"c7_maxValue":"8A==","c7_minValue":"Ag==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-1-c000.snappy.parquet"}
{"c1_maxValue":272,"c1_minValue":8,"c1_num_nulls":0,"c2_maxValue":" 8sdc","c2_minValue":" 129sdc","c2_num_nulls":0,"c3_maxValue":979.272,"c3_minValue":430.129,"c3_num_nulls":0,"c5_maxValue":28,"c5_minValue":2,"c5_num_nulls":0,"c6_maxValue":"2020-11-20","c6_minValue":"2020-03-23","c6_num_nulls":0,"c7_maxValue":"8A==","c7_minValue":"Ag==","c7_num_nulls":0,"c8_maxValue":9,"c8_minValue":9,"c8_num_nulls":0,"file":"part-00003-xxx-c000.snappy.parquet"}
Contributor

Why are these two files changed? To fix test flakiness?

Contributor Author

Correct

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Feb 5, 2022

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

yihua left a comment

LGTM. Are there already unit/functional tests around HFile input format?

Contributor

nsivabalan left a comment

LGTM. Please check whether we have tests for HFileInputFormat (regular and realtime). If not, can you file a tracking JIRA and take it up later? Let's proceed with landing all stacked PRs.

nsivabalan merged commit 3f263b8 into apache:master on Feb 7, 2022
@alexeykudinkin
Contributor Author

@nsivabalan We do have tests for all input formats (Parquet, HFile, ORC).

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022