[HUDI-651] Fix incremental queries in hive for MOR tables #1817
Conversation
- HoodieTestUtils changes to include fake write stats per affected file.
- Refactor AbstractRealtimeRecordReader to support additional FileSplit types. Currently it only takes HoodieRealtimeFileSplit; in the future we can add another constructor with FileSplit for incremental queries.
- This commit addresses two issues:
  1. Honors the end time if it is less than the most recent completed commit time.
  2. Does not require a base parquet file to be present when the begin and end times match only the delta commits.
  To achieve this:
  - Created a separate FileSplit for handling incremental queries.
  - New RecordReader to handle the new FileSplit.
  - FileSlice scanner to scan the files in a file slice. It first takes the base parquet file (if present) and applies the merged records from all log files in that slice; if the base file is not present, it returns the merged records from the log files on scanning (a rough sketch of this merge logic follows below).
  - HoodieParquetRealtimeInputFormat modified to switch to HoodieMORIncrementalFileSplit and HoodieMORIncrementalRecordReader from getSplit(..) and getRecordReader(..) in the case of incremental queries.
  - Includes a unit test covering different incremental queries.
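The following is a minimal sketch of the merge behavior described above, not the PR's actual implementation; the class and the readBaseRecords/scanLogRecords helpers are hypothetical placeholders standing in for the parquet and log readers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class FileSliceMergeSketch {

  // Merge the records of one file slice: start from the base parquet file if it
  // exists, then overlay the merged log records keyed by record key.
  static Map<String, Object> mergeFileSlice(Optional<String> baseFilePath,
                                            Iterable<String> logFilePaths) {
    Map<String, Object> merged = new HashMap<>();

    // 1. Base file records, if a base parquet file is present in the slice.
    baseFilePath.ifPresent(path -> merged.putAll(readBaseRecords(path)));

    // 2. Log records override/append; when no base file exists, the result is
    //    simply the merged log records.
    merged.putAll(scanLogRecords(logFilePaths));
    return merged;
  }

  // Hypothetical placeholders for the actual parquet and log readers.
  static Map<String, Object> readBaseRecords(String path) { return new HashMap<>(); }
  static Map<String, Object> scanLogRecords(Iterable<String> paths) { return new HashMap<>(); }
}
```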
@garyli1019 Please take a look at this and provide feedback.
private final List<FileSlice> fileSlices;

// Final map of compacted/merged records
// TODO change map to external spillable map. But ArrayWritable is not implementing Serializable
We are merging parquet and log files and need to assume that either of them can be absent. I kept the map interface as Map<String, ArrayWritable>, but since ArrayWritable does not implement Serializable, that prevents the use of ExternalSpillableMap. Appreciate any ideas (a hypothetical workaround is sketched below).
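One possible workaround, offered only as an idea and not part of this PR: ArrayWritable is not java.io.Serializable, but it is a Hadoop Writable, so it can be round-tripped through bytes, and a spillable map could store those bytes instead of the object. The Text value class below is just an example assumption.

```java
import java.io.IOException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;

class ArrayWritableBytes {

  static byte[] toBytes(ArrayWritable value) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    value.write(out);                                  // Writable serialization
    byte[] bytes = new byte[out.getLength()];
    System.arraycopy(out.getData(), 0, bytes, 0, out.getLength());
    return bytes;
  }

  static ArrayWritable fromBytes(byte[] bytes) throws IOException {
    // The value class must be known up front; Text is only an example here.
    ArrayWritable value = new ArrayWritable(Text.class);
    DataInputBuffer in = new DataInputBuffer();
    in.reset(bytes, bytes.length);
    value.readFields(in);                              // Writable deserialization
    return value;
  }
}
```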
// is this an incremental query
List<String> incrementalTables = getIncrementalTableNames(Job.getInstance(job));
if (!incrementalTables.isEmpty()) {
  //TODO For now assuming the query can be either incremental or snapshot and NOT both.
This is an assumption for now: we are not touching the snapshot query path, and the new incremental path is taken whenever the list of incremental tables is non-empty. In the future this might have to change. Or maybe there is a better way to apply this constraint than relying simply on the incremental table names? (One possible check is sketched below.)
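Just as an illustration of how the "incremental XOR snapshot" assumption could be enforced explicitly rather than implied by the presence of incremental table names; this is a hypothetical helper, not code from the PR.

```java
import java.util.List;

class QueryModeCheckSketch {

  // Returns true if the query should take the incremental path; rejects queries
  // that mix incremental and snapshot tables, which is not supported yet.
  static boolean isIncrementalQuery(List<String> queriedTables, List<String> incrementalTables) {
    long incrementalCount = queriedTables.stream().filter(incrementalTables::contains).count();
    if (incrementalCount > 0 && incrementalCount < queriedTables.size()) {
      throw new IllegalArgumentException(
          "Query mixes incremental and snapshot tables; this is not supported");
    }
    return incrementalCount > 0;
  }
}
```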
private Map<String, List<FileStatus>> listStatusForAffectedPartitions(
    Path basePath, List<HoodieInstant> commitsToCheck, HoodieTimeline timeline) throws IOException {
  // Extract files touched by these commits.
  // TODO This might need to be done in parallel like listStatus parallelism ?
There is scope for parallelizing the listing here, but it should not be an immediate blocker (a minimal sketch of one approach follows below).
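A minimal sketch of parallelizing the per-commit work with a Java parallel stream, assuming the per-commit metadata extraction is independent; extractFilesTouchedByCommit is a hypothetical placeholder, and real code may prefer an engine-managed thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ParallelListingSketch {

  static Map<String, List<String>> listAffectedFiles(List<String> commitsToCheck) {
    Map<String, List<String>> partitionToFiles = new ConcurrentHashMap<>();
    // Process each commit's metadata in parallel and merge the per-partition results.
    commitsToCheck.parallelStream().forEach(commit ->
        extractFilesTouchedByCommit(commit).forEach((partition, files) ->
            partitionToFiles.merge(partition, files, (existing, incoming) -> {
              List<String> combined = new ArrayList<>(existing);
              combined.addAll(incoming);
              return combined;
            })));
    return partitionToFiles;
  }

  // Placeholder for reading one commit's metadata and extracting touched files per partition.
  static Map<String, List<String>> extractFilesTouchedByCommit(String commit) {
    return new ConcurrentHashMap<>();
  }
}
```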
String relativeFilePath = stat.getPath();
Path fullPath = relativeFilePath != null ? FSUtils.getPartitionPath(basePath, relativeFilePath) : null;
if (fullPath != null) {
  //TODO Should the length of file be totalWriteBytes or fileSizeInBytes?
@bhasudha : totalWriteBytes = fileSizeInBytes for base files (parquet). For log files this is not the case, as we use a heuristic to estimate the bytes written per delta commit.
Getting the actual file size would require an RPC call and would be costly here. Also, looking at where the file size is useful in the read path, it is only needed for combining and splitting file splits, which is based on the base file for Realtime. So it should be fine to use stat.getTotalWriteBytes(), with one change.
Since the same log file can appear in multiple delta commits (for HDFS and other file systems supporting appends), the logic below needs to handle that. You can simply accumulate the cumulative write bytes for each log file appearing across delta commits to get a better approximate size of the log files (a sketch of this aggregation follows below).
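A rough sketch of the suggested aggregation, assuming HoodieWriteStat exposes getPath() and getTotalWriteBytes() as used in the diff above and lives under org.apache.hudi.common.model; everything else is illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hudi.common.model.HoodieWriteStat;

class CumulativeLogSizeSketch {

  static Map<String, Long> cumulativeSizes(List<HoodieWriteStat> statsAcrossCommits) {
    Map<String, Long> pathToBytes = new HashMap<>();
    for (HoodieWriteStat stat : statsAcrossCommits) {
      // The same relative path may appear in multiple delta commits (appends);
      // summing the per-commit write bytes approximates the final log file size
      // and also deduplicates the file into a single entry.
      pathToBytes.merge(stat.getPath(), stat.getTotalWriteBytes(), Long::sum);
    }
    return pathToBytes;
  }
}
```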
//TODO Should the length of file be totalWriteBytes or fileSizeInBytes?
FileStatus fs = new FileStatus(stat.getTotalWriteBytes(), false, 0, 0,
    0, fullPath);
partitionToFileStatusesMap.get(entry.getKey()).add(fs);
need to handle duplicate log files here.
garyli1019 left a comment:
Hi @bhasudha, I took an initial pass and have a few questions.
I thought about the scenario where the base file is not included but the log files should be included in the incremental query for the Spark Datasource as well. The two approaches I have been thinking about:
- If the starting commit time of the incremental query is a delta commit without compaction, we can search for the previous compaction commit and read from there, then do the filtering later. That way we avoid missing the base file, but it is not the optimal solution since we read some redundant parquet files.
- We can improve the existing HoodieRealtimeFormat or HoodieRealtimeRecordReader to handle the missing base file scenario. Maybe create a HoodieRealtimeFileSlice with an empty baseFile during the file listing (roughly as illustrated after this comment)? Do you think this is doable?
WDYT?
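A hypothetical illustration of the second approach: let the split/slice carry an optional base file so an uncompacted file group (log files only) can still be represented. The names here are placeholders, not existing Hudi classes.

```java
import java.util.List;
import java.util.Optional;

class RealtimeSplitWithOptionalBaseSketch {

  final Optional<String> baseFilePath;   // empty when the file group has no base parquet file yet
  final List<String> logFilePaths;

  RealtimeSplitWithOptionalBaseSketch(Optional<String> baseFilePath, List<String> logFilePaths) {
    this.baseFilePath = baseFilePath;
    this.logFilePaths = logFilePaths;
  }

  // Readers would fall back to merging only the log files for log-only splits.
  boolean isLogOnly() {
    return !baseFilePath.isPresent();
  }
}
```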
@garyli1019 Are you talking about corner cases not handled in this PR? Can you review the PR once for the intended functionality? I am trying to see if this can help MOR incremental queries on Spark SQL in some form.
@vinothchandar I think the Spark Datasource will use a different approach. IIUC, this PR is trying to solve the case where the incremental query starts from an uncompacted delta commit, which doesn't have a base file for some file groups and leads to missing the log records. For the Spark Datasource, we can create a
Yes, this is trying to fix things for Hive primarily. We can take a separate approach for Spark incremental queries.
cc @satishkotha to review and suggest what to do in the 0.6.0 timeline
satishkotha left a comment:
@n3nash said he is going to take a look at this. I'm just leaving one high-level question to understand the approach.
| + ", Ids :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR)); | ||
| // sanity check | ||
| ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit, | ||
| ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit || split instanceof HoodieMORIncrementalFileSplit, |
High-level question: is it possible to make baseFile optional in HoodieRealtimeFileSplit instead of creating the new class HoodieMORIncrementalFileSplit? We may also have to make changes in the RecordReader classes if baseFile is not present.
@satishkotha There are a few requirements we need to satisfy in order to support this in HoodieRealtimeFileSplit:
- The start and end time should be honored by the incremental query. If the end time is not specified, it can be assumed to be the min commit from (maxNumberOfCommits, mostRecentCommit). Currently this is not happening as intended (a rough sketch of the end-time handling follows below).
- The base file and log files can each be optional. This can be the case when the boundaries of the incremental query filter are such that the start commit time matches only a log file and/or the end commit time matches only the base file across file slices, or when the incremental query touches a FileSlice that is not compacted yet.
When I initially started, I was not sure how big the refactor and its testing would be to achieve both of the above requirements in the same HoodieRealtimeFileSplit. It would also require regression testing of snapshot queries in all query engines, plus the new incremental query path in all query engines. So, instead of impacting the snapshot query code path that is running fine, I conservatively branched out to make these changes applicable only to the incremental query path, with the intent to consolidate them in the long term after stabilizing and gaining more confidence.
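A rough sketch of the first requirement only, under the assumption that commit times are compared as the usual lexicographically ordered timestamp strings: clamp the effective end time to the latest completed commit and honor a smaller user-supplied end time. The maxNumberOfCommits bound mentioned above is deliberately omitted, and all names are placeholders.

```java
import java.util.Optional;

class IncrementalWindowSketch {

  static String resolveEndTime(Optional<String> requestedEndTime, String latestCompletedCommitTime) {
    if (!requestedEndTime.isPresent()) {
      // No end time supplied: default to the most recent completed commit.
      return latestCompletedCommitTime;
    }
    // Honor the requested end time only if it is earlier than the latest completed commit.
    String requested = requestedEndTime.get();
    return requested.compareTo(latestCompletedCommitTime) < 0 ? requested : latestCompletedCommitTime;
  }
}
```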
@garyli1019 Is this PR addressed as part of #1938, or should we revive this?
Hi @n3nash, #1938 adds support for MOR incremental queries on the Spark DataSource. This PR seems to aim at fixing Hive.
@garyli1019 Thanks for the context. @bhasudha Are you able to revive and rebase this PR? Once you do that, I can help land this.
@bhasudha I will try to rebase this PR and get it to a working state tomorrow; if you are working on it already, please let me know.
@garyli1019 and @n3nash: can we have an owner for this? If not, I can take it up; I haven't worked on the incremental path before, but I'm willing to expand my knowledge :)
@nsivabalan I am too busy to land Hudi at work, so I may not be able to work on this anytime soon. Please feel free to pick this up if you are interested.
@alexeykudinkin I believe
@yihua it does
Sounds good. Given
What is the purpose of the pull request
This commit addresses two issues:
1. Honors the end time if it is less than the most recent completed commit time.
2. Does not require a base parquet file to be present when the begin and end times match only the delta commits.
Brief change log
Pending action items before moving to full PR
Verify this pull request
(Please pick one of the following options)
- This pull request is a trivial rework / code cleanup without any test coverage.
- This pull request is already covered by existing tests, such as (please describe tests).
- This change added tests and can be verified as follows:
Committer checklist
- Has a corresponding JIRA in PR title & commit
- Commit message is descriptive of the change
- CI is green
- Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.