
Conversation

@leesf (Contributor) commented on Nov 10, 2021:

Tips

What is the purpose of the pull request

(For example: This pull request adds a quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


@leesf requested a review from @vinothchandar on Nov 11, 2021.
@vinothchandar added the `rfc` (Request for comments) label on Nov 21, 2021.
Review comment on rfc/README.md (outdated diff):
| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
| 38 | [Spark DataSource V2 Integration](./rfc-38/rfc-38.md) | `UNDER REVIEW` |
Member: let's separate the number update into a different PR, as mentioned in the process?

Contributor (Author): in fact that PR has already been merged (#3964); will update this PR.

Review comment on the RFC document (Proposers/Approvers section):

- @leesf

## Approvers
-
Member: @xushiyan @YannByron and I can be approvers, if you don't mind.

Contributor (Author): sure.

@vinothchandar (Member) commented:
@leesf Love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages: preCombine -> index -> partition -> write, before committing out the files. In other words, we benefit from the v1 API providing ways to shuffle the dataframe further before writing to disk, and IIUC v2 takes this flexibility away?

Assuming I am correct (and spark has not introduced any new APIs that help us mitigate this), should we do the following?

  • Introduce a new hudiv2 datasource, i.e. spark.write.format("hudiv2"), that just supports bulk_insert on the datasource write path.
  • We also add a new SparkDatasetWriteClient which exposes methods for upsert, delete, etc., and we use that as the basis for our SQL/DML layer as well.
  • We continue to support the v1 hudi datasource as-is for some time. There are lots of users who like how they can do upserts/deletes by executing spark.write.format("hudi").option()...
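
For concreteness, a minimal sketch of what this split could look like from the user's side. Everything here is an assumption drawn from the bullets above: "hudiv2" is only a name floated in this discussion, not a shipped format, and required options such as the record key and precombine field are omitted for brevity.

```scala
import org.apache.spark.sql.SparkSession

object HudiV2UsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudiv2-usage-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", 1636502400L), (2, "b", 1636502400L))
      .toDF("id", "name", "ts")

    // Proposed: bulk_insert goes through the new, V2-only source.
    // ("hudiv2" is a name floated in this discussion, not a shipped format.)
    df.write.format("hudiv2")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.table.name", "trips")
      .mode("append")
      .save("/tmp/hudi/trips")

    // Unchanged: upserts/deletes keep going through the existing V1 source.
    df.write.format("hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.table.name", "trips")
      .mode("append")
      .save("/tmp/hudi/trips")

    spark.stop()
  }
}
```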

@vinothchandar self-assigned this on Dec 9, 2021.
@leesf (Contributor, Author) commented on Dec 9, 2021:

> @leesf Love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages: preCombine -> index -> partition -> write, before committing out the files. In other words, we benefit from the v1 API providing ways to shuffle the dataframe further before writing to disk, and IIUC v2 takes this flexibility away?
>
> Assuming I am correct (and spark has not introduced any new APIs that help us mitigate this), should we do the following?
>
> • Introduce a new hudiv2 datasource, i.e. spark.write.format("hudiv2"), that just supports bulk_insert on the datasource write path.
> • We also add a new SparkDatasetWriteClient which exposes methods for upsert, delete, etc., and we use that as the basis for our SQL/DML layer as well.
> • We continue to support the v1 hudi datasource as-is for some time. There are lots of users who like how they can do upserts/deletes by executing spark.write.format("hudi").option()...

@vinothchandar In fact, I do not intend to introduce a "hudiv2" format when introducing the V2 code path, since it would force end users to change their code, and "hudiv2" is not a good name ("hudi" is good enough) IMO. Instead, I would like to rename the former "hudi" format to "hudi_internal" and make the "hudi" format the V2 code path by default, so the change is transparent to end users.
In the first phase, we would fall back to the V1 write path while introducing the V2 interfaces (HoodieCatalog and HoodieInternalTableV2), and integrate with the current bulk_insert V2 write path. In the second phase, we would explore integrating with SparkDatasetWriteClient, for which @xushiyan did a PoC, to make it a purely V2 code path.
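
For illustration, here is a minimal sketch of one way such a phase-one fallback can be wired up on Spark 3.2+: the V2 table advertises TableCapability.V1_BATCH_WRITE and returns a V1Write whose InsertableRelation delegates to the legacy write path. The class below is illustrative only, not Hudi's actual HoodieInternalTableV2.

```scala
import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, V1Write, WriteBuilder}
import org.apache.spark.sql.sources.InsertableRelation
import org.apache.spark.sql.types.StructType

// Illustrative only: a V2 Table that advertises V1_BATCH_WRITE, so Spark
// routes the actual write through a V1 InsertableRelation. This lets a V2
// entry point (catalog + table) land first while the proven V1 write path
// keeps doing the work.
class FallbackToV1TableSketch(tableSchema: StructType) extends SupportsWrite {

  override def name(): String = "fallback_to_v1_sketch"

  override def schema(): StructType = tableSchema

  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.V1_BATCH_WRITE, TableCapability.BATCH_WRITE).asJava

  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
    new WriteBuilder {
      override def build(): V1Write = new V1Write {
        override def toInsertableRelation: InsertableRelation =
          new InsertableRelation {
            override def insert(data: DataFrame, overwrite: Boolean): Unit = {
              // Phase one: delegate to the existing V1 write path here
              // (e.g. the logic behind the legacy "hudi" relation).
            }
          }
      }
    }
}
```

The appeal of this shape is that the V2 surface (catalog and table resolution) can land first while writes keep the proven V1 semantics until a pure V2 path is ready.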

@vinothchandar (Member) commented:

> In the first phase, we would fall back to the V1 write path

Can this be done? Love to see some code for this.

@leesf (Contributor, Author) commented on Dec 10, 2021:

> > In the first phase, we would fall back to the V1 write path
>
> Can this be done? Love to see some code for this.

Yes, will open a PR in the coming days.

@nsivabalan (Contributor) commented:

@leesf: I did not go through the lineage of this patch, but I do know we landed another PR related to Spark DataSource V2. So is this patch still valid, or can we close it out?

@xushiyan (Member) commented:

> @leesf: I did not go through the lineage of this patch, but I do know we landed another PR related to Spark DataSource V2. So is this patch still valid, or can we close it out?

@nsivabalan this is the RFC PR. Current work is in #4611.

@xushiyan (Member) left a review comment:

@leesf thanks for the RFC; please kindly update some parts to reflect our latest discussion.

@hudi-bot (Collaborator) commented:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@xushiyan (Member) left a review comment:

LGTM

@vinothchandar (Member) left a review comment:

We can land this RFC and keep it evolving as we enter the next phases.

@leesf merged commit 76b6ad6 into apache:master on Feb 21, 2022.
vingov pushed a commit to vingov/hudi that referenced this pull request on Apr 3, 2022.