
Conversation

@leesf (Contributor) commented on Nov 10, 2021:

Tips

What is the purpose of the pull request

(For example: This pull request adds a quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


@leesf requested a review from @vinothchandar on Nov 11, 2021.
@vinothchandar added the `rfc` (Request for comments) label on Nov 21, 2021.
Review comment on rfc/README.md (outdated diff):
| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
| 38 | [Spark DataSource V2 Integration](./rfc-38/rfc-38.md) | `UNDER REVIEW` |
Member: let's separate the number update into a different PR, as mentioned in the process?

Contributor (Author): in fact that PR has already been merged (#3964); will update this PR.

Review comment on the RFC document (Proposers/Approvers section):

- @leesf

## Approvers
-
Member: @xushiyan @YannByron and I can be approvers, if you don't mind.

Contributor (Author): sure.

@vinothchandar (Member) commented:
@leesf Love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages: preCombine -> index -> partition -> write, before committing out the files. In other words, we benefit from the v1 API providing ways to shuffle the dataframe further before writing to disk, and IIUC v2 takes this flexibility away?

Assuming I am correct (and spark has not introduced any new APIs that help us mitigate this), should we do the following?

  • Introduce a new hudiv2 datasource, i.e. spark.write.format("hudiv2"), that just supports bulk_insert on the datasource write path.
  • We also add a new SparkDatasetWriteClient which exposes methods for upsert, delete, etc., and we use that as the basis for our SQL/DML layer as well.
  • We continue to support the v1 hudi datasource as-is for some time. There are lots of users who like how they can do upserts/deletes by executing spark.write.format("hudi").option()...
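
For concreteness, a minimal sketch of what this split could look like from the user's side. Everything here is an assumption drawn from the bullets above: "hudiv2" is only a name floated in this discussion, not a shipped format, and required options such as the record key and precombine field are omitted for brevity.

```scala
import org.apache.spark.sql.SparkSession

object HudiV2UsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudiv2-usage-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", 1636502400L), (2, "b", 1636502400L))
      .toDF("id", "name", "ts")

    // Proposed: bulk_insert goes through the new, V2-only source.
    // ("hudiv2" is a name floated in this discussion, not a shipped format.)
    df.write.format("hudiv2")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.table.name", "trips")
      .mode("append")
      .save("/tmp/hudi/trips")

    // Unchanged: upserts/deletes keep going through the existing V1 source.
    df.write.format("hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.table.name", "trips")
      .mode("append")
      .save("/tmp/hudi/trips")

    spark.stop()
  }
}
```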

@vinothchandar self-assigned this on Dec 9, 2021.
@leesf (Contributor, Author) commented on Dec 9, 2021:

> @leesf Love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages: preCombine -> index -> partition -> write, before committing out the files. In other words, we benefit from the v1 API providing ways to shuffle the dataframe further before writing to disk, and IIUC v2 takes this flexibility away?
>
> Assuming I am correct (and spark has not introduced any new APIs that help us mitigate this), should we do the following?
>
> • Introduce a new hudiv2 datasource, i.e. spark.write.format("hudiv2"), that just supports bulk_insert on the datasource write path.
> • We also add a new SparkDatasetWriteClient which exposes methods for upsert, delete, etc., and we use that as the basis for our SQL/DML layer as well.
> • We continue to support the v1 hudi datasource as-is for some time. There are lots of users who like how they can do upserts/deletes by executing spark.write.format("hudi").option()...

@vinothchandar In fact, I do not intend to introduce a "hudiv2" format when introducing the V2 code path, since it would force end users to change their code, and "hudiv2" is not a good name ("hudi" is good enough) IMO. Instead, I would like to rename the former "hudi" format to "hudi_internal" and make the "hudi" format the V2 code path by default, so the change is transparent to end users.
In the first phase, we would fall back to the V1 write path while introducing the V2 interfaces (HoodieCatalog and HoodieInternalTableV2), and integrate with the current bulk_insert V2 write path. In the second phase, we would explore integrating with SparkDatasetWriteClient, for which @xushiyan did a PoC, to make it a purely V2 code path.
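
For illustration, here is a minimal sketch of one way such a phase-one fallback can be wired up on Spark 3.2+: the V2 table advertises TableCapability.V1_BATCH_WRITE and returns a V1Write whose InsertableRelation delegates to the legacy write path. The class below is illustrative only, not Hudi's actual HoodieInternalTableV2.

```scala
import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, V1Write, WriteBuilder}
import org.apache.spark.sql.sources.InsertableRelation
import org.apache.spark.sql.types.StructType

// Illustrative only: a V2 Table that advertises V1_BATCH_WRITE, so Spark
// routes the actual write through a V1 InsertableRelation. This lets a V2
// entry point (catalog + table) land first while the proven V1 write path
// keeps doing the work.
class FallbackToV1TableSketch(tableSchema: StructType) extends SupportsWrite {

  override def name(): String = "fallback_to_v1_sketch"

  override def schema(): StructType = tableSchema

  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.V1_BATCH_WRITE, TableCapability.BATCH_WRITE).asJava

  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
    new WriteBuilder {
      override def build(): V1Write = new V1Write {
        override def toInsertableRelation: InsertableRelation =
          new InsertableRelation {
            override def insert(data: DataFrame, overwrite: Boolean): Unit = {
              // Phase one: delegate to the existing V1 write path here
              // (e.g. the logic behind the legacy "hudi" relation).
            }
          }
      }
    }
}
```

The appeal of this shape is that the V2 surface (catalog and table resolution) can land first while writes keep the proven V1 semantics until a pure V2 path is ready.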

@vinothchandar (Member) commented:

> In the first phase, we would fall back to the V1 write path

Can this be done? Love to see some code for this.

@leesf (Contributor, Author) commented on Dec 10, 2021:

> > In the first phase, we would fall back to the V1 write path
>
> Can this be done? Love to see some code for this.

Yes, will open a PR in the coming days.

@nsivabalan (Contributor) commented:

@leesf: I did not go through the lineage of this patch, but I do know we landed another PR related to Spark DataSource V2. So is this patch still valid, or can we close it out?

@xushiyan (Member) commented:

> @leesf: I did not go through the lineage of this patch, but I do know we landed another PR related to Spark DataSource V2. So is this patch still valid, or can we close it out?

@nsivabalan this is the RFC PR. Current work is in #4611.

@xushiyan (Member) left a review comment:

@leesf thanks for the RFC; please kindly update some parts to reflect our latest discussion.

@hudi-bot (Collaborator) commented:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@xushiyan (Member) left a review comment:

LGTM

@vinothchandar (Member) left a review comment:

We can land this RFC and keep it evolving as we enter the next phases.

@leesf merged commit 76b6ad6 into apache:master on Feb 21, 2022.
vingov pushed a commit to vingov/hudi that referenced this pull request on Apr 3, 2022.