[HUDI-2732][RFC-38] Spark Datasource V2 Integration #3964
rfc/README.md
Outdated
| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
| 38 | [Spark DataSource V2 Integration](./rfc-38/rfc-38.md) | `UNDER REVIEW` |
Let's separate the number update into a different PR, as mentioned in the process?
In fact, the PR has been merged (#3964); I will update the PR.
rfc/rfc-38/rfc-38.md
Outdated
- @leesf

## Approvers
-
@xushiyan @YannByron and I can be approvers if you don't mind
sure
@leesf I'd love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages:
Assuming I am correct (and Spark has not introduced any new APIs that help us mitigate this), should we do the following?
@vinothchandar In fact, I do not intend to introduce a "hudiv2" format along with the V2 code path, since that would force end users to change their code, and IMO "hudiv2" is not a good name ("hudi" is good enough). Instead, I would like to rename the former "hudi" format to "hudi_internal" and make "hudi" resolve to the V2 code path by default, so the change is transparent to end users, and integrate it with the current bulk_insert V2 write path.
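The renaming proposed above can be pictured with a toy dispatch table. This is an illustrative sketch only, not Hudi or Spark code: the registries, the `write` helper, and the two handler functions are hypothetical names invented for the example. It only shows why keeping the "hudi" name lets user code stay unchanged while the underlying path switches.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the two code paths; they only return a label.
def v1_write_path(table: str) -> str:
    return f"v1 write to {table}"

def v2_write_path(table: str) -> str:
    return f"v2 write to {table}"

# Before the proposed change: the "hudi" name resolves to the V1 path.
REGISTRY_BEFORE: Dict[str, Callable[[str], str]] = {
    "hudi": v1_write_path,
}

# After the proposed change: "hudi" resolves to the V2 path, and the
# former implementation stays reachable under "hudi_internal".
REGISTRY_AFTER: Dict[str, Callable[[str], str]] = {
    "hudi": v2_write_path,
    "hudi_internal": v1_write_path,
}

def write(registry: Dict[str, Callable[[str], str]], fmt: str, table: str) -> str:
    """Look up the handler registered for a format name and invoke it."""
    if fmt not in registry:
        raise ValueError(f"unknown format: {fmt}")
    return registry[fmt](table)

# The user-facing call is identical in both worlds; only the registry differs.
before = write(REGISTRY_BEFORE, "hudi", "trips")
after = write(REGISTRY_AFTER, "hudi", "trips")
legacy = write(REGISTRY_AFTER, "hudi_internal", "trips")
```

Under a separate "hudiv2" name, by contrast, every caller would have to edit the format string to reach the new path, which is the cost the comment wants to avoid.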
Can this be done? Love to see some code for this.
Yes, I will open a PR in the next few days.
@leesf: I did not go through the lineage of this patch, but I do know we landed another PR related to Spark DataSource V2. So, is this patch still valid, or can we close it out?
@nsivabalan This is the RFC PR. Current work is in #4611.
xushiyan
left a comment
@leesf thanks for the rfc; please kindly update some parts to reflect our latest discussion.
xushiyan
left a comment
LGTM
vinothchandar
left a comment
We can land this RFC and keep it evolving as we enter the next phases
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.