[HUDI-3047] Basic Implementation of Spark Datasource V2 #4350
efaf807 to e8837f6
Great work!!!
ac933cb to 0ecb15f
3855884 to fea9016
@hudi-bot run azure
...atasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala
Outdated
...source/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
...source/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
@leesf Only halfway through; will continue tomorrow. Could you please also summarize the changes in the description as a high-level intro? That would be easier to digest. Thanks!
I have edited the description to summarize the changes.
daaabf8 to 66d2d16
vinothchandar
left a comment
@leesf Before looking at the code deeply, can you please clarify what breaking changes we plan in this approach? Is the current "hudi" source codepath renamed to something else? Would existing users need to update all their jobs to use the new source?
@vinothchandar There are no breaking changes in this approach; users do not need to change their format. I moved some classes from
f984f3a to 66d2d16
@hudi-bot run azure
@leesf So this is just preparing the code and moving things around, with no functional changes? I will review this more closely.
You may want to rebase the PR; it seems like a lot went in :)
xushiyan
left a comment
Did a quick pass and left some naming suggestions.
...rce/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieInternalTableV2.scala
Outdated
...tasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/StageHoodieTable.scala
Outdated
...asource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/TableCreationModes.java
Outdated
...asource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/TableCreationModes.java
Outdated
...tasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/StageHoodieTable.scala
Outdated
...spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/adapter/Spark3Adapter.scala
Outdated
...ource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieConfigBuilder.scala
Outdated
...ource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieConfigBuilder.scala
...ource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieConfigBuilder.scala
Yes, no functional changes.
  spark: "spark2"
- scala: "scala-2.11"
  spark: "spark2,spark-shade-unbundle-avro"
- scala: "scala-2.12"
@xushiyan I find it very hard to stay compatible across Spark 3.2.0 and Spark 3.0.x/3.1.x (there is no V1Write in Spark 3.0.x and 3.1.x) after we upgrade the Spark version to 3.2.0, so I commented out the workflow.
@leesf We introduced the build profiles spark3.0.x and spark3.1.x mostly because of Spark's own incompatibilities. Here I think we can set a rule: to enable the v2 writer, users have to make sure they are on Spark 3.2+. Sounds good? In the future, we may gradually drop support for old Spark versions if the old Spark code deviates too far from the latest one.
sounds good.
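The rule agreed on in this thread (v2 writer only on Spark 3.2+) could be enforced with a small version gate. The following is only a sketch: `V2WriterGate`, `majorMinor`, and `supportsV2Writer` are hypothetical names, not part of Hudi, and the version string is passed in explicitly so the snippet has no Spark dependency.

```scala
// Hypothetical helper, not part of Hudi: decides whether the v2 writer may be enabled.
object V2WriterGate {

  /** Parses "major.minor[.rest]" into (major, minor); returns None if malformed. */
  def majorMinor(version: String): Option[(Int, Int)] =
    version.split("\\.").toList match {
      case major :: minor :: _ =>
        scala.util.Try((major.toInt, minor.toInt)).toOption
      case _ => None
    }

  /** True only for Spark 3.2 or newer, the rule proposed in this thread. */
  def supportsV2Writer(sparkVersion: String): Boolean =
    majorMinor(sparkVersion).exists { case (major, minor) =>
      major > 3 || (major == 3 && minor >= 2)
    }
}
```

In a real job the input would presumably come from `org.apache.spark.SPARK_VERSION` or `spark.version`; an unparseable version is treated as unsupported here, which is a conservative assumption.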
e98123a to 7a49e17
@hudi-bot run azure
…k3 work. 2. Introduce HoodieCatalog to manage hudi tables and make write/read workable with spark2&spark3
  }

- override def shortName(): String = "hudi"
+ override def shortName(): String = "hudi_v1"
Would this cause every job out there to be upgraded? I am not sure we can afford to do this. I would also like to clearly understand whether the new v2 implementation will support ALL the existing functionality, or be a drop-in replacement for the current v1 implementation.
I think it's crucial to get aligned on this before we proceed further.
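To make the concern concrete: Spark resolves `format("...")` by matching the string against each registered source's `shortName()`, ignoring case. The sketch below imitates that lookup with stand-in types. `Register`, `V1Source`, `V2Source`, and `Lookup` are hypothetical names for illustration, not Spark or Hudi classes.

```scala
// Minimal stand-in for Spark's DataSourceRegister, which exposes shortName().
trait Register { def shortName: String }

// Hypothetical sources mirroring the rename discussed above.
object V1Source extends Register { val shortName = "hudi_v1" }
object V2Source extends Register { val shortName = "hudi" }

object Lookup {
  /** Finds the source registered under `format`, ignoring case. */
  def resolve(format: String, registered: Seq[Register]): Option[Register] =
    registered.find(_.shortName.equalsIgnoreCase(format))
}
```

Under these assumptions, a job that calls `format("hudi")` keeps resolving only if some source still registers the `hudi` short name; with the renamed v1 source alone, the lookup finds nothing.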
   */
  def parseMultipartIdentifier(parser: ParserInterface, sqlText: String): Seq[String]

  def isHoodieTable(table: LogicalPlan, spark: SparkSession): Boolean = {
Is there any difference from HoodieSqlCommonUtils.isHoodieTable? I see that we sometimes use adapter.isHoodieTable and sometimes use HoodieSqlCommonUtils.isHoodieTable.
In fact, HoodieSqlCommonUtils.isHoodieTable is used in the v1 codebase to judge whether a table is a Hudi table, while adapter.isHoodieTable does the same in the v2 codebase. Changing the names would make this easier to understand.
Thanks for your reply, and please correct me if I'm wrong. Can we just move the method implementation out of SparkAdapter into the version-specific adapters? Spark2Adapter can judge whether a table is a Hudi table via the v1 codebase, while Spark3Adapter can judge it via the v2 codebase. With this, I think HoodieSqlCommonUtils.isHoodieTable can simply be removed?
Looking forward to this PR being merged as soon as possible. This is excellent work, as we can support many other features with DSV2 based on it. :)
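The suggestion above can be sketched as an abstract method on the adapter with one implementation per Spark line. This is a simplified illustration, not the PR's code: `Plan`, `V1Relation`, and `V2Relation` are hypothetical stand-ins for Spark's `LogicalPlan` nodes, the `SparkSession` parameter is omitted, and the matching criteria (provider name, table class name) are assumptions.

```scala
// Simplified stand-ins for Spark logical plan nodes (hypothetical types).
sealed trait Plan
final case class V1Relation(provider: String) extends Plan   // v1 catalog table and its provider
final case class V2Relation(tableClass: String) extends Plan // DSv2 relation and its table class

// The shared adapter only declares the check; each version supplies its own logic.
trait SparkAdapter {
  def isHoodieTable(plan: Plan): Boolean
}

// Spark 2: decide through the v1 code path (assumed: provider name equals "hudi").
object Spark2Adapter extends SparkAdapter {
  def isHoodieTable(plan: Plan): Boolean = plan match {
    case V1Relation(provider) => provider.equalsIgnoreCase("hudi")
    case _                    => false
  }
}

// Spark 3: decide through the v2 code path (assumed: relation wraps the Hudi v2 table class).
object Spark3Adapter extends SparkAdapter {
  def isHoodieTable(plan: Plan): Boolean = plan match {
    case V2Relation(tableClass) => tableClass == "HoodieInternalTableV2"
    case _                      => false
  }
}
```

With this shape, callers would always go through the adapter and never need a second utility method. Whether the Spark 3 adapter should also fall back to the v1 check for tables not loaded through the v2 catalog is left open here.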
Closing this, as there is a new PR: #4611
What is the purpose of the pull request
hudi format to hudi_v1 and add hudi format in hudi-spark2, hudi-spark3 and hudi-spark3.1.x module.
HoodieSpark3Analysis class since some commands changed in V2.
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.