
Conversation

@leesf leesf commented Dec 17, 2021


What is the purpose of the pull request

  1. Add new modules and refactor existing ones so that both Spark 2 and Spark 3 work. The main change moves classes from the hudi-spark module into the hudi-spark-common module so that the hudi-spark2, hudi-spark3, and hudi-spark3.1.x modules can all make use of them. It also adds hudi-spark3-common and hudi-spark2-common modules so that Spark 3.1.x and Spark 3.2.0 reuse most of the code; hudi-spark2-common contains no classes and is just a placeholder so that the package command works for hudi-spark-bundle and the hudi-spark module.
  2. Rename the hudi format under spark-datasource/hudi-spark to hudi_v1, and register a hudi format in the hudi-spark2, hudi-spark3, and hudi-spark3.1.x modules.
  3. Introduce HoodieCatalog to manage Hudi tables and make writes and reads work with both Spark 2 and Spark 3. Currently both the write and read paths fall back to the V1 path, but a V2 Table is introduced to pave the way for a pure V2 code path, and V2 commands are handled accordingly in the HoodieSpark3Analysis class since some commands changed in V2.
  4. Spark 3.1.x and Spark 2.4.x still use the existing Spark SQL path to write and read data and still use the V1 format, but from Spark 3.2.0 we move to the V2 format, and users need to configure 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' (see the sketch after this list).
  5. Add a flag so that existing tests run on Spark 3.2.0 with HoodieCatalog.
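
As a hedged illustration of item 4, the snippet below sketches how a Spark 3.2.0 job might be configured once HoodieCatalog is in place. The application name, table name, and path are made up for the example, and the write options shown are the usual Hudi datasource options rather than anything introduced by this PR.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-spark-3.2-example") // hypothetical app name
  // From Spark 3.2.0 onwards, route the session catalog through Hudi's catalog.
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .config("spark.sql.extensions",
          "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Reads and writes keep using the familiar "hudi" format name.
val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as string) as name", "current_timestamp() as ts")

df.write.format("hudi")
  .option("hoodie.table.name", "demo_tbl")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi/demo_tbl")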


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

leesf commented Dec 17, 2021

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from efaf807 to e8837f6 on December 17, 2021 08:31
@xiarixiaoyao

Great work!!!

@leesf leesf force-pushed the HUDI-3047 branch 3 times, most recently from ac933cb to 0ecb15f on December 20, 2021 14:50
@vinothchandar vinothchandar self-assigned this Dec 21, 2021
@leesf leesf force-pushed the HUDI-3047 branch 5 times, most recently from 3855884 to fea9016 on December 22, 2021 12:28
leesf commented Dec 23, 2021

@hudi-bot run azure

@xushiyan

@leesf Only halfway through; will continue tomorrow. Could you please also summarize the changes in the description as a high-level intro? It would be easier to digest. Thanks!

leesf commented Dec 23, 2021

@leesf Only halfway through; will continue tomorrow. Could you please also summarize the changes in the description as a high-level intro? It would be easier to digest. Thanks!

I have edited the description to summarize the changes.

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from daaabf8 to 66d2d16 on December 23, 2021 15:08
@vinothchandar vinothchandar left a comment

@leesf Before looking at the code deeply, can you please clarify what breaking changes we plan in this approach? Is the current "hudi" source code path renamed to something else, so that existing users need to update all their jobs to use the new source?

leesf commented Dec 26, 2021

@leesf Before looking at the code deeply, can you please clarify what breaking changes we plan in this approach? Is the current "hudi" source code path renamed to something else, so that existing users need to update all their jobs to use the new source?

@vinothchandar There are no breaking changes in this approach; users do not need to change the format they use. I moved some classes from hudi-spark-datasource/hudi-spark into the hudi-spark-datasource/hudi-spark-common module so that they can be reused. The hudi format that previously lived under spark-datasource/hudi-spark is moved to spark-datasource/hudi-spark-common to reuse the DefaultSource code, and a hudi format is registered in both hudi-spark-datasource/hudi-spark2 and hudi-spark-datasource/hudi-spark3, which means users still use the hudi format no matter which Spark version they are on, since the hudi-spark-datasource/hudi-spark module depends on either hudi-spark-datasource/hudi-spark2 or hudi-spark-datasource/hudi-spark3.
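
To make the reuse concrete, here is a minimal sketch (class and package names are illustrative, not the ones in this PR) of how a per-Spark-version module can keep exposing the "hudi" short name while the shared implementation lives in hudi-spark-common under "hudi_v1".

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

// Shared source, conceptually living in hudi-spark-common and registered as "hudi_v1".
class CommonHudiSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "hudi_v1"

  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    // Resolve the Hudi table and build the relation here (omitted in this sketch).
    throw new UnsupportedOperationException("sketch only")
  }
}

// Per-version module (hudi-spark2 / hudi-spark3): reuse the common code but keep
// the familiar short name, so spark.read.format("hudi") works on every Spark version.
class VersionedHudiSource extends CommonHudiSource {
  override def shortName(): String = "hudi"
}

In practice the short name is wired up through each module's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entry, which is why the hudi-spark module only needs to depend on the right per-version module.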

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from f984f3a to 66d2d16 on December 26, 2021 14:49
leesf commented Dec 27, 2021

@hudi-bot run azure

@vinothchandar

@leesf So this is just preparing the code and moving things around, with no functional changes? Will review this more closely.

@vinothchandar

You may want to rebase the PR; it seems like a lot went in :)

@xushiyan xushiyan left a comment

Did a quick pass and left some naming suggestions.

leesf commented Dec 31, 2021

@leesf So this is just preparing the code and moving things around, with no functional changes? Will review this more closely.

Yes, no functional changes.

spark: "spark2"
- scala: "scala-2.11"
spark: "spark2,spark-shade-unbundle-avro"
- scala: "scala-2.12"
@leesf leesf Dec 31, 2021

@xushiyan I find it very hard to keep compatibility between Spark 3.2.0 and Spark 3.0.x/3.1.x (there is no V1Write in Spark 3.0.x and 3.1.x) after we upgrade the Spark version to 3.2.0, so I commented out the workflow.

Member

@leesf We introduced the spark3.0.x and spark3.1.x build profiles mostly because of Spark's own incompatibilities. Here I think we can set a rule: to enable the V2 writer, users have to make sure they are on Spark 3.2+. Sounds good? In the future, we may gradually drop support for old Spark versions if the old Spark code deviates too far from the latest one.
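
A small sketch of that rule, assuming a hypothetical helper (not part of this PR) that gates the V2 writer on the runtime Spark version and falls back to the V1 path otherwise:

import org.apache.spark.SPARK_VERSION

object SparkVersionGate {
  // Matches version strings such as "3.2.0" or "2.4.8".
  private val versionPattern = """^(\d+)\.(\d+).*""".r

  /** True when the runtime Spark is 3.2 or newer, i.e. when the V2 writer
    * backed by HoodieCatalog may be enabled; anything older keeps the V1 path. */
  def supportsV2Writer(sparkVersion: String = SPARK_VERSION): Boolean =
    sparkVersion match {
      case versionPattern(major, minor) =>
        major.toInt > 3 || (major.toInt == 3 && minor.toInt >= 2)
      case _ => false
    }
}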

Contributor Author

Sounds good.

@leesf leesf force-pushed the HUDI-3047 branch 5 times, most recently from e98123a to 7a49e17 on January 2, 2022 12:48
leesf commented Jan 2, 2022

@hudi-bot run azure

hudi-bot commented Jan 5, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

}

- override def shortName(): String = "hudi"
+ override def shortName(): String = "hudi_v1"
Member

This would cause every job out there to be upgraded? Not sure we can afford to do that. I would also like to clearly understand whether the new V2 implementation will support all of the existing functionality, i.e. be a drop-in replacement for the current V1 implementation.

I think it's crucial to get aligned on this before we proceed further.

*/
def parseMultipartIdentifier(parser: ParserInterface, sqlText: String): Seq[String]

def isHoodieTable(table: LogicalPlan, spark: SparkSession): Boolean = {
Contributor

Is there any difference from hoodieSqlCommonUtils.isHoodieTable? I see that sometimes we use adapter.isHoodieTable and sometimes hoodieSqlCommonUtils.isHoodieTable.

Contributor Author

Is there any difference from hoodieSqlCommonUtils.isHoodieTable? I see that sometimes we use adapter.isHoodieTable and sometimes hoodieSqlCommonUtils.isHoodieTable.

In fact, the hoodieSqlCommonUtils.isHoodieTable method is used to judge whether a table is a Hudi table in the V1 codebase, while the adapter.isHoodieTable method judges whether a table is a Hudi table in the V2 codebase; renaming it makes this easier to understand.

Contributor

Thanks for your reply, and please correct me if I'm wrong. Can we just move the method implementation out of SparkAdapter into the version-specific adapters? Spark2Adapter can judge whether a table is a Hudi table via the V1 codebase, while Spark3Adapter can judge it via the V2 codebase; that way, I think the hoodieSqlCommonUtils.isHoodieTable method can simply be removed.

Looking forward to this PR being merged as soon as possible; this is excellent work, as we can support many other features with DSV2 based on it. :)
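
For illustration, a minimal sketch of the split suggested above, assuming a simplified signature based on CatalogTable (the real adapters take a LogicalPlan and SparkSession, as shown earlier); the class names and the provider check below are assumptions, not code from this PR.

import org.apache.spark.sql.catalyst.catalog.CatalogTable

trait SparkAdapter {
  /** Judge whether a table is a Hudi table; each Spark version can supply its own logic. */
  def isHoodieTable(table: CatalogTable): Boolean =
    table.provider.exists(_.equalsIgnoreCase("hudi")) // V1-style provider check
}

// Spark 2: only the V1 catalog exists, so the default provider check is enough.
class Spark2Adapter extends SparkAdapter

// Spark 3: a V2 table may also carry the provider in its properties map,
// so check both before deciding.
class Spark3Adapter extends SparkAdapter {
  override def isHoodieTable(table: CatalogTable): Boolean =
    super.isHoodieTable(table) ||
      table.properties.get("provider").exists(_.equalsIgnoreCase("hudi"))
}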

leesf commented Jan 20, 2022

Closing this, as there is a new PR: #4611.

@leesf leesf closed this Jan 20, 2022