
Conversation

@leesf leesf commented Dec 17, 2021


What is the purpose of the pull request

  1. Add new modules and refactor existing ones so that both Spark 2 and Spark 3 work. The main change moves classes from the hudi-spark module into the hudi-spark-common module so that the hudi-spark2, hudi-spark3, and hudi-spark3.1.x modules can all make use of them. It also adds hudi-spark3-common and hudi-spark2-common modules so that Spark 3.1.x and Spark 3.2.0 reuse most of the code; hudi-spark2-common contains no classes and is just a placeholder so that the package command works for hudi-spark-bundle and the hudi-spark module.
  2. Rename the hudi format under spark-datasource/hudi-spark to hudi_v1, and register a hudi format in the hudi-spark2, hudi-spark3, and hudi-spark3.1.x modules.
  3. Introduce HoodieCatalog to manage Hudi tables and make writes and reads work with both Spark 2 and Spark 3. Currently both the write and read paths fall back to the V1 path, but a V2 Table is introduced to pave the way for a pure V2 code path, and V2 commands are handled accordingly in the HoodieSpark3Analysis class since some commands changed in V2.
  4. Spark 3.1.x and Spark 2.4.x still use the existing Spark SQL path to write and read data and still use the V1 format, but from Spark 3.2.0 we move to the V2 format, and users need to configure 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' (see the sketch after this list).
  5. Add a flag so that existing tests run on Spark 3.2.0 with HoodieCatalog.
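
As a hedged illustration of item 4, the snippet below sketches how a Spark 3.2.0 job might be configured once HoodieCatalog is in place. The application name, table name, and path are made up for the example, and the write options shown are the usual Hudi datasource options rather than anything introduced by this PR.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-spark-3.2-example") // hypothetical app name
  // From Spark 3.2.0 onwards, route the session catalog through Hudi's catalog.
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .config("spark.sql.extensions",
          "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Reads and writes keep using the familiar "hudi" format name.
val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as string) as name", "current_timestamp() as ts")

df.write.format("hudi")
  .option("hoodie.table.name", "demo_tbl")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi/demo_tbl")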


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

leesf commented Dec 17, 2021

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from efaf807 to e8837f6 on December 17, 2021 08:31
@xiarixiaoyao

Great work!!!

@leesf leesf force-pushed the HUDI-3047 branch 3 times, most recently from ac933cb to 0ecb15f on December 20, 2021 14:50
@vinothchandar vinothchandar self-assigned this Dec 21, 2021
@leesf leesf force-pushed the HUDI-3047 branch 5 times, most recently from 3855884 to fea9016 on December 22, 2021 12:28
leesf commented Dec 23, 2021

@hudi-bot run azure

@xushiyan

@leesf Only halfway through; will continue tomorrow. Could you please also summarize the changes in the description as a high-level intro? It would be easier to digest. Thanks!

leesf commented Dec 23, 2021

@leesf Only halfway through; will continue tomorrow. Could you please also summarize the changes in the description as a high-level intro? It would be easier to digest. Thanks!

I have edited the description to summarize the changes.

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from daaabf8 to 66d2d16 on December 23, 2021 15:08
@vinothchandar vinothchandar left a comment

@leesf Before looking at the code deeply, can you please clarify what breaking changes we plan in this approach? Is the current "hudi" source code path renamed to something else, so that existing users need to update all their jobs to use the new source?

leesf commented Dec 26, 2021

@leesf Before looking at the code deeply, can you please clarify what breaking changes we plan in this approach? Is the current "hudi" source code path renamed to something else, so that existing users need to update all their jobs to use the new source?

@vinothchandar There are no breaking changes in this approach; users do not need to change the format they use. I moved some classes from hudi-spark-datasource/hudi-spark into the hudi-spark-datasource/hudi-spark-common module so that they can be reused. The hudi format that previously lived under spark-datasource/hudi-spark is moved to spark-datasource/hudi-spark-common to reuse the DefaultSource code, and a hudi format is registered in both hudi-spark-datasource/hudi-spark2 and hudi-spark-datasource/hudi-spark3, which means users still use the hudi format no matter which Spark version they are on, since the hudi-spark-datasource/hudi-spark module depends on either hudi-spark-datasource/hudi-spark2 or hudi-spark-datasource/hudi-spark3.
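
To make the reuse concrete, here is a minimal sketch (class and package names are illustrative, not the ones in this PR) of how a per-Spark-version module can keep exposing the "hudi" short name while the shared implementation lives in hudi-spark-common under "hudi_v1".

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

// Shared source, conceptually living in hudi-spark-common and registered as "hudi_v1".
class CommonHudiSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "hudi_v1"

  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    // Resolve the Hudi table and build the relation here (omitted in this sketch).
    throw new UnsupportedOperationException("sketch only")
  }
}

// Per-version module (hudi-spark2 / hudi-spark3): reuse the common code but keep
// the familiar short name, so spark.read.format("hudi") works on every Spark version.
class VersionedHudiSource extends CommonHudiSource {
  override def shortName(): String = "hudi"
}

In practice the short name is wired up through each module's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entry, which is why the hudi-spark module only needs to depend on the right per-version module.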

@leesf leesf force-pushed the HUDI-3047 branch 2 times, most recently from f984f3a to 66d2d16 on December 26, 2021 14:49
leesf commented Dec 27, 2021

@hudi-bot run azure

@vinothchandar

@leesf So this is just preparing the code and moving things around, with no functional changes? Will review this more closely.

@vinothchandar

You may want to rebase the PR; it seems like a lot went in :)

@xushiyan xushiyan left a comment

Did a quick pass and left some naming suggestions.

leesf commented Dec 31, 2021

@leesf So this is just preparing the code and moving things around, with no functional changes? Will review this more closely.

Yes, no functional changes.

spark: "spark2"
- scala: "scala-2.11"
spark: "spark2,spark-shade-unbundle-avro"
- scala: "scala-2.12"
@leesf leesf Dec 31, 2021

@xushiyan I find it very hard to keep compatibility between Spark 3.2.0 and Spark 3.0.x/3.1.x (there is no V1Write in Spark 3.0.x and 3.1.x) after we upgrade the Spark version to 3.2.0, so I commented out the workflow.

Member

@leesf We introduced the spark3.0.x and spark3.1.x build profiles mostly because of Spark's own incompatibilities. Here I think we can set a rule: to enable the V2 writer, users have to make sure they are on Spark 3.2+. Sounds good? In the future, we may gradually drop support for old Spark versions if the old Spark code deviates too far from the latest one.
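
A small sketch of that rule, assuming a hypothetical helper (not part of this PR) that gates the V2 writer on the runtime Spark version and falls back to the V1 path otherwise:

import org.apache.spark.SPARK_VERSION

object SparkVersionGate {
  // Matches version strings such as "3.2.0" or "2.4.8".
  private val versionPattern = """^(\d+)\.(\d+).*""".r

  /** True when the runtime Spark is 3.2 or newer, i.e. when the V2 writer
    * backed by HoodieCatalog may be enabled; anything older keeps the V1 path. */
  def supportsV2Writer(sparkVersion: String = SPARK_VERSION): Boolean =
    sparkVersion match {
      case versionPattern(major, minor) =>
        major.toInt > 3 || (major.toInt == 3 && minor.toInt >= 2)
      case _ => false
    }
}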

Contributor Author

Sounds good.

@leesf leesf force-pushed the HUDI-3047 branch 5 times, most recently from e98123a to 7a49e17 on January 2, 2022 12:48
leesf commented Jan 2, 2022

@hudi-bot run azure

hudi-bot commented Jan 5, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

}

- override def shortName(): String = "hudi"
+ override def shortName(): String = "hudi_v1"
Member

This would cause every job out there to be upgraded? Not sure we can afford to do that. I would also like to clearly understand whether the new V2 implementation will support all of the existing functionality, i.e. be a drop-in replacement for the current V1 implementation.

I think it's crucial to get aligned on this before we proceed further.

*/
def parseMultipartIdentifier(parser: ParserInterface, sqlText: String): Seq[String]

def isHoodieTable(table: LogicalPlan, spark: SparkSession): Boolean = {
Contributor

Is there any difference from hoodieSqlCommonUtils.isHoodieTable? I see that sometimes we use adapter.isHoodieTable and sometimes hoodieSqlCommonUtils.isHoodieTable.

Contributor Author

Is there any difference from hoodieSqlCommonUtils.isHoodieTable? I see that sometimes we use adapter.isHoodieTable and sometimes hoodieSqlCommonUtils.isHoodieTable.

In fact, the hoodieSqlCommonUtils.isHoodieTable method is used to judge whether a table is a Hudi table in the V1 codebase, while the adapter.isHoodieTable method judges whether a table is a Hudi table in the V2 codebase; renaming it makes this easier to understand.

Contributor

Thanks for your reply, and please correct me if I'm wrong. Can we just move the method implementation out of SparkAdapter into the version-specific adapters? Spark2Adapter can judge whether a table is a Hudi table via the V1 codebase, while Spark3Adapter can judge it via the V2 codebase; that way, I think the hoodieSqlCommonUtils.isHoodieTable method can simply be removed.

Looking forward to this PR being merged as soon as possible; this is excellent work, as we can support many other features with DSV2 based on it. :)
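
For illustration, a minimal sketch of the split suggested above, assuming a simplified signature based on CatalogTable (the real adapters take a LogicalPlan and SparkSession, as shown earlier); the class names and the provider check below are assumptions, not code from this PR.

import org.apache.spark.sql.catalyst.catalog.CatalogTable

trait SparkAdapter {
  /** Judge whether a table is a Hudi table; each Spark version can supply its own logic. */
  def isHoodieTable(table: CatalogTable): Boolean =
    table.provider.exists(_.equalsIgnoreCase("hudi")) // V1-style provider check
}

// Spark 2: only the V1 catalog exists, so the default provider check is enough.
class Spark2Adapter extends SparkAdapter

// Spark 3: a V2 table may also carry the provider in its properties map,
// so check both before deciding.
class Spark3Adapter extends SparkAdapter {
  override def isHoodieTable(table: CatalogTable): Boolean =
    super.isHoodieTable(table) ||
      table.properties.get("provider").exists(_.equalsIgnoreCase("hudi"))
}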

leesf commented Jan 20, 2022

Closing this, as there is a new PR: #4611.

@leesf leesf closed this Jan 20, 2022