Skip to content

[HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation#4514

Merged
leesf merged 4 commits intoapache:masterfrom
leesf:HUDI-3172
Jan 14, 2022
Merged

[HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation#4514
leesf merged 4 commits intoapache:masterfrom
leesf:HUDI-3172

Conversation

@leesf
Copy link
Copy Markdown
Contributor

@leesf leesf commented Jan 5, 2022

Tips

What is the purpose of the pull request

  • Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x
  • Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format
  • Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@leesf leesf force-pushed the HUDI-3172 branch 4 times, most recently from 306e7d4 to 3a20dc1 Compare January 5, 2022 12:01
@leesf leesf requested a review from xushiyan January 5, 2022 12:01
@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 5, 2022

@hudi-bot run azure

@leesf leesf force-pushed the HUDI-3172 branch 3 times, most recently from d9381d0 to ac8d014 Compare January 6, 2022 05:15
@vinothchandar vinothchandar changed the title [HUDI-3172] Refactor hudi existing modules to make more code reuse in… [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation Jan 6, 2022
Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me make another pass at all the pom changes. That seems to be main thing here.
In the meantime, could you clarify these comments?

Also have you tested these changes across spark 2.x and 3.1/3.2 bundles ?

}

override def shortName(): String = "hudi"
override def shortName(): String = "hudi_v1"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leesf i suppose this refactoring PR not meant to include this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not change the format, it would conflict with hudi-spark2/hudi-spark3.1.x/hudi-spark3 module format.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it conflict? Given we are extending DefaultSource and overriding shortName()?

Copy link
Copy Markdown
Contributor Author

@leesf leesf Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is because in hudi-spark-bundle module. I used <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer"> <resource>META-INF/services/org.apache.spark.sql.sources.DataSourceRegister</resource> </transformer> to append the formats(hudi_v1 and hudi) in DataSourceRegister file, so it will conflict if not change the format. As to the PR itself, we do not need to change the format to hudi_v1 and not use AppendingTransformer. But when implementing V2 codepath, I find it difficult to handle the incremental bootstrap table(

) as it will pass schema to Spark and difficult to handle in v2 codepath, after finding a good way to handle it, we would definitely delete the hudi_v1 format here.

@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 6, 2022

Let me make another pass at all the pom changes. That seems to be main thing here. In the meantime, could you clarify these comments?

Also have you tested these changes across spark 2.x and 3.1/3.2 bundles ?

@vinothchandar Yes, I have manually tested it with spark 3.2.0 and spark 3.1.2 version on spark sql. and the CI tested it on spark 2.4.x and works well.

Copy link
Copy Markdown
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Since a lot of files were moved, cherry pick later commits for 0.10.1 could be problematic. Shall we move this into a feature branch?

@vinothchandar
Copy link
Copy Markdown
Member

@xushiyan given we are almost winding down for 0.10.1, I suggest we land this sooner than later. That way we can focus on stabilizing master for 0.11.0 - that's not too far away. wdyt?

@xushiyan
Copy link
Copy Markdown
Member

xushiyan commented Jan 7, 2022

@xushiyan given we are almost winding down for 0.10.1, I suggest we land this sooner than later. That way we can focus on stabilizing master for 0.11.0 - that's not too far away. wdyt?

it won't be too far away. @nsivabalan is cherry picking for 0.10.1 which will complete by Jan 9. Holding this off for 2 more days can avoid conflicts from some Spark fixes merged after this. Seeing 2 more fixes coming. After that we should be able to land this right away.

@vinothchandar
Copy link
Copy Markdown
Member

2 more days should be okay?

@vinothchandar
Copy link
Copy Markdown
Member

vinothchandar commented Jan 7, 2022

@leesf few questions.

  1. What do the hudi-spark2-extensions/hudi-spark3-extensions do? What code would these have in the future?

  2. Users may have spark jobs that depend directly on hudi-spark . any impact to these from the changes?

  3. I feel there are way too many packages now - hudi-spark*, then it depends hudi-spark-common* . Is there a way for us to merge hudi-spark-common into hudi-spark* ? is the hudi-spark3-common so we can share code across the different 3.x versions? Can you add a README that explains what code should go where going forward

  4. Any updates to release notes or README?

Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some cleanup suggestions. LGTM overall

@nsivabalan
Copy link
Copy Markdown
Contributor

yeah, would really appreciate if we can wait until Jan 9 to land this patch. thanks!

@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 7, 2022

@leesf few questions.

  1. What do the hudi-spark2-extensions/hudi-spark3-extensions do? What code would these have in the future?
  2. Users may have spark jobs that depend directly on hudi-spark . any impact to these from the changes?
  3. I feel there are way too many packages now - hudi-spark*, then it depends hudi-spark-common* . Is there a way for us to merge hudi-spark-common into hudi-spark* ? is the hudi-spark3-common so we can share code across the different 3.x versions? Can you add a README that explains what code should go where going forward
  4. Any updates to release notes or README?
  1. I did intend to remove the modules but wrongly added. removed now.
  2. I did not change hudi-spark module, so it should have no impact on the users.
  3. I think the current modules has be in a good shape, if merge code from hudi-spark-common into hudi-spark* or merge code from hudi-spark3-common into hudi-spark3 & spark3.1.x, there would be many duplicate classes, which i think is not an ideal way.
  4. Added a README under hudi-spark-datasource module.

@leesf leesf force-pushed the HUDI-3172 branch 3 times, most recently from b7b4aca to b6a98a3 Compare January 10, 2022 11:19
@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 11, 2022

yeah, would really appreciate if we can wait until Jan 9 to land this patch. thanks!

@nsivabalan @xushiyan time to land this patch?

@xushiyan
Copy link
Copy Markdown
Member

@leesf yes we won't have conflicting patches to pick from master. we can land this one now.

@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 11, 2022

@leesf yes we won't have conflicting patches to pick from master. we can land this one now.

@xushiyan CI passed, so can we merge?

@leesf
Copy link
Copy Markdown
Contributor Author

leesf commented Jan 13, 2022

@hudi-bot run azure

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@leesf leesf merged commit 5ce45c4 into apache:master Jan 14, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
… V2 Implementation (apache#4514)

* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
… V2 Implementation (apache#4514)

* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
… V2 Implementation (apache#4514)

* Introduce hudi-spark3-common and hudi-spark2-common modules to place classes that would be reused in different spark versions, also introduce hudi-spark3.1.x to support spark 3.1.x.
* Introduce hudi format under hudi-spark2, hudi-spark3, hudi-spark3.1.x modules and change the hudi format in original hudi-spark module to hudi_v1 format.
* Manually tested on Spark 3.1.2 and Spark 3.2.0 SQL.
* Added a README.md file under hudi-spark-datasource module.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants