Skip to content

Conversation

@wangxianghu
Copy link
Contributor

@wangxianghu wangxianghu commented May 25, 2020

Tips

What is the purpose of the pull request

This pull request introduce HoodieWriteInput for hudi write client as the unified input format.

Brief change log

  • add a new module hudi-writer-commom
  • introduce HoodieWriteInput for hudi write client as the unified input format.

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@wangxianghu
Copy link
Contributor Author

Hi @yanghua , could you please take a look when free, thanks!

@codecov-commenter
Copy link

codecov-commenter commented May 25, 2020

Codecov Report

Merging #1665 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1665   +/-   ##
=========================================
  Coverage     18.21%   18.21%           
  Complexity      856      856           
=========================================
  Files           348      348           
  Lines         15332    15332           
  Branches       1523     1523           
=========================================
  Hits           2792     2792           
  Misses        12183    12183           
  Partials        357      357           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bde7a70...9f0121f. Read the comment docs.

@vinothchandar vinothchandar self-assigned this May 25, 2020
@vinothchandar
Copy link
Member

Is there an umbrella task to understand how all the follow up work will be.. on this..For e.g I am wondering what the eventual methods on HoodieWriteInput will be and how it will abstract away the RDD construct

@wangxianghu
Copy link
Contributor Author

wangxianghu commented May 26, 2020

Is there an umbrella task to understand how all the follow up work will be.. on this..For e.g I am wondering what the eventual methods on HoodieWriteInput will be and how it will abstract away the RDD construct

Is there an umbrella task to understand how all the follow up work will be.. on this..For e.g I am wondering what the eventual methods on HoodieWriteInput will be and how it will abstract away the RDD construct

yes, here it is : https://issues.apache.org/jira/browse/HUDI-909
currently only four subtasks are filed, which is the very foundation of the entire abstraction:

  1. Introduce HoodieWriteInput for hudi write client: https://issues.apache.org/jira/browse/HUDI-910
  2. Introduce HoodieWriteOutput for hudi write client: https://issues.apache.org/jira/browse/HUDI-911
  3. Introduce HoodieWriteKey for hudi write client: https://issues.apache.org/jira/browse/HUDI-912
  4. Introduce HoodieEngineContext for hudi write client: https://issues.apache.org/jira/browse/HUDI-913

For Spark these could be :

JavaRDD<HoodieRecord<T>> records = ... ; // read from souce
HoodieWriteInput<JavaRDD<HoodieRecord<T>>> inputRecords = new HoodieWriteInput(records);
JavaRDD<HoodieRecord<T>> inputRdds = inputRecords.getInputs();
JavaSparkContext jsc = ...;
SerializableConfiguration hadoopConf = ...;`
HoodieEngineContext  hec = new HoodieSparkEngineContext(hadoopConf, jsc); //HoodieSparkEngineContext extends HoodieEngineContext`
JavaSparkContext jsc = ((HoodieSparkEngineContext)hec).getContext();

The HoodieWriteKey and HoodieWriteOutput are the same as HoodieWriteInput.

upsert api could be like this:

public HoodieWriteOutput<JavaRDD<WriteStatus>> upsert(HoodieWriteInput<JavaRDD<HoodieRecord<T>>> records, final String instantTime) {...}

The content of the method is almost the same as before.

For Java and Flink, just replace the JavaRDD with List.

Copy link
Contributor

@leesf leesf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with the abstraction and reviewed before in other branch.

@leesf
Copy link
Contributor

leesf commented May 28, 2020

@vinothchandar just a reminder on this PR.

@vinothchandar
Copy link
Member

@wangxianghu @leesf sorry for that delays. I was trying to understand the relation to this work
https://issues.apache.org/jira/browse/HUDI-538 cc @yanghua
This is not still clear to me where we are moving towards. While I agree that abstracting RDD into a WriteInput, like to understand how the existing code is going to evolve further. May I ask that we do a draft or first that replaces the entire code in hudi-client with these abstractions ( no need to even have the code compile. But want to understand the final shape we are looking at). I am also happy to do that myself and discuss.. Please let me know if that seems like a reasonable ask

@vinothchandar
Copy link
Member

This pr itself is fine. But given we are adding a new module and this is a critical thing to get right, trying to understand more upfront

@wangxianghu
Copy link
Contributor Author

@wangxianghu @leesf sorry for that delays. I was trying to understand the relation to this work
https://issues.apache.org/jira/browse/HUDI-538 cc @yanghua
This is not still clear to me where we are moving towards. While I agree that abstracting RDD into a WriteInput, like to understand how the existing code is going to evolve further. May I ask that we do a draft or first that replaces the entire code in hudi-client with these abstractions ( no need to even have the code compile. But want to understand the final shape we are looking at). I am also happy to do that myself and discuss.. Please let me know if that seems like a reasonable ask

Hi @vinothchandar thanks for your suggestion. Replacing the entire code in hudi-client with new abstractions is a good choose, I'll do it :)

@wangxianghu wangxianghu marked this pull request as draft June 1, 2020 02:51
@vinothchandar
Copy link
Member

@wangxianghu Thanks for being a sport! That will give us good confidence that we can ultimately pull this off.. Happy to help along as needed..

@wangxianghu
Copy link
Contributor Author

@wangxianghu Thanks for being a sport! That will give us good confidence that we can ultimately pull this off.. Happy to help along as needed..

@wangxianghu wangxianghu closed this Jun 9, 2020
@wangxianghu wangxianghu reopened this Jun 9, 2020
@wangxianghu
Copy link
Contributor Author

wangxianghu commented Jun 9, 2020

Hi @vinothchandar , This is the abstration of hudi-client module(not finished yet):
https://github.com/wangxianghu/hudi/tree/HUDI-xxx
please take a look when free.
cc @yanghua @leesf

@leesf
Copy link
Contributor

leesf commented Jun 10, 2020

Hi @vinothchandar , This is the abstration of hudi-client module(not finished yet):
https://github.com/wangxianghu/hudi/tree/HUDI-xxx
please take a look when free.
cc @yanghua @leesf

Ack, look at the branch, the structure looks good to me, @vinothchandar would you please also take a look when you are free, we need land this feature ASAP.

@wangxianghu wangxianghu marked this pull request as ready for review June 11, 2020 09:33
@vinothchandar
Copy link
Member

Sg.. Will jump on #1727 . Closing this one

@wangxianghu wangxianghu deleted the HUDI-910 branch August 28, 2020 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants