-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-76] Add CSV Source support for Hudi Delta Streamer #1165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-76] Add CSV Source support for Hudi Delta Streamer #1165
Conversation
|
is this still WIP? |
|
@vinothchandar I'm adding unit test cases. |
|
@yihua : Please ping me once the PR is ready for review. Thanks, |
c73b7e9 to
f37bd95
Compare
f37bd95 to
03f37c4
Compare
|
@bvaradar This PR is ready for review. @leesf @vinothchandar Feel free to also review this PR. I'm not sure if we can merge this PR by the release cut. If not, we can add this feature to the next release. Thanks @UZi5136225 for helping test the functionality of this PR and reporting the issue of corrupt data generated from DeltaStreamer with text files (CSV format with no header line). The latter has been fix in another PR. |
|
|
e68ad87 to
f9bf397
Compare
|
@yihua are you targeting this for the next release still |
|
@vinothchandar From my side, the code change is ready. I'm not sure if it can be reviewed and merged in time. I'm fine with pushing this to v0.6.0. |
|
@bvaradar I added more javadoc and checked that Spark CSV supports timestamp-type fields. |
|
cc @pratyakshsharma could you help review this? |
|
@vinothchandar ack. |
|
cc @nsivabalan could you take over in case @pratyakshsharma is busy.. |
|
@vinothchandar will review it by EOD today. :) |
|
Sure. go ahead. I also plan to review it sometime. but will let you be the primary reviewer. |
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
|
@nsivabalan Please shepherd this across the finish line :) |
|
Sure. Will take care. |
bb188f2 to
e23adf4
Compare
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pratyakshsharma Thanks for the review! Good catches on the nits. I've addressed them.
@nsivabalan The PR is ready for a final pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for CSV format, the nested schema is not well supported. So to test CSV source, we need to generate the test CSV data with a flattened schema.
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments. Will merge once addressed.
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CsvDFSSource.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/UtilitiesTestBase.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestCsvDFSSource.java
Outdated
Show resolved
Hide resolved
|
One question about using nested schema. Can you remind me what happens if someone passes in a nested schema for CsvDeltaStreamer? |
|
@yihua : Can we get this across the line. |
|
Sorry for the delay. I'll get to this PR this week. |
I used some code below to test the nested schema for CSV reader in Spark. It throws the following exception, which means that Spark CSV source does not support nested schema currently. In most cases, the CSV schemas should be flattened. It depends on Spark's behavior whether nested schema is supported for CSV source (in the future nested schema may be supported for CSV). So we don't enforce the check in our Hudi code. Appendix: a simple diff for testing nested CSV schema: |
e23adf4 to
fb6bc0b
Compare
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan Thanks for the review. Please take another look.
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CsvDFSSource.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/UtilitiesTestBase.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestCsvDFSSource.java
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #1165 +/- ##
============================================
+ Coverage 67.69% 67.74% +0.04%
- Complexity 243 253 +10
============================================
Files 338 339 +1
Lines 16371 16396 +25
Branches 1672 1676 +4
============================================
+ Hits 11083 11108 +25
Misses 4548 4548
Partials 740 740
Continue to review full report at Codecov.
|
|
@yihua : LGTM. Can you squash the commits. I will merge after that. |
|
@vinothchandar : do you think we can have a blog on deltastreamer with csv? |
|
I plan to write one after this is merged. |
fb6bc0b to
cf765df
Compare
|
@nsivabalan Done squashing the commits. |
|
yes.. lets please do a blog! |
|
I don't see an option to merge the PR. Is it that @leesf is yet to approve? or do I need to request permission or something ? |
@nsivabalan Please refer to this wiki to get github write access to the repository. https://cwiki.apache.org/confluence/display/HUDI/Committer+On-boarding+Guide |
|
@leesf : Thanks. I got the permission now. |
You are welcome and a nice shot. just one minor tip, please merge(squash & merge / rebase & merge) with [HUDI-xxx] at the begining of commit. :) |
|
got it, sure. |
|
@leesf this has happened enough times now, that we probably need a Code Review guide as well? wdyt |
Agree, I would like to update https://cwiki.apache.org/confluence/display/HUDI/Committer+On-boarding+Guide to add Code Review guide for new committers, wdyt? |
|
I feel it can go on the contributing guide.. Code reviews are also contributing :) .. either way is fine by me.. Draft something and share on the mailing list? |
Sure, will draft when get a chance. |
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What is the purpose of the pull request
Add CSV Source support for Hudi Delta Streamer
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.