Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skip header lines in TextIO #28502

Closed
wants to merge 4 commits into from
Closed

Add skip header lines in TextIO #28502

wants to merge 4 commits into from

Conversation

gudfhr95
Copy link

@gudfhr95 gudfhr95 commented Sep 18, 2023

Fixes #17990

Changes

  • Add withSkipHeaderLines
  • Disable isSplittable of TextSource when skipHeaderLines is greater than 0

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions github-actions bot added the java label Sep 18, 2023
@gudfhr95 gudfhr95 marked this pull request as ready for review September 18, 2023 15:47
@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@gudfhr95
Copy link
Author

Run Java_GCP_IO_Direct PreCommit

@gudfhr95
Copy link
Author

Run Java PreCommit

Copy link
Contributor

@damondouglas damondouglas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good day @gudfhr95, Thank you so much for your contribution! I would like to propose the following change to your PR.

Instead of making changes to TextIO, FileIO.ReadableFile's readFullyAsUTF8String() method would still fulfill #17990's feature request. I would like to propose that this PR add a new method such as: public String readFullyAsUTF8String(int skipHeaderLines) throws IOException {}

What do you think of this idea? Your contribution and opinion is important.

Thank you, again for your contribution!

@gudfhr95
Copy link
Author

Hello, @damondouglas. I really appreciate for your review and your suggestion.
I have a question about your suggestion.

As far as I understand, even though I found every occurrences of readFullyAsUTF8String(), I think that there is no relationship between readFullyAsUTF8String() method and reading TextSource or FileBasedSource. Could you explain where can I find connection between them?

@damondouglas
Copy link
Contributor

damondouglas commented Sep 25, 2023

Hello, @damondouglas. I really appreciate for your review and your suggestion. I have a question about your suggestion.

As far as I understand, even though I found every occurrences of readFullyAsUTF8String(), I think that there is no relationship between readFullyAsUTF8String() method and reading TextSource or FileBasedSource. Could you explain where can I find connection between them?

Hello @gudfhr95, It's nice to hear from you and thank you for your response. I just created this example gist that uses FileIO.ReadableFile's readFullyAsUTF8String method. In the gist, I performed the line skipping in a DoFn as copied from the gist below.

@ProcessElement
public void process(@Element ReadableFile element) {
        String rawData = element.readFullyAsUTF8String();
        Stream<String> lines = rawData.lines().skip(/* int */ skipLines);
        List<String> result = lines.collect(Collectors.toList());
}

https://gist.github.com/damondouglas/84c34469d4a6d7b7483468659482855c. Note you can copy the entire contents of the gist and replace the text of https://play.beam.apache.org/ and run on its own.

However, I can see this code in a new readFullyAsUTF8String method that takes an int skipLines argument. So the above DoFn could look like this:

@ProcessElement
public void process(@Element ReadableFile element) {
        List<String> lines = element.readFullyAsUTF8String(/* int */ skipLines);
}

@gudfhr95
Copy link
Author

Ah, I see. Thank you for your detailed examples.
Your suggestion is to add a utility methods in ReadableFIles.

Then, I think that I should revert my previous commits and create new PR only containing readFullyAsUTF8String(int skipLines)
I'll do it ASAP and request your review.

Again, thank you for your kind explanation @damondouglas!

@gudfhr95 gudfhr95 closed this by deleting the head repository Sep 28, 2023
@damondouglas
Copy link
Contributor

@gudfhr95 Take your time. Again, we are very grateful for your contribution and thank you again for listening to my feedback. I look forward to talking with you soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip header row from csv
2 participants