Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-10785] Add support for coder argument in WriteToBigQuery #17518

Closed
wants to merge 3 commits into from

Conversation

harrydrippin
Copy link
Contributor

This PR fixes BEAM-10785 by adding the coder argument from WriteToBigQuery to JsonRowWriter to enable users to modify the coder for various reasons on batch pipeline. The coder argument will be defaulted to RowAsDictJsonCoder if user didn't specify.

This modification needs to add coder argument on below classes and functions. This PR also applies changes for them.

  • BigQueryBatchFileLoads
  • WriteRecordsToFile
  • WriteGroupedRecordsToFile
  • _make_new_file_writer()
  • JsonRowWriter

This version was tested on DataflowRunner and there were no any problems while using custom coder. But this PR does not have any additional unit tests now, so it will be great if reviewers can provide some directions about appropriate unit tests. Please let me know if there are any concerns on this PR. Thank you!


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@asf-ci
Copy link

asf-ci commented May 1, 2022

Can one of the admins verify this patch?

1 similar comment
@asf-ci
Copy link

asf-ci commented May 1, 2022

Can one of the admins verify this patch?

@harrydrippin
Copy link
Contributor Author

R: @aaltay
R: @charlesccychen

@codecov
Copy link

codecov bot commented May 1, 2022

Codecov Report

Merging #17518 (b7b325d) into master (bf4f641) will decrease coverage by 0.10%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #17518      +/-   ##
==========================================
- Coverage   74.08%   73.98%   -0.11%     
==========================================
  Files         697      694       -3     
  Lines       91980    91454     -526     
==========================================
- Hits        68141    67659     -482     
+ Misses      22590    22582       -8     
+ Partials     1249     1213      -36     
Flag Coverage Δ
python 83.73% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py 65.11% <100.00%> (+0.15%) ⬆️
...s/python/apache_beam/io/gcp/bigquery_file_loads.py 87.79% <100.00%> (+0.08%) ⬆️
sdks/python/apache_beam/io/gcp/bigquery_tools.py 84.34% <100.00%> (ø)
sdks/go/pkg/beam/core/graph/mtime/time.go 51.35% <0.00%> (-17.07%) ⬇️
sdks/go/pkg/beam/core/graph/fn.go 76.03% <0.00%> (-9.31%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/pardo.go 50.67% <0.00%> (-8.44%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/fn.go 62.41% <0.00%> (-7.14%) ⬇️
sdks/go/pkg/beam/runners/dataflow/dataflow.go 53.02% <0.00%> (-5.61%) ⬇️
sdks/python/apache_beam/internal/gcp/auth.py 73.33% <0.00%> (-5.34%) ⬇️
...ks/go/pkg/beam/runners/dataflow/dataflowlib/job.go 16.26% <0.00%> (-5.29%) ⬇️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf4f641...b7b325d. Read the comment docs.

@harrydrippin
Copy link
Contributor Author

Run Python_PVR_Flink PreCommit

@aaltay aaltay requested a review from pabloem May 7, 2022 01:37
@aaltay
Copy link
Member

aaltay commented May 27, 2022

R: @johnjcasey

@johnjcasey
Copy link
Contributor

This LGTM. I don't think we need a test case for this, because ultimately it is just passing the coder downstream

@aaltay
Copy link
Member

aaltay commented Jun 1, 2022

Could you please resolve the conflicts? And I can merge after that.

@pabloem
Copy link
Member

pabloem commented Jun 1, 2022

my apologies on the delay to review this, but it is still not clear to me why we need to add this feature to WriteToBQ.

I see in https://issues.apache.org/jira/browse/BEAM-10785 that non-ascii characters are replaced when formatting in JSON - is that correct? Does this cause a problem when inserting into BigQuery? Can you explain the problem in more detail?

Generally, I'd prefer if we did not have to define a new parameter - but rather fix the existing coder if it has any issues. Can you please share the use case that this would address?

@harrydrippin
Copy link
Contributor Author

@pabloem The problem in my case was occurred when I was processing the chat data including emojis and putting it into BigQuery (they were all replaced to replacement character), so our major need in this problem was to disable ensure_ascii from True to False on json.dumps(). But there was no exposed control for replacing that argument, so I temporarily customized RowAsDictJsonCoder and WriteToBigQuery in my environment like below:

class CustomRowAsDictJsonCoder(coders.Coder):

    def encode(self, table_row):
        try:
            # ...
            return json.dumps(table_row, ensure_ascii=False, default=default_encoder).encode("utf-8")
            #                            ------------------
        # except: ...

I also prefer to not define any additional parameters if possible, but I thought that we don't have any possible way to modify parameters inside the coder, or replace the coder. Please correct me if you have any concern over this.

@harrydrippin
Copy link
Contributor Author

@aaltay Conflict is resolved.

@pabloem
Copy link
Member

pabloem commented Jun 3, 2022

@harrydrippin thanks for the analysis, and great catch! - I think your fix to the coder itself would be a much better fix. Would you be willing to perform that fix instead?

@harrydrippin
Copy link
Contributor Author

@pabloem Just for sure, do you mean it is good if I apply the change to the original RowAsDictJsonCoder implementation just like my custom one? (adding ensure_ascii=False on the json.dumps)

I will submit another PR for this if you confirm. Thanks!

@pabloem
Copy link
Member

pabloem commented Jun 9, 2022

yes @harrydrippin that's correct - and if you could, a test case would be ideal

@aaltay
Copy link
Member

aaltay commented Jun 18, 2022

@harrydrippin - What is the next step on this PR?

@harrydrippin
Copy link
Contributor Author

@aaltay I am going to submit another PR (fixing coder) for this issue.

@pabloem
Copy link
Member

pabloem commented Sep 30, 2022

I'll close this as we got the other fix. Sorry again about the big delay!

@pabloem pabloem closed this Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants