[BEAM-10785] Add support for coder argument in WriteToBigQuery #17518
Can one of the admins verify this patch? |
R: @aaltay |
Codecov Report
@@ Coverage Diff @@
## master #17518 +/- ##
==========================================
- Coverage 74.08% 73.98% -0.11%
==========================================
Files 697 694 -3
Lines 91980 91454 -526
==========================================
- Hits 68141 67659 -482
+ Misses 22590 22582 -8
+ Partials 1249 1213 -36
|
Run Python_PVR_Flink PreCommit |
R: @johnjcasey |
This LGTM. I don't think we need a test case for this, because ultimately it is just passing the coder downstream |
Could you please resolve the conflicts? And I can merge after that. |
My apologies for the delay in reviewing this, but it is still not clear to me why we need to add this feature to WriteToBQ. I see in https://issues.apache.org/jira/browse/BEAM-10785 that non-ASCII characters are replaced when formatting in JSON - is that correct? Does this cause a problem when inserting into BigQuery? Can you explain the problem in more detail? Generally, I'd prefer that we not define a new parameter, but rather fix the existing coder if it has any issues. Can you please share the use case that this would address? |
@pabloem The problem in my case occurred when I was processing chat data that included emojis and writing it into BigQuery (they were all replaced with the replacement character), so our main need was to disable ASCII escaping in the coder's JSON encoding, as in this custom coder:

    class CustomRowAsDictJsonCoder(coders.Coder):
        def encode(self, table_row):
            try:
                # ...
                return json.dumps(table_row, ensure_ascii=False, default=default_encoder).encode("utf-8")
                # ------------------
            # except: ...

I would also prefer not to define any additional parameters if possible, but I thought that we had no way to modify parameters inside the coder, or to replace the coder. Please correct me if you have any concerns about this. |
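For readers following along, here is a minimal sketch of the behavior under discussion, using only Python's standard json module. The helper name encode_row and the sample row are hypothetical and are not part of Beam. It shows only how ensure_ascii changes the serialized bytes, not how BigQuery handles either form.

```python
import json


def encode_row(row, ensure_ascii=True):
    """Serialize a row dict to UTF-8 encoded JSON bytes (hypothetical helper)."""
    return json.dumps(row, ensure_ascii=ensure_ascii).encode("utf-8")


# Hypothetical chat row containing an emoji outside the Basic Multilingual Plane.
row = {"user": "alice", "message": "hello 😀"}

# Default: non-ASCII characters are written as \uXXXX escapes
# (a surrogate pair for this emoji).
escaped = encode_row(row)

# ensure_ascii=False: the emoji is written as raw UTF-8 bytes.
unescaped = encode_row(row, ensure_ascii=False)

print(escaped)
print(unescaped)

# Both forms parse back to the same dict in Python.
assert json.loads(escaped) == json.loads(unescaped) == row
```

Both outputs are valid JSON, so any difference seen downstream comes from how the consumer interprets the escaped form.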
@aaltay Conflict is resolved. |
@harrydrippin thanks for the analysis, and great catch! - I think your fix to the coder itself would be a much better fix. Would you be willing to perform that fix instead? |
@pabloem Just to be sure, do you mean it is fine if I apply the change to the original coder? I will submit another PR for this if you confirm. Thanks! |
yes @harrydrippin that's correct - and if you could, a test case would be ideal |
@harrydrippin - What is the next step on this PR? |
@aaltay I am going to submit another PR (fixing coder) for this issue. |
I'll close this as we got the other fix. Sorry again about the big delay! |
This PR fixes BEAM-10785 by passing a coder argument from WriteToBigQuery through to JsonRowWriter, so that users can replace the coder on batch pipelines. The coder argument defaults to RowAsDictJsonCoder if the user does not specify one.
This change requires adding the coder argument to the following classes and functions, which this PR also updates:
BigQueryBatchFileLoads
WriteRecordsToFile
WriteGroupedRecordsToFile
_make_new_file_writer()
JsonRowWriter
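The idea of threading an optional coder down to the row writer can be sketched roughly as follows. This is a standalone illustration, not Beam's code: AsciiJsonCoder, Utf8JsonCoder, SketchRowWriter, and write_rows are hypothetical stand-ins for the default coder, a user-supplied coder, the row writer, and the intermediate layers that forward the argument.

```python
import io
import json


class AsciiJsonCoder:
    """Hypothetical default coder: JSON with non-ASCII characters escaped."""

    def encode(self, row):
        return json.dumps(row).encode("utf-8")


class Utf8JsonCoder:
    """Hypothetical user-supplied coder: JSON with raw UTF-8 characters."""

    def encode(self, row):
        return json.dumps(row, ensure_ascii=False).encode("utf-8")


class SketchRowWriter:
    """Writes one encoded row per line (newline-delimited JSON)."""

    def __init__(self, file_handle, coder=None):
        self._file = file_handle
        # Fall back to the default coder when the caller does not supply one.
        self._coder = coder if coder is not None else AsciiJsonCoder()

    def write(self, row):
        self._file.write(self._coder.encode(row) + b"\n")


def write_rows(rows, coder=None):
    """Stand-in for the layers that simply forward the coder downstream."""
    buffer = io.BytesIO()
    writer = SketchRowWriter(buffer, coder=coder)
    for row in rows:
        writer.write(row)
    return buffer.getvalue()


default_output = write_rows([{"m": "é"}])
custom_output = write_rows([{"m": "é"}], coder=Utf8JsonCoder())
```

Each intermediate layer only forwards the argument, which is why the change touches several classes without altering their behavior when no coder is given.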
This version was tested on DataflowRunner, and there were no problems when using a custom coder. This PR does not yet include any additional unit tests, so it would be great if reviewers could suggest appropriate ones. Please let me know if there are any concerns about this PR. Thank you!
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
Update CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
See CI.md for more information about GitHub Actions CI.