Update document `Storage API Support in Google BigQuery I/O connector` for Python SDK #26889
Conversation
Add usage of BigQuery StorageWriteAPI in Python.
Add usage of BigQuery StorageReadAPI in Python.
Assigning reviewers. If you would like to opt out of this review, comment to opt out.
R: @AnandInguva for label python.
The PR bot will only process comments in the main thread (not review comments).
R: @ahmedabu98
Also, please check the "PythonFormatter" and "PythonLint" results. You can format the code by running:
# Run from root beam repo dir
pip install yapf==0.29.0
git diff HEAD --name-only | grep "\.py$" | xargs yapf --in-place
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control.
apply lint and format
# [START model_bigqueryio_write_with_storage_write_api]
quotes | beam.io.WriteToBigQuery(
    table_spec,
    schema=table_schema,
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API)
# [END model_bigqueryio_write_with_storage_write_api]
Writing with the Storage Write API currently has some limitations on supported data types, due to what types are currently supported at the cross-language boundary. It is important to note that the Python decimal.Decimal type is needed to write a NUMERIC BigQuery type. Similarly, the Beam Timestamp type is needed to write a TIMESTAMP BigQuery type.
Also, some other types are not yet supported, like "DATETIME". Might be worth mentioning some of that; here is the full allowed list:
beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py, lines 112 to 123 in 0b43074:
BIGQUERY_TYPE_TO_PYTHON_TYPE = {
    "STRING": str,
    "BOOL": bool,
    "BOOLEAN": bool,
    "BYTES": bytes,
    "INT64": np.int64,
    "INTEGER": np.int64,
    "FLOAT64": np.float64,
    "FLOAT": np.float64,
    "NUMERIC": decimal.Decimal,
    "TIMESTAMP": apache_beam.utils.timestamp.Timestamp,
}
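To make the type requirements concrete, here is a minimal, hedged sketch of a row that satisfies these mappings when writing with the Storage Write API (the project, dataset, table, and schema below are placeholders, not values from this PR):

import decimal

import apache_beam as beam
from apache_beam.utils.timestamp import Timestamp

# Placeholder table and schema, for illustration only.
table_spec = 'my-project:my_dataset.my_table'
table_schema = 'name:STRING,price:NUMERIC,updated_at:TIMESTAMP'

row = {
    'name': 'quote',
    # NUMERIC columns must be passed as decimal.Decimal values.
    'price': decimal.Decimal('12.34'),
    # TIMESTAMP columns must be passed as Beam Timestamp values.
    'updated_at': Timestamp(seconds=1688376444),
}

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create([row])
     | beam.io.WriteToBigQuery(
         table_spec,
         schema=table_schema,
         method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API))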
Thank you for letting me know. I didn't know about the limitations!
I have just added these notes. 👍 0af3174
#27347
Hi @ahmedabu98,
I tried to write the TIMESTAMP type, but I got this error message:
java.lang.IllegalArgumentException: Input schema is not assignable to output schema.
Input schema=Fields:
.......
INFO 2023-07-03T09:27:24.015855Z Field{name=ts, description=, type=LOGICAL_TYPE<beam:logical_type:micros_instant:v1>, options={{}}}
.......
, Output schema=Fields:
.......
INFO 2023-07-03T09:27:24.016054Z Field{name=ts, description=, type=DATETIME, options={{}}}
.......
I looked into the Beam source code and found that the input mapping and the output mapping do not match in the schema validation code. I think the output mapping should be FieldType.logicalType(SqlTypes.TIMESTAMP).
Is there a way to avoid this error, or is the TIMESTAMP type not supported yet?
I looked into the code again and found this.
It seems that beam:logical_type:millis_instant:v1 is translated as FieldType.DATETIME in SchemaTranslation, which matches the output mapping.
To use MillisInstant, LogicalType.register_logical_type(MillisInstant) should be added to the pipeline code.
Example:
LogicalType.register_logical_type(MillisInstant)

quotes | beam.io.WriteToBigQuery(
    table_spec,
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
)
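For completeness, the registration above assumes the following import path; this is a sketch based on where MillisInstant is typically defined in the Python SDK, so double-check it against your Beam version:

# Assumed import path for LogicalType and MillisInstant (verify for your Beam version).
from apache_beam.typehints.schemas import LogicalType, MillisInstant

# Register the millis-instant logical type before building the pipeline so that
# TIMESTAMP fields are translated to the schema the cross-language sink expects.
LogicalType.register_logical_type(MillisInstant)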
Thanks for looking into this @HaeSe0ng!
@Abacn do you know if it's sufficient for us to include the line LogicalType.register_logical_type(MillisInstant) in the connector? e.g. here:
beam/sdks/python/apache_beam/io/gcp/bigquery.py, lines 2193 to 2194 in 7eaef18:
    else:
      # Storage Write API
Or will we have to tell users to add this line themselves?
Quoting the comment above: "It seems that beam:logical_type:millis_instant:v1 is translated as FieldType.DATETIME in SchemaTranslation, which matches the output mapping. To use MillisInstant, LogicalType.register_logical_type(MillisInstant) should be added to the pipeline code."
Where is LogicalType.register_logical_type(MillisInstant) imported from? I'm trying to find a workaround that's usable for now, since this isn't fixed in v2.49.0. My JSON schema is correctly converted to a bigquery.TableSchema instance, but when passing it to Apache Beam it gives the error you mentioned above.
In my case the timestamp type is incorrectly translated to datetime (LOGICAL_TYPE<beam:logical_type:micros_instant:v1> ➔ DATETIME NOT NULL).
I'd love to help solve this; please let me know of anything I can help with, testing etc.
@JoeCMoore are you getting this error even when setting LogicalType.register_logical_type(MillisInstant)?
@ahmedabu98 No, this was a workaround for the bug. If you need any more details I'd be happy to help.
Also, for the STORAGE_WRITE_API use case we should mention that an expansion service has to be built before running the pipeline. If running from a Beam git clone, they can build it with the corresponding Gradle task. If they want to specify their own expansion service, they can build it and pass it with the parameter referenced here:
beam/sdks/python/apache_beam/io/gcp/bigquery.py, lines 1804 to 1806 in 0b43074
@ahmedabu98 if running a released version of Beam, the expansion service can be automatically downloaded and started. The only thing needed is a Java environment (to run the expansion service).
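As a rough illustration of the custom expansion service path, a hedged sketch mirroring the earlier quotes snippet (the Gradle target and the expansion_service keyword are assumptions based on typical Beam cross-language usage, not taken from this PR, so verify them against your SDK version):

import apache_beam as beam
from apache_beam.transforms.external import BeamJarExpansionService

# Assumption: this Gradle target builds the GCP expansion service jar when
# running from a Beam source checkout; released SDKs can download and start
# one automatically as long as a Java runtime is available.
expansion_service = BeamJarExpansionService(
    'sdks:java:io:google-cloud-platform:expansion-service:shadowJar')

quotes | beam.io.WriteToBigQuery(
    table_spec,
    schema=table_schema,
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    # Assumption: WriteToBigQuery forwards this to the cross-language transform.
    expansion_service=expansion_service)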
avoiding "already exists" error
avoiding "Line too long" pylint
As @ahmedabu98 mentioned, some of the CI runs failed due to the expansion service error.
Can I add a Gradle task …?
Is there another workaround, such as adding …?
Left some suggestions to clarify things a little.
website/www/site/content/en/documentation/io/built-in/google-bigquery.md
Outdated
website/www/site/content/en/documentation/io/built-in/google-bigquery.md
Outdated
Python SDK has some limitations on types. Co-authored-by: Ahmed Abualsaud <[email protected]>
Co-authored-by: Ahmed Abualsaud <[email protected]>
I tried to make small changes to the CI pipeline, but this is not a good approach because I would have to change a lot, or it would result in longer build times. How about creating a new …?
move examples that require expansion-service from snippets.py
Codecov Report
@@ Coverage Diff @@
## master #26889 +/- ##
==========================================
+ Coverage 71.97% 72.13% +0.15%
==========================================
Files 747 836 +89
Lines 101306 102091 +785
==========================================
+ Hits 72920 73647 +727
- Misses 26927 26985 +58
Partials 1459 1459
Flags with carried forward coverage won't be shown.
... and 183 files with indirect coverage changes.
Hey, sorry for the delay, I was on vacation for a while. Would it be feasible to create a new module for these examples, similar to this existing cross-language test?
beam/sdks/python/apache_beam/io/external/xlang_bigqueryio_it_test.py, lines 53 to 56 in 644f539
This means it won't run with the Python Coverage commit tests, but it will run on the cross-language GCP postcommit suites. Let me know what you think of this suggestion :)
Force-pushed from 4467f06 to ee20836.
@ahmedabu98 Hi, I'm also sorry for my delay. I got sick and had to deal with a P1 body issue. 😭
Yes, it makes sense. 👍
I think this should do it
Co-authored-by: Ahmed Abualsaud <[email protected]>
Co-authored-by: Ahmed Abualsaud <[email protected]>
Run Python_Xlang_Gcp_Direct PostCommit
Run Python_Xlang_Gcp_Dataflow PostCommit
Ahh, looks like we're running into a validation error:
This check is done during pipeline construction time; it compares the input data schema with the table schema. I think we can bypass it by giving it a table that doesn't exist, so it will skip the validation check.
pipeline, write_project='', write_dataset='', write_table=''):
  """Examples for cross-language BigQuery sources and sinks."""

  table_spec = 'clouddataflow-readonly:samples.weather_stations'
Suggested change:
- table_spec = 'clouddataflow-readonly:samples.weather_stations'
+ table_spec = 'clouddataflow-readonly:samples.<non-existent-table>'
So everything else can stay the same; just put a table name that doesn't exist. You can even use str(uuid.uuid4()) in the name to make sure.
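For illustration, a small sketch of building a table spec that is guaranteed not to exist (the dataset here is just the sample dataset already used above):

import uuid

# A fresh UUID in the table name ensures the table does not exist, so the
# pipeline-construction-time schema validation is skipped.
table_spec = 'clouddataflow-readonly:samples.nonexistent_table_{}'.format(
    str(uuid.uuid4()).replace('-', '_'))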
Thank you, I added your suggestion. It seems to work! 👍
to avoid CI error
Run Python_Xlang_Gcp_Direct PostCommit
Run Python_Xlang_Gcp_Dataflow PostCommit
Looks like tests are running and passing now, thank you for these changes! LGTM
@Abacn for final review and merge
Thanks! LGTM, merging for now.
Update document `Storage API Support in Google BigQuery I/O connector` for Python SDK (apache#26889)
* Update BigQuery document; Python SDK (apache#26693). Add usage of BigQuery StorageWriteAPI in Python.
Co-authored-by: Ahmed Abualsaud <[email protected]>
This pull request contains the changes below. 👏
- Add usage of WriteToBigQuery.Method.STORAGE_WRITE_API to the document.
- Add usage of ReadFromBigQuery.Method.DIRECT_READ to the document.
Here is example code to check the features: https://gist.github.com/RyuSA/84968c2322771ce411e63154593b2319
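As a quick illustration of the read-side usage being documented, a minimal sketch (it uses the public weather_stations sample table referenced elsewhere in this PR; it is not the exact snippet added to the docs):

import apache_beam as beam

table_spec = 'clouddataflow-readonly:samples.weather_stations'

with beam.Pipeline() as pipeline:
    # Read rows directly via the BigQuery Storage Read API.
    rows = (
        pipeline
        | beam.io.ReadFromBigQuery(
            table=table_spec,
            method=beam.io.ReadFromBigQuery.Method.DIRECT_READ))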
Context
In the original issue (#26693), we discussed only STORAGE_WRITE_API, but I found that the BigQuery read API is also supported now, so I added the usage of the read API too. If you see any concerns, please let me know.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch): see CI.md for more information about GitHub Actions CI.