What needs to happen?
In the BigQuery Storage API batch connector, we use Pending streams to write to BigQuery. The final step in the connector is to commit the stream contents to the table.
Currently we perform a single batch commit for all streams. There is a quota on the number of bytes that can be committed per operation: 1 TB for small regions and 10 TB for multi-regions. This effectively caps any batch write job at that limit. Would it be a good idea to break this up into multiple back-to-back commits?
@Abacn brings up a good point in this comment: the single commit may be intentional, to avoid partially written data in the rare case where the whole pipeline fails between commits (and is unable to retry).
However, limiting it to one commit would place a hard restriction on the amount of data one can write with this connector.
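For illustration, here is a minimal sketch of what chunked commits could look like, using the BigQuery Storage Write API's BatchCommitWriteStreams call. This is not the connector's actual code: the per-stream byte counts (streamBytes) are hypothetical bookkeeping the caller would have to maintain, the MAX_BYTES_PER_COMMIT constant is an assumed quota value, and the error handling is deliberately simplistic. Note the tradeoff discussed above: once commits are split, a failure between two commits leaves the table partially written.

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChunkedCommitSketch {

  // Assumed per-operation quota; the real limit depends on the region
  // (1 TB for small regions, 10 TB for multi-regions per the issue text).
  private static final long MAX_BYTES_PER_COMMIT = 1L << 40; // 1 TB

  /**
   * Commits finalized pending streams in multiple back-to-back batches,
   * keeping each BatchCommitWriteStreams call under the assumed byte quota.
   * streamBytes maps stream name -> approximate bytes written to that stream
   * (hypothetical bookkeeping; the Storage API does not return this directly).
   */
  static void commitInChunks(
      BigQueryWriteClient client, String tableName, Map<String, Long> streamBytes) {
    List<String> batch = new ArrayList<>();
    long batchBytes = 0;
    for (Map.Entry<String, Long> entry : streamBytes.entrySet()) {
      if (!batch.isEmpty() && batchBytes + entry.getValue() > MAX_BYTES_PER_COMMIT) {
        commitBatch(client, tableName, batch);
        batch.clear();
        batchBytes = 0;
      }
      batch.add(entry.getKey());
      batchBytes += entry.getValue();
    }
    if (!batch.isEmpty()) {
      commitBatch(client, tableName, batch);
    }
  }

  private static void commitBatch(
      BigQueryWriteClient client, String tableName, List<String> streams) {
    BatchCommitWriteStreamsRequest request =
        BatchCommitWriteStreamsRequest.newBuilder()
            .setParent(tableName)
            .addAllWriteStreams(streams)
            .build();
    BatchCommitWriteStreamsResponse response = client.batchCommitWriteStreams(request);
    if (!response.hasCommitTime()) {
      // No commit time means the batch did not commit; stream-level errors
      // are reported here. A real implementation would retry or fail the step.
      throw new RuntimeException("Commit failed: " + response.getStreamErrorsList());
    }
  }
}
```

The key design question is exactly the one raised above: each commitBatch call is atomic on its own, but the sequence of calls is not, so a pipeline failure partway through leaves some streams committed and others not.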
Issue Priority
Priority: 2 (default / most normal work should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner