What happened?
Description
The default batch size for PubsubIO.Write in the Java SDK is smaller than the maximum message size allowed by Google Cloud Pub/Sub. As a result, a message close to the 10MB limit cannot be placed in a batch, and the write fails unexpectedly.
The Google Cloud Pub/Sub documentation specifies a maximum message size of 10MB. However, the default batch size in the Apache Beam Java SDK is lower: 7,500,000 bytes, according to the error message below. Sending a message larger than the default batch size therefore throws an exception, even when the message is within the 10MB Pub/Sub limit.
This can be confusing for developers who expect to be able to send messages up to the documented Pub/Sub limit. It also requires a manual workaround to set a larger batch size, which may not be obvious to all users.
Steps to Reproduce
- Create a pipeline that uses PubsubIO.Write to send a message to a Pub/Sub topic.
- Create a message that is larger than the default batch size, but smaller than 10MB (e.g., 8MB).
- Run the pipeline without explicitly setting maxBatchBytesSize.
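The steps above can be sketched roughly as follows. This is a minimal, untested sketch that assumes the Beam GCP IO module is on the classpath; the class name and the topic path are hypothetical placeholders, and the 8,000,000-byte payload is chosen only to match the error message in this report.

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class LargeMessageRepro {
  // Hypothetical topic; replace with a real one.
  private static final String TOPIC = "projects/my-project/topics/my-topic";

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // One message above the default batch size but below the 10MB Pub/Sub limit.
    byte[] payload = new byte[8_000_000];
    PubsubMessage message = new PubsubMessage(payload, Collections.emptyMap());

    // No batch byte size is set here, so the default applies.
    PubsubIO.Write<PubsubMessage> write = PubsubIO.writeMessages().to(TOPIC);

    // Manual workaround mentioned in this issue: raise the batch byte limit.
    // The value below is an assumption, not a recommendation.
    // write = write.withMaxBatchBytesSize(10_000_000);

    pipeline
        .apply(Create.of(message).withCoder(PubsubMessageWithAttributesCoder.of()))
        .apply(write);

    pipeline.run().waitUntilFinish();
  }
}
```

Uncommenting the withMaxBatchBytesSize line is the manual workaround referred to above. This report does not establish which value is safe to use there.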
Expected Behavior
The pipeline should be able to send a message that is smaller than the 10MB Pub/Sub limit, even if it is larger than the default batch size. The default batch size should be at least as large as the maximum allowed message size.
Actual Behavior
The pipeline fails with a javax.naming.SizeLimitExceededException, indicating that the message size exceeds the batch size limit. The error message is similar to the following:
javax.naming.SizeLimitExceededException: Pubsub message of length 8000000 exceeds maximum of 7500000 bytes, when considering the payload and attributes. See https://cloud.google.com/pubsub/quotas#resource_limits
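The numbers in that message can be checked with a few lines of plain Java. The helper below is hypothetical and is not Beam's implementation. It only compares the reported message size against the limit from the exception text and against the documented Pub/Sub limit, taken here as 10,000,000 bytes (an assumption about how "10MB" is counted).

```java
/** Hypothetical helper, not part of Apache Beam: compares the sizes from the error message. */
final class BatchLimitCheck {
  // Limit reported in the exception text.
  static final int REPORTED_BATCH_LIMIT_BYTES = 7_500_000;
  // Documented Pub/Sub message limit, assumed here to mean decimal 10 MB.
  static final int PUBSUB_MESSAGE_LIMIT_BYTES = 10_000_000;

  /** Returns true when a message of the given size is within the given byte limit. */
  static boolean fits(int messageBytes, int limitBytes) {
    return messageBytes <= limitBytes;
  }

  public static void main(String[] args) {
    int messageBytes = 8_000_000; // message length from the exception text

    System.out.println("within batch limit:   " + fits(messageBytes, REPORTED_BATCH_LIMIT_BYTES));
    System.out.println("within Pub/Sub limit: " + fits(messageBytes, PUBSUB_MESSAGE_LIMIT_BYTES));
    System.out.println("gap between limits:   "
        + (PUBSUB_MESSAGE_LIMIT_BYTES - REPORTED_BATCH_LIMIT_BYTES) + " bytes");
  }
}
```

Under these assumptions, any message between 7,500,001 and 10,000,000 bytes is rejected by the default configuration although Pub/Sub itself would accept it. The reported limit is exactly three quarters of 10,000,000.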
Proposed Solution
There are a few possible solutions to this issue:
- Increase the default maxBatchBytesSize to 10MB. This would align the default behavior with the documented Pub/Sub limit and allow larger messages to be sent without any additional configuration.
- Improve the documentation to make it clear that the default batch size is smaller than the maximum message size. This would help developers understand the limitation and know that they need to manually configure the batch size for larger messages.
Given the principle of least surprise, increasing the default batch size seems like the most appropriate solution. It would make the library more intuitive to use and prevent unexpected failures. If there is a reason for the default batch size to be smaller than 10MB, this should be clearly documented.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner