
Conversation

@sopel39 (Member) commented May 13, 2025

Description

Default of 16MB causes S3 throttling issues with large clusters during inserts.

Release notes

(x) Release notes are required, with the following suggested text:

## Hive, Iceberg, Delta Lake
* Reduce S3 throttling failures by increasing the streaming upload part size and reducing the number of requests. ({issue}`25781`)

@pettyjamesm (Member) commented:

Agreed that 16MB seems too small, but 256MB might be too large for a default value, especially given that `S3FileSystemConfig` has 256MB as the maximum allowed value. Obviously this is a somewhat arbitrary value that partly depends on how much memory you are willing to set aside for S3 upload buffering, how close you run to the throttling limit on your S3 bucket, and so on. But there is a downside to going too large: if your average file size is smaller than 2x the part size, you lose upload performance by limiting the opportunity for parallelism. Would 64MB or 128MB make more sense here?

@sopel39 (Member, Author) commented May 13, 2025

> S3FileSystemConfig has 256MB as the maximum allowed value

The 256MB max is arbitrary. According to https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html, the actual maximum part size is 5GB.

> Would 64MB or 128MB make more sense here?

We already have writer scaling within a worker and across the cluster. We don't need extra parallelism here, but rather reliability. It's pretty easy to exhaust S3 limits even with a moderate cluster (5-10 nodes).

@pettyjamesm (Member) commented May 13, 2025

> 256MB max is arbitrary

I agree that the maximum configurable value was probably an arbitrary choice, but `S3FileSystemConfig#getStreamingPartSize()` is still annotated with `@MaxDataSize("256MB")`, so it would be weird to have the maximum and the default be the same in this instance, no?
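For context, the annotation in question sits on a bounded data-size config property; a minimal airlift-style sketch is shown below. Only `getStreamingPartSize()`, `@MaxDataSize("256MB")`, and the `s3.streaming.part-size` property name come from this discussion; the default shown, the minimum bound, and the class name are illustrative assumptions, not a copy of the actual `S3FileSystemConfig`.

```java
import io.airlift.configuration.Config;
import io.airlift.units.DataSize;
import io.airlift.units.MaxDataSize;
import io.airlift.units.MinDataSize;

import static io.airlift.units.DataSize.Unit.MEGABYTE;

// Sketch of a bounded data-size config property in the airlift style.
public class ExampleS3StreamingConfig
{
    private DataSize streamingPartSize = DataSize.of(16, MEGABYTE); // the default under discussion

    @MinDataSize("5MB")   // assumption: S3 multipart uploads reject parts below 5MB (except the last part)
    @MaxDataSize("256MB") // raising the default to 256MB would make it equal to this upper bound
    public DataSize getStreamingPartSize()
    {
        return streamingPartSize;
    }

    @Config("s3.streaming.part-size")
    public ExampleS3StreamingConfig setStreamingPartSize(DataSize streamingPartSize)
    {
        this.streamingPartSize = streamingPartSize;
        return this;
    }
}
```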

> We already have writer scaling within a worker and across the cluster. We don't need extra parallelism here, but rather reliability.

This will still hurt pipelined parallelism for each writer instance when the file size is smaller than 2 * partSize, which means worse overall performance on each of those file writers, observed as a long delay when attempting to close the upload stream. It also means that the memory cost for writer scaling will be higher, because the amount of memory buffered per writer increases by 16x, which is another way to hurt reliability here, this time via OOM.

> It's pretty easy to exhaust S3 limits even with a moderate cluster (5-10 nodes).

You should be able to push 3,500 operations per second through the API, and each stream has a maximum upload speed of ~100MB/s, so you would need a very large number of open writers on a cluster with that many nodes to bump up against the limit (and with a value of 256MB, you would now expect `#close()` to take ~2.5 seconds to complete in a best-case scenario when the buffer is full).
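To make those numbers concrete, here is a small back-of-the-envelope sketch. The ~100MB/s per-stream upload speed is the figure quoted above; the 512MB average file size and the code itself are only an assumed illustration of the tradeoff, not anything from the Trino codebase.

```java
// Back-of-the-envelope math: fewer, larger parts mean fewer upload requests per file
// (less throttling pressure) but a longer flush of the final buffer on close().
public class PartSizeMath
{
    public static void main(String[] args)
    {
        double uploadSpeedMBps = 100.0; // quoted per-stream S3 upload throughput
        long fileSizeMB = 512;          // assumed average output file size

        for (long partSizeMB : new long[] {16, 32, 64, 128, 256}) {
            long partUploads = (fileSizeMB + partSizeMB - 1) / partSizeMB; // upload requests per file
            double closeDelaySeconds = partSizeMB / uploadSpeedMBps;       // flushing one full buffer on close()
            System.out.printf(
                    "part=%3dMB -> %2d upload requests per %dMB file, ~%.2fs to flush a full buffer on close()%n",
                    partSizeMB, partUploads, fileSizeMB, closeDelaySeconds);
        }
    }
}
```

With a 256MB part size the last flush takes roughly 2.5 seconds, matching the estimate above, while a 512MB file needs only 2 upload requests instead of 32 at the old 16MB default.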

Again, I'm just saying that 256MB seems high as a default and that 64MB or 128MB seems like a more conservative increase, knowing that users can always configure larger values if it makes sense for their workload. If you still think 256MB is the right arbitrary default to pick, though, then you probably also want to increase the maximum allowed value that the config permits.

@raunaqmorarka (Member) left a comment

The default max row group size in the Parquet writer is 128MB, so raising `s3.streaming.part-size` beyond that by default does not make sense.
We had lots of customers complaining about increased memory requirements for writes when task-level writer scaling was rolled out, so I would rather not aggressively push up memory requirements for writes again.
I think the default config bump should be the minimal amount needed to get past throttling, like 32MB.
If we have to push beyond that, then I recommend adding a note about tuning this higher in the documentation and making it easier for users to observe whether they are being throttled.
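If such a documentation note materializes, a catalog-level override would presumably look something like the snippet below. The 128MB value and the file name are only examples of tuning above the new default, not a recommendation from this thread.

```properties
# Hypothetical catalog properties override (e.g. etc/catalog/hive.properties)
# for clusters that still hit S3 throttling at the 32MB default.
s3.streaming.part-size=128MB
```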

@sopel39 (Member, Author) commented May 14, 2025

Changed to 32MB. Empirically, 16MB is too small even for 10 nodes. With 256MB, throttling issues go away even with large clusters and no node cap is required. I'm not sure whether 32MB will solve the issue, though.

@sopel39 sopel39 requested a review from raunaqmorarka May 14, 2025 09:51
@raunaqmorarka raunaqmorarka requested a review from pettyjamesm May 14, 2025 10:12
@sopel39 sopel39 merged commit 82f3f03 into trinodb:master May 14, 2025
67 checks passed
@sopel39 sopel39 changed the title from "Increase streaming part size to 256MB" to "Increase streaming part size to 32MB" May 14, 2025
@sopel39 sopel39 deleted the ks/bump_size branch May 14, 2025 10:42
@github-actions github-actions bot added this to the 476 milestone May 14, 2025