The problem
Suppose you use the defaults:
- `s3_object_key_format` of `%{path}%{time_slice}_%{index}.%{file_extension}`
- `time_slice_format` of `%Y%m%d%H`
And suppose you flush every 30 seconds. So 120 files per hour.
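For concreteness, here is a minimal match section illustrating that setup; the bucket name and `path` prefix are placeholders, and the key-format and time-slice values are just the defaults spelled out:

```
<match **>
  @type s3
  s3_bucket my-bucket        # placeholder
  path foo-                  # placeholder prefix
  s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
  time_slice_format %Y%m%d%H
  flush_interval 30s         # 120 flushes per hour
</match>
```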
The first flush will check whether `foo-2016010109_1.gz` exists via an S3 HEAD request, see that it doesn't, and upload to that filename.
The next flush will first check whether `foo-2016010109_1.gz` exists via an S3 HEAD request, see that it does, increment the index to `foo-2016010109_2.gz`, check whether that exists via another S3 HEAD request, see that it doesn't, and then upload to that filename.
This will continue. When we get to the final file of the hour (the 120th file), we'll first do 119 HEAD requests!
That's 1+2+...+119 = 7140 S3 requests over the hour. And that's per input log file, per instance.
S3 HEAD requests cost "$0.004 per 10,000 requests". So the monthly cost of the above, for 5 log files on each of 100 instances, is 7140 × 5 × 100 × 24 × 30 × $0.004 / 10,000 ≈ $1028.
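The arithmetic can be checked with a few lines of Ruby; all the figures come from the issue text, only the variable names are mine:

```ruby
# The 120th file of the hour probes ~119 existing keys before finding a
# free index, so the hour's total is roughly 1 + 2 + ... + 119.
head_requests_per_hour = (1..119).reduce(:+)    # 7140

log_files = 5
instances = 100
hours     = 24 * 30              # one month
price     = 0.004 / 10_000.0     # dollars per HEAD request

monthly_requests = head_requests_per_hour * log_files * instances * hours
monthly_cost     = monthly_requests * price

puts head_requests_per_hour      # 7140
puts monthly_cost.round(2)       # 1028.16
```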
More generally, 1+2+...+n is O(n^2) and we can reduce this to O(n).
Solutions
(a) The user can modify `time_slice_format` to include `%M`. Or the default could include `%M`.
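With minute-granularity slices and a 30-second flush interval, each slice holds at most two chunks, so the index never climbs far. The change is a one-line config tweak:

```
time_slice_format %Y%m%d%H%M
```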
(b) fluent-plugin-s3 could remember the last index it uploaded to, and so not have to check whether the n-1 earlier files already exist: fluentd would already know they do.
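A minimal sketch of what (b) could look like, assuming a per-slice cache; the names (`IndexCache`, `object_exists?`) are illustrative, not the plugin's real API:

```ruby
# Hypothetical sketch of solution (b): cache the last index used for each
# time slice so that each flush issues at most one S3 HEAD request.
class IndexCache
  def initialize
    @last_index = Hash.new(0)   # time_slice => last index uploaded
  end

  # Returns the next free index, probing S3 only once in the common case.
  def next_index(time_slice)
    i = @last_index[time_slice] + 1
    # Usually false on the first try, since we start past the cached index.
    i += 1 while object_exists?(time_slice, i)
    @last_index[time_slice] = i
    i
  end

  private

  # Stand-in for the plugin's real S3 HEAD check.
  def object_exists?(time_slice, index)
    false
  end
end
```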
If either solution were implemented, we'd have reduced the number of HEAD requests from O(n^2) to O(n). (Technically, (a) doesn't change the O(n^2) complexity; it just makes our n tiny.)
So rather than 7140 S3 requests per hour per log file per instance, we'd only do 120.
This reduces the monthly cost from $1028 to $17.
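The before/after comparison, under the same fleet assumptions as above (5 log files, 100 instances, 24 × 30 hours, $0.004 per 10,000 HEAD requests):

```ruby
price   = 0.004 / 10_000.0   # dollars per HEAD request
monthly = ->(heads_per_hour) { heads_per_hour * 5 * 100 * 24 * 30 * price }

puts monthly.call(7140).round(2)   # 1028.16  (quadratic probing)
puts monthly.call(120).round(2)    # 17.28    (one HEAD per flush)
```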