S3 cost optimisation: Remember the last index value that was used #160
Comments
The issue with (b) is that if you have multiple instances writing to the same path, you'd still need the collision check.
Oops. I updated (b) to clarify that the HEAD request would still occur for the index we're trying, but it then wouldn't need to make the n-1 earlier requests. If we're up to index n, we needn't check 1, 2, ..., n-1. But fluent-plugin-s3 seems to: see https://github.com/fluent/fluent-plugin-s3/blob/master/lib/fluent/plugin/out_s3.rb#L190 and line 219.
I should also say it was @cnorthwood who spotted this issue. :)
For less sensationalism: a more common case of 20 instances uploading 5 files every 5 minutes costs about $2.20 per month with the O(n^2) behaviour and would cost about $0.34 with O(n).
The hard point of (b) is that the s3 plugin has ... Hmm...
If default includes ..., some users add ... For reducing HEAD cost, one way is adding ...
We also encountered the same problem. The main reason for us to use the s3 plugin was to reduce the cost of storing huge log files. We use fluentd+s3 next to a regular ELK-like solution. But now we are actually paying more for S3 than we would pay to increase our main log storage.
We ran into this issue as well. There have been months where we received literally billions of HEAD requests. It's hard to say exactly how much it has cost us, but based on our historical usage I'd guess around 5,000 dollars†. That's right, Jeff Bezos has gotten so much money off this bug he could buy a used 2005 Mazda Mazda3. (Just kidding, but the point is we should probably warn newcomers or fix the bug unless we want to buy Bezos another car :P )
You could create a variable to store the last log name and lock it behind a mutex so threads can set & access it safely. I'm tempted to try it out but I don't know Ruby... hmm.

There's also a slight problem with the approach in that at the start of the program you would still have to go through all the prefixed S3 items to find the last index. If td-agent is restarted frequently, or if there is a crazy amount of logs in the bucket, this could also lead to high costs. I think it would be better to make ...

As a temporary workaround we have simplified our td-agent config and upgraded to td-agent v4. I'm not even sure how we ran into this problem - the config is set to flush only once every 24 hours, but somehow we flushed about once a minute. The really odd thing is that the rapid flushing happened in a periodic cycle, where it would continue for weeks and then pause for a long time before starting back up again. I'll be monitoring the number of requests to see if our flushing problem (and thus this problem) is solved for us.

† This is a very rough guess, I won't know for sure until I see how much costs come down by.
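A minimal Ruby sketch of that idea, for illustration only (the class, names, and AWS SDK usage here are my own assumptions, not the plugin's actual code): cache the last index per key prefix behind a Mutex so each upload usually needs a single HEAD request instead of re-checking 1..n-1.

```ruby
require 'aws-sdk-s3'

# Hypothetical helper, not part of fluent-plugin-s3: remembers the last index
# used per key prefix so the existence check can resume from there.
class LastIndexCache
  def initialize
    @mutex = Mutex.new
    @last_index = Hash.new(0)
  end

  # Returns the next free index for `prefix`. `exists` is a callable that
  # performs the HEAD request for a candidate index.
  def next_index(prefix, exists)
    @mutex.synchronize do
      i = @last_index[prefix] + 1
      i += 1 while exists.call(i)   # normally one HEAD request, not n
      @last_index[prefix] = i
      i
    end
  end
end

# Usage sketch (bucket and key names are placeholders):
bucket = Aws::S3::Bucket.new('my-log-bucket')
cache  = LastIndexCache.new
prefix = 'foo-2016010109'
idx = cache.next_index(prefix, ->(i) { bucket.object("#{prefix}_#{i}.gz").exists? })
# ...then upload the chunk to "#{prefix}_#{idx}.gz"
```

As noted above, this only helps within one process; multiple workers or instances writing to the same path would still race.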
I think this has a slow-upload issue with multiple threads. If a previous upload takes 20 seconds, other threads wait 20 seconds or more. It causes buffer overflow easily. Older s3 plugin doesn't have ...
The s3 plugin uses a default object key that is problematic in a few ways.

1. It makes HEAD requests for each chunk it uploads, starting from 1 each time. If you have uploaded 2000 log files within the same time slice, it will make 2001 HEAD requests to figure out if it exists. fluent/fluent-plugin-s3#160
2. The above check is not thread-safe, and two threads can race and decide to use the same `%{index}` value, with the loser of the race overwriting the chunk from the winner. fluent/fluent-plugin-s3#326

This is planned to change for v2, but there's no clear path to v2 right now. The plugin does warn already if you use multiple threads and don't use either `%{chunk_id}` or `%{uuid_hash}` in the object key.
As mentioned in a warning, as well as fluent#326 and fluent#160, the process of determining the index added to the default object key is not thread-safe. This adds some thread-safety until version 2.x is out where chunk_id is used instead of an index value. This is not a perfect implementation, since there can still be races between different workers if workers are enabled in fluentd, or if there are multiple fluentd instances uploading to the same bucket. This commit is just to resolve this problem short-term in a way that's backwards compatible. Signed-off-by: William Orr <[email protected]>
see fluent#160 Signed-off-by: caleb15 <[email protected]>
@caleb15 Apart from the warning added - is there a recommended solution to avoid the index checking, which in turn will avoid the HEAD requests? This is the s3_object_key_format we are using: %{path}%{hostname}-%{time_slice}_%{index}.%{file_extension}. We are seeing a high number of HEAD requests. Does #326 locally cache the last index for the file between threads to avoid the issue? Thanks!
For avoiding HEAD request call, set check_object false and use %{uuid_flush} in s3_object_key_format instead of %{index}.
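For example, a configuration along these lines (the match pattern, bucket name, and path are placeholders chosen for illustration) skips the existence check and relies on a unique id in the object key instead:

```
<match app.logs>
  @type s3
  s3_bucket my-log-bucket    # placeholder
  path logs/
  # skip the per-upload HEAD existence check
  check_object false
  # without the check, %{index} could collide, so make the key unique instead
  s3_object_key_format %{path}%{hostname}-%{uuid_flush}-%{time_slice}.%{file_extension}
  time_slice_format %Y%m%d%H
</match>
```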
Thanks @repeatedly - what is the implication of avoiding the HEAD request call? Would it overwrite an existing file, or know from a cache what index it needs to write to? Guessing %{uuid_flush} is the id then picked up from cache.
@formanojhr uuid = universally unique id. Because the id is totally unique, overwriting becomes virtually impossible. https://stackoverflow.com/questions/1155008/how-unique-is-uuid
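For a concrete sense of what a random UUID looks like (illustrative only; this is not necessarily how the plugin generates %{uuid_flush}):

```ruby
require 'securerandom'

# A version-4 UUID carries 122 random bits, so two uploads picking the
# same value is astronomically unlikely.
3.times { puts SecureRandom.uuid }
# e.g. "f3a0c1de-8a4b-4c2e-9d6f-0b1a2c3d4e5f"
```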
@repeatedly I think since you are the dev on this, this might be something you can answer. Once I add check_object false and set s3_object_key_format %{path}%{hostname}-%{uuid_flush}-%{time_slice}_%{index}.%{file_extension} ...
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.
@cosmo0920 can you remove the stale lifecycle and add a bug label please? This is a rare but serious issue.
I agree this seems to be a bug. I have also experienced this default behavior, where we only found out about it after a $20K bill at the end of the month due to the billions of HEAD requests. I switched to check_object=false and used a combo of UUID and other variables to ensure uniqueness and avoid conflicts - of course the downside is the challenge of looking up objects sequentially. I do believe this needs to be treated with higher urgency, or at least the default behavior should be switched so that it is not making billions of HEAD calls. Some may argue this can be a feature, with "suffix" as a variable with options and documentation of the pros and cons of each option.
The problem
Suppose you use the defaults:

- `s3_object_key_format` of `%{path}%{time_slice}_%{index}.%{file_extension}`
- `time_slice_format` of `%Y%m%d%H`

And suppose you flush every 30 seconds. So 120 files per hour.
The first file will check whether `foo-2016010109_1.gz` exists via an S3 HEAD request, see it doesn't exist, and then upload to that filename. The next file will first check whether `foo-2016010109_1.gz` exists via an S3 HEAD request, see it exists, and so increment the index to `foo-2016010109_2.gz`, check whether that exists with an S3 HEAD request, see it doesn't exist, and then upload to that filename.

This will continue. When we get to the final file of the hour (the 120th file), we'll first do 119 HEAD requests!
That's 1+2+...+119 = 7140 S3 requests over the hour. And that's per input log file, per instance.
S3 HEAD requests are "$0.004 per 10,000 requests". So the monthly cost of the above, for 5 log files on 100 instances, amounts to 7140 × 5 × 100 × 24 × 30 × $0.004 / 10,000 ≈ $1028.
More generally, 1+2+...+n is O(n^2) and we can reduce this to O(n).
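To make the quadratic behaviour concrete, here is a simplified sketch of the per-upload probing (my own illustration, not the plugin's exact code): every upload starts probing at index 1, so upload k costs roughly k HEAD requests.

```ruby
require 'aws-sdk-s3'

# Simplified illustration of the current behaviour: each upload probes
# indices starting from 1 until it finds a key that doesn't exist yet.
# Bucket and prefix names are placeholders.
bucket = Aws::S3::Bucket.new('my-log-bucket')
prefix = 'foo-2016010109'

def next_free_index(bucket, prefix)
  i = 1
  i += 1 while bucket.object("#{prefix}_#{i}.gz").exists?  # one HEAD per probe
  i
end

# Over an hour of 120 uploads this performs roughly 1 + 2 + ... + 120
# HEAD requests, i.e. O(n^2) in the number of files per time slice.
idx = next_free_index(bucket, prefix)
# ...upload the chunk to "#{prefix}_#{idx}.gz"
```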
Solutions
(a) The user can modify `time_slice_format` to include `%M`. Or the default could include `%M`.

(b) fluent-plugin-s3 could remember the last index it uploaded to, and so not have to check whether the n-1 earlier files already exist: fluentd would know they do.
If either solution was implemented, we'd have reduced the number of HEAD requests from O(n^2) to O(n). (Technically (a) doesn't reduce the complexity to O(n), it just makes our n tiny.)
So rather than 7140 S3 requests per hour per log file per instance, we'd only do 120.
This reduces the monthly cost from $1028 to $17.
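As a sanity check on those numbers, here is a small Ruby calculation (the scenario parameters are the ones from this issue; the helper itself is just for illustration):

```ruby
HEAD_COST_PER_10K = 0.004  # USD, per the S3 pricing quoted above

# Monthly HEAD-request cost for `files` log files on `instances` instances,
# with `per_hour` uploads per hourly time slice, over a 30-day month.
def monthly_cost(per_hour:, files:, instances:, quadratic:)
  # 1 + 2 + ... + (per_hour - 1), matching the estimate in the issue
  per_slice = quadratic ? (1...per_hour).sum : per_hour
  requests  = per_slice * files * instances * 24 * 30
  requests * HEAD_COST_PER_10K / 10_000
end

# 120 uploads/hour, 5 log files, 100 instances (the example above):
puts monthly_cost(per_hour: 120, files: 5, instances: 100, quadratic: true).round   # => 1028
puts monthly_cost(per_hour: 120, files: 5, instances: 100, quadratic: false).round  # => 17
```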