
process of determining 'index' is not thread safe #326

Closed
tatsu-yam opened this issue May 12, 2020 · 7 comments
@tatsu-yam

tatsu-yam commented May 12, 2020

  • fluentd or td-agent version : td-agent 3.4.1-0 (fluentd 1.4.2)
  • fluent-plugin-s3 version : 1.3.1
  • Problem explanation:

In fluent-plugin-s3, if the value of flush_thread_count is greater than 1, data on S3 goes missing. I believe this is because the Fluent::Plugin::S3Output#write method is not thread safe.

td-agent.conf

<source>
  @type tail
  format ltsv
  keep_time_key true
  read_from_head true
  path /tmp/td-agent-failure-sample/tmp/1500000.log
  pos_file /var/log/td-agent/1500000.log-ng.pos
  tag app.test_log
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

<match app.**>
  @type s3
  @id out_s3

  s3_bucket "#{ENV['S3_BUCKET']}"
  s3_region "#{ENV['S3_REGION']}"

  path "20200512_1647/ng"
  s3_object_key_format "%{path}/%Y%m%d.%{index}.dat.%{file_extension}"

  check_apikey_on_start false
  check_bucket false

  <buffer tag,time>
    @type file
    path /var/log/td-agent/buffer/ng
    timekey 1d
    timekey_wait 10m
    chunk_limit_size 256m
    flush_thread_count 4

    flush_at_shutdown true
    flush_mode interval
    flush_interval 5 # test setting
  </buffer>
  <format>
    @type json
  </format>
</match>

Test data (/tmp/td-agent-failure-sample/tmp/1500000.log) was transferred to S3 with td-agent.
As a result, the record counts of the source data and the data on S3 do not match, and of course there are no errors in td-agent.log.

# source data
$ cat /tmp/td-agent-failure-sample/tmp/1500000.log |wc -l
1500000

# s3
$ aws s3 sync s3://$S3_BUCKET/20200512_1647/ ans/20200512_1647/
$ zcat ans/20200512_1647/ng/* |wc -l
1156632

So I made the following modifications to fluentd and fluent-plugin-s3 and ran the transfer again.

# fluentd code diff
$ git diff
diff --git a/lib/fluent/log.rb b/lib/fluent/log.rb
index 3424aa9..76c09c1 100644
--- a/lib/fluent/log.rb
+++ b/lib/fluent/log.rb
@@ -319,6 +319,7 @@ module Fluent
       return if skipped_type?(type)
       args << block.call if block
       time, msg = event(:info, args)
+      msg = "[#{Thread.current.object_id}] #{msg}"
       puts [@color_info, @formatter.call(type, time, LEVEL_INFO, msg), @color_reset].join
     rescue
     end

# fluent-plugin-s3 code diff
$ git diff
diff --git a/lib/fluent/plugin/out_s3.rb b/lib/fluent/plugin/out_s3.rb
index 8b173c7..9694025 100644
--- a/lib/fluent/plugin/out_s3.rb
+++ b/lib/fluent/plugin/out_s3.rb
@@ -351,6 +351,7 @@ module Fluent::Plugin
             put_options[:metadata][k] = extract_placeholders(v, chunk).gsub(%r(%{[^}]+}), {"%{index}" => sprintf(@index_format, i - 1)})
           end
         end
+        log.info "put object tmp:#{tmp} to s3path:#{s3path}"
         @bucket.object(s3path).put(put_options)

         @values_for_s3_object_chunk.delete(chunk.unique_id)

The resulting td-agent.log.

2020-05-12 16:47:06 +0900 [info]: [70057850679140] gem 'fluentd' version '1.4.2'
2020-05-12 16:47:06 +0900 [info]: [70057850679140] adding match pattern="app.**" type="s3"
2020-05-12 16:47:06 +0900 [info]: [70057850679140] adding source type="tail"
2020-05-12 16:47:06 +0900 [info]: #0 [70057850679140] starting fluentd worker pid=8749 ppid=8744 worker=0
2020-05-12 16:47:06 +0900 [info]: #0 [70057850679140] following tail of /tmp/td-agent-failure-sample/tmp/1500000.log
2020-05-12 16:47:17 +0900 [info]: #0 [70057784455520] [out_s3] put object tmp:#<File:0x00007f6f327c9a68> to s3path:20200512_1647/ng/20200512.0.dat.gz
2020-05-12 16:47:20 +0900 [info]: #0 [70057784455080] [out_s3] put object tmp:#<File:0x00007f6f32786df8> to s3path:20200512_1647/ng/20200512.0.dat.gz
2020-05-12 16:47:25 +0900 [info]: #0 [70057784454700] [out_s3] put object tmp:#<File:0x00007f6f3232ec88> to s3path:20200512_1647/ng/20200512.1.dat.gz
2020-05-12 16:47:30 +0900 [info]: #0 [70057784455940] [out_s3] put object tmp:#<File:0x00007f6f324464b8> to s3path:20200512_1647/ng/20200512.2.dat.gz
2020-05-12 16:47:34 +0900 [info]: #0 [70057850679140] fluentd worker is now running worker=0
2020-05-12 16:47:35 +0900 [info]: #0 [70057784455520] [out_s3] put object tmp:#<File:0x00007f6f324c0f88> to s3path:20200512_1647/ng/20200512.3.dat.gz
2020-05-12 16:47:37 +0900 [info]: #0 [70057784455940] [out_s3] put object tmp:#<File:0x00007f6f32478b48> to s3path:20200512_1647/ng/20200512.4.dat.gz

Two threads uploaded to the same filename, 20200512.0.dat.gz. I think the process of determining %{index} is not thread safe.
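The race can be illustrated with a hypothetical simplification of the plugin's key-selection loop: each writer scans for the first %{index} whose object does not yet exist, but nothing reserves the chosen key before the upload. The names here (ExistingKeys, next_key) are invented for this sketch, not the plugin's actual identifiers.

```ruby
require "set"

# Stand-in for the bucket's current contents (a HEAD request per probe in reality).
ExistingKeys = Set.new

def next_key(prefix)
  i = 0
  i += 1 while ExistingKeys.include?("#{prefix}.#{i}.dat.gz")
  "#{prefix}.#{i}.dat.gz"      # chosen key; nothing reserves it yet
end

# Two flush threads interleave: both run the check before either uploads.
key_a = next_key("20200512")   # thread A decides on 20200512.0.dat.gz
key_b = next_key("20200512")   # thread B decides on the same key
ExistingKeys << key_a          # A uploads
ExistingKeys << key_b          # B uploads, overwriting A's object
key_a == key_b                 # => true: the loser of the race clobbers the winner
```

This is a classic check-then-act race: the existence check and the PUT are not atomic, so any interleaving between them lets two threads claim the same index.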

I think using uuid_flush will probably work around this problem. However, since the default value of s3_object_key_format is "%{path}%{time_slice}_%{index}.%{file_extension}", I think this will affect a lot of users.

I think this issue is relevant.
#315

@repeatedly
Member

Yes. This problem is similar to the process-wide conflict.
It happens when S3 processing is slower than the next chunk flush.
To avoid this problem, we could show a warning message like "Buffer configuration uses multiple flush threads. Recommend to use chunk_id or uuid_flush in object path to avoid object conflict".
How about this?
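The warning proposed above could be emitted at configure time. This is only a sketch of the idea; the helper name warn_if_conflicting and its parameters are assumptions, not the plugin's actual code.

```ruby
require "logger"

# Warn when multiple flush threads are combined with an object key format
# that can produce colliding keys (i.e. one without a per-flush-unique
# placeholder). Helper name and signature are invented for this sketch.
def warn_if_conflicting(key_format, flush_thread_count, log)
  multiple_threads = flush_thread_count > 1
  conflict_free    = key_format.include?("%{chunk_id}") ||
                     key_format.include?("%{uuid_flush}")
  return unless multiple_threads && !conflict_free

  log.warn "Buffer configuration uses multiple flush threads. " \
           "Recommend to use chunk_id or uuid_flush in object path " \
           "to avoid object conflict"
end

# The default key format plus flush_thread_count 4 triggers the warning:
warn_if_conflicting("%{path}%{time_slice}_%{index}.%{file_extension}", 4,
                    Logger.new($stdout))
```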

@withgod

withgod commented May 13, 2020

I encountered this bug and consulted with tatsu-yama.
I think it would be better to change the default to %{uuid_flush} or %{chunk_flush}, and to document the risk of using %{index} in s3_object_key_format.
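As a workaround with the plugin as it is today, the match section from the report above can switch to a per-flush-unique placeholder. A sketch, reusing the bucket and path from the original report (note that %{uuid_flush} requires the uuidtools gem to be installed):

```
<match app.**>
  @type s3
  @id out_s3

  s3_bucket "#{ENV['S3_BUCKET']}"
  s3_region "#{ENV['S3_REGION']}"

  path "20200512_1647/ng"
  # %{uuid_flush} is unique per flush, so concurrent flush threads
  # cannot pick the same object key the way %{index} can
  s3_object_key_format "%{path}/%Y%m%d.%{uuid_flush}.dat.%{file_extension}"
</match>
```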

@repeatedly
Member

Changing the default value affects existing users, so we need two steps:

  1. Add warning for multiple threads and release it in v1
  2. Release v2 with default change

@withgod

withgod commented May 14, 2020

I certainly forgot about the impact on existing users ;)

repeatedly added a commit that referenced this issue May 15, 2020
@repeatedly
Member

Patch for #327

repeatedly added a commit that referenced this issue May 18, 2020
Add warning for object conflict case. ref #326
worr added a commit to worr/logging-operator that referenced this issue Oct 22, 2020
The s3 plugin uses a default object key that is problematic in a few ways.

1. It makes HEAD requests for each chunk it uploads, starting from 1 each time.
If you have uploaded 2000 log files within the same time slice, it will make
2001 HEAD requests to figure out if it exists.

fluent/fluent-plugin-s3#160

2. The above check is not thread-safe, and two threads can race and decide to
use the same `%{index}` value, with the loser of the race overwriting the chunk
from the winner.

fluent/fluent-plugin-s3#326

This is planned to change for v2, but there's no clear path to v2 right now.
The plugin does warn already if you use multiple threads and don't use either
`%{chunk_id}` or `%{uuid_hash}` in the object key.
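The cost described in point 1 can be made concrete: because the index scan restarts from zero for every chunk, uploading N objects into one time slice issues on the order of N²/2 existence checks in total. A small simulation, with head_calls and upload as names invented for this sketch:

```ruby
head_calls = 0   # counts simulated HEAD requests
existing   = []  # keys already present in the "bucket"

upload = lambda do |prefix|
  i = 0
  loop do
    key = "#{prefix}.#{i}"
    head_calls += 1             # one HEAD request per existence probe
    unless existing.include?(key)
      existing << key           # first free index found; upload goes here
      break
    end
    i += 1
  end
end

3.times { upload.call("20200512") }
# head_calls is now 1 + 2 + 3 = 6; at 2000 chunks the last upload alone
# needs 2001 probes, roughly 2,000,000 HEAD requests cumulatively
```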
worr added a commit to worr/fluent-plugin-s3 that referenced this issue Oct 23, 2020
As mentioned in a warning, as well as fluent#326
and fluent#160, the process of
determining the index added to the default object key is not thread-safe. This
adds some thread-safety until version 2.x is out where chunk_id is used
instead of an index value.

This is not a perfect implementation, since there can still be races between
different workers if workers are enabled in fluentd, or if there are multiple
fluentd instances uploading to the same bucket. This commit is just to resolve
this problem short-term in a way that's backwards compatible.
Signed-off-by: William Orr <[email protected]>
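The mutex-based approach the commit message describes can be sketched roughly as follows: serialize the index decision behind a per-process lock and reserve the key before releasing it, so two flush threads can never choose the same %{index}. INDEX_LOCK, CLAIMED, and claim_index are names invented here, not the plugin's actual identifiers.

```ruby
require "set"

INDEX_LOCK = Mutex.new
CLAIMED    = Set.new   # keys reserved by this process

def claim_index(prefix)
  INDEX_LOCK.synchronize do
    i = 0
    i += 1 while CLAIMED.include?("#{prefix}.#{i}")
    key = "#{prefix}.#{i}"
    CLAIMED << key     # reserve the key before releasing the lock
    key
  end
end

# Four concurrent flush threads each claim an index.
keys = 4.times.map { Thread.new { claim_index("20200512") } }.map(&:value)
# keys.uniq.size == 4: every thread received a distinct key
```

As the commit message itself notes, a per-process Mutex cannot protect against races between separate fluentd workers or instances writing to the same bucket; it only closes the window between flush threads in one process.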
@github-actions

github-actions bot commented Jul 6, 2021

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 6, 2021
@github-actions

github-actions bot commented Aug 5, 2021

This issue was automatically closed because it remained stale for 30 days.
