
process of determining 'index' is not thread safe #326

Closed
tatsu-yam opened this issue May 12, 2020 · 7 comments
@tatsu-yam

tatsu-yam commented May 12, 2020

  • fluentd or td-agent version : td-agent 3.4.1-0 (fluentd 1.4.2)
  • fluent-plugin-s3 version : 1.3.1
  • Problem explanation:

In fluent-plugin-s3, if the value of flush_thread_count is greater than 1, data on S3 goes missing. I believe this is because the Fluent::Plugin::S3Output#write method is not thread safe.

td-agent.conf

<source>
  @type tail
  format ltsv
  keep_time_key true
  read_from_head true
  path /tmp/td-agent-failure-sample/tmp/1500000.log
  pos_file /var/log/td-agent/1500000.log-ng.pos
  tag app.test_log
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

<match app.**>
  @type s3
  @id out_s3

  s3_bucket "#{ENV['S3_BUCKET']}"
  s3_region "#{ENV['S3_REGION']}"

  path "20200512_1647/ng"
  s3_object_key_format "%{path}/%Y%m%d.%{index}.dat.%{file_extension}"

  check_apikey_on_start false
  check_bucket false

  <buffer tag,time>
    @type file
    path /var/log/td-agent/buffer/ng
    timekey 1d
    timekey_wait 10m
    chunk_limit_size 256m
    flush_thread_count 4

    flush_at_shutdown true
    flush_mode interval
    flush_interval 5 # test setting
  </buffer>
  <format>
    @type json
  </format>
</match>

Test data (/tmp/td-agent-failure-sample/tmp/1500000.log) was transferred to S3 with td-agent.
As a result, the record counts of the source data and the data on S3 do not match, and of course there are no errors in td-agent.log.

# source data
$ cat /tmp/td-agent-failure-sample/tmp/1500000.log |wc -l
1500000

# s3
$ aws s3 sync s3://$S3_BUCKET/20200512_1647/ ans/20200512_1647/
$ zcat ans/20200512_1647/ng/* |wc -l
1156632

So I made the following modifications to fluentd and fluent-plugin-s3 and ran the transfer again.

# fluentd code diff
$ git diff
diff --git a/lib/fluent/log.rb b/lib/fluent/log.rb
index 3424aa9..76c09c1 100644
--- a/lib/fluent/log.rb
+++ b/lib/fluent/log.rb
@@ -319,6 +319,7 @@ module Fluent
       return if skipped_type?(type)
       args << block.call if block
       time, msg = event(:info, args)
+      msg = "[#{Thread.current.object_id}] #{msg}"
       puts [@color_info, @formatter.call(type, time, LEVEL_INFO, msg), @color_reset].join
     rescue
     end

# fluent-plugin-s3 code diff
$ git diff
diff --git a/lib/fluent/plugin/out_s3.rb b/lib/fluent/plugin/out_s3.rb
index 8b173c7..9694025 100644
--- a/lib/fluent/plugin/out_s3.rb
+++ b/lib/fluent/plugin/out_s3.rb
@@ -351,6 +351,7 @@ module Fluent::Plugin
             put_options[:metadata][k] = extract_placeholders(v, chunk).gsub(%r(%{[^}]+}), {"%{index}" => sprintf(@index_format, i - 1)})
           end
         end
+        log.info "put object tmp:#{tmp} to s3path:#{s3path}"
         @bucket.object(s3path).put(put_options)

         @values_for_s3_object_chunk.delete(chunk.unique_id)

The resulting td-agent.log.

2020-05-12 16:47:06 +0900 [info]: [70057850679140] gem 'fluentd' version '1.4.2'
2020-05-12 16:47:06 +0900 [info]: [70057850679140] adding match pattern="app.**" type="s3"
2020-05-12 16:47:06 +0900 [info]: [70057850679140] adding source type="tail"
2020-05-12 16:47:06 +0900 [info]: #0 [70057850679140] starting fluentd worker pid=8749 ppid=8744 worker=0
2020-05-12 16:47:06 +0900 [info]: #0 [70057850679140] following tail of /tmp/td-agent-failure-sample/tmp/1500000.log
2020-05-12 16:47:17 +0900 [info]: #0 [70057784455520] [out_s3] put object tmp:#<File:0x00007f6f327c9a68> to s3path:20200512_1647/ng/20200512.0.dat.gz
2020-05-12 16:47:20 +0900 [info]: #0 [70057784455080] [out_s3] put object tmp:#<File:0x00007f6f32786df8> to s3path:20200512_1647/ng/20200512.0.dat.gz
2020-05-12 16:47:25 +0900 [info]: #0 [70057784454700] [out_s3] put object tmp:#<File:0x00007f6f3232ec88> to s3path:20200512_1647/ng/20200512.1.dat.gz
2020-05-12 16:47:30 +0900 [info]: #0 [70057784455940] [out_s3] put object tmp:#<File:0x00007f6f324464b8> to s3path:20200512_1647/ng/20200512.2.dat.gz
2020-05-12 16:47:34 +0900 [info]: #0 [70057850679140] fluentd worker is now running worker=0
2020-05-12 16:47:35 +0900 [info]: #0 [70057784455520] [out_s3] put object tmp:#<File:0x00007f6f324c0f88> to s3path:20200512_1647/ng/20200512.3.dat.gz
2020-05-12 16:47:37 +0900 [info]: #0 [70057784455940] [out_s3] put object tmp:#<File:0x00007f6f32478b48> to s3path:20200512_1647/ng/20200512.4.dat.gz

Two threads uploaded to the same filename, 20200512.0.dat.gz. I think the process of determining %{index} is not thread safe.
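The race can be illustrated with a hypothetical simplification of the plugin's key-selection loop: each writer scans for the first %{index} whose object does not yet exist, but nothing reserves the chosen key before the upload. The names here (ExistingKeys, next_key) are invented for this sketch, not the plugin's actual identifiers.

```ruby
require "set"

# Stand-in for the bucket's current contents (a HEAD request per probe in reality).
ExistingKeys = Set.new

def next_key(prefix)
  i = 0
  i += 1 while ExistingKeys.include?("#{prefix}.#{i}.dat.gz")
  "#{prefix}.#{i}.dat.gz"      # chosen key; nothing reserves it yet
end

# Two flush threads interleave: both run the check before either uploads.
key_a = next_key("20200512")   # thread A decides on 20200512.0.dat.gz
key_b = next_key("20200512")   # thread B decides on the same key
ExistingKeys << key_a          # A uploads
ExistingKeys << key_b          # B uploads, overwriting A's object
key_a == key_b                 # => true: the loser of the race clobbers the winner
```

This is a classic check-then-act race: the existence check and the PUT are not atomic, so any interleaving between them lets two threads claim the same index.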

I think using uuid_flush will probably work around this problem. However, since the default value of s3_object_key_format is "%{path}%{time_slice}_%{index}.%{file_extension}", I think this will affect a lot of users.

I think this issue is relevant.
#315

@repeatedly
Member

Yes. This problem is similar to the process-wide conflict.
It happens when S3 processing is slower than the next chunk flush.
To avoid this problem, we could show a warning message like "Buffer configuration uses multiple flush threads. Recommend to use chunk_id or uuid_flush in object path to avoid object conflict".
How about this?
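The warning proposed above could be emitted at configure time. This is only a sketch of the idea; the helper name warn_if_conflicting and its parameters are assumptions, not the plugin's actual code.

```ruby
require "logger"

# Warn when multiple flush threads are combined with an object key format
# that can produce colliding keys (i.e. one without a per-flush-unique
# placeholder). Helper name and signature are invented for this sketch.
def warn_if_conflicting(key_format, flush_thread_count, log)
  multiple_threads = flush_thread_count > 1
  conflict_free    = key_format.include?("%{chunk_id}") ||
                     key_format.include?("%{uuid_flush}")
  return unless multiple_threads && !conflict_free

  log.warn "Buffer configuration uses multiple flush threads. " \
           "Recommend to use chunk_id or uuid_flush in object path " \
           "to avoid object conflict"
end

# The default key format plus flush_thread_count 4 triggers the warning:
warn_if_conflicting("%{path}%{time_slice}_%{index}.%{file_extension}", 4,
                    Logger.new($stdout))
```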

@withgod

withgod commented May 13, 2020

I encountered this bug and consulted with tatsu-yama.
I think it would be better to change the default to %{uuid_flush} or %{chunk_flush}, and to document the risk of using %{index} in s3_object_key_format.
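As a workaround with the plugin as it is today, the match section from the report above can switch to a per-flush-unique placeholder. A sketch, reusing the bucket and path from the original report (note that %{uuid_flush} requires the uuidtools gem to be installed):

```
<match app.**>
  @type s3
  @id out_s3

  s3_bucket "#{ENV['S3_BUCKET']}"
  s3_region "#{ENV['S3_REGION']}"

  path "20200512_1647/ng"
  # %{uuid_flush} is unique per flush, so concurrent flush threads
  # cannot pick the same object key the way %{index} can
  s3_object_key_format "%{path}/%Y%m%d.%{uuid_flush}.dat.%{file_extension}"
</match>
```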

@repeatedly
Member

Changing the default value affects existing users, so we need two steps:

  1. Add warning for multiple threads and release it in v1
  2. Release v2 with default change

@withgod

withgod commented May 14, 2020

I certainly forgot about the impact on existing users ;)

repeatedly added a commit that referenced this issue May 15, 2020
@repeatedly
Member

Patch for #327

repeatedly added a commit that referenced this issue May 18, 2020
Add warning for object conflict case. ref #326
worr added a commit to worr/logging-operator that referenced this issue Oct 22, 2020
The s3 plugin uses a default object key that is problematic in a few ways.

1. It makes HEAD requests for each chunk it uploads, starting from 1 each time.
If you have uploaded 2000 log files within the same time slice, it will make
2001 HEAD requests to figure out if it exists.

fluent/fluent-plugin-s3#160

2. The above check is not thread-safe, and two threads can race and decide to
use the same `%{index}` value, with the loser of the race overwriting the chunk
from the winner.

fluent/fluent-plugin-s3#326

This is planned to change for v2, but there's no clear path to v2 right now.
The plugin does warn already if you use multiple threads and don't use either
`%{chunk_id}` or `%{uuid_hash}` in the object key.
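The cost described in point 1 can be made concrete: because the index scan restarts from zero for every chunk, uploading N objects into one time slice issues on the order of N²/2 existence checks in total. A small simulation, with head_calls and upload as names invented for this sketch:

```ruby
head_calls = 0   # counts simulated HEAD requests
existing   = []  # keys already present in the "bucket"

upload = lambda do |prefix|
  i = 0
  loop do
    key = "#{prefix}.#{i}"
    head_calls += 1             # one HEAD request per existence probe
    unless existing.include?(key)
      existing << key           # first free index found; upload goes here
      break
    end
    i += 1
  end
end

3.times { upload.call("20200512") }
# head_calls is now 1 + 2 + 3 = 6; at 2000 chunks the last upload alone
# needs 2001 probes, roughly 2,000,000 HEAD requests cumulatively
```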
worr added a commit to worr/fluent-plugin-s3 that referenced this issue Oct 23, 2020
As mentioned in a warning, as well as fluent#326
and fluent#160, the process of
determining the index added to the default object key is not thread-safe. This
adds some thread-safety until version 2.x is out where chunk_id is used
instead of an index value.

This is not a perfect implementation, since there can still be races between
different workers if workers are enabled in fluentd, or if there are multiple
fluentd instances uploading to the same bucket. This commit is just to resolve
this problem short-term in a way that's backwards compatible.
Signed-off-by: William Orr <[email protected]>
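The mutex-based approach the commit message describes can be sketched roughly as follows: serialize the index decision behind a per-process lock and reserve the key before releasing it, so two flush threads can never choose the same %{index}. INDEX_LOCK, CLAIMED, and claim_index are names invented here, not the plugin's actual identifiers.

```ruby
require "set"

INDEX_LOCK = Mutex.new
CLAIMED    = Set.new   # keys reserved by this process

def claim_index(prefix)
  INDEX_LOCK.synchronize do
    i = 0
    i += 1 while CLAIMED.include?("#{prefix}.#{i}")
    key = "#{prefix}.#{i}"
    CLAIMED << key     # reserve the key before releasing the lock
    key
  end
end

# Four concurrent flush threads each claim an index.
keys = 4.times.map { Thread.new { claim_index("20200512") } }.map(&:value)
# keys.uniq.size == 4: every thread received a distinct key
```

As the commit message itself notes, a per-process Mutex cannot protect against races between separate fluentd workers or instances writing to the same bucket; it only closes the window between flush threads in one process.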
@github-actions

github-actions bot commented Jul 6, 2021

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 6, 2021
@github-actions

github-actions bot commented Aug 5, 2021

This issue was automatically closed because it remained stale for 30 days.
