Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always lock chunks first to avoid deadlock #1721

Merged
merged 4 commits into from
Oct 23, 2017

Conversation

abicky
Copy link
Contributor

@abicky abicky commented Oct 20, 2017

This PR resolves #1549.

For example, deadlock occurs by the following steps before this commit:

  1. input plugin thread receives multiple events
    (metadata_and_data.size > 1)
  2. input plugin thread processes the first metadata_and_data element
    and aquires the lock of chunk1 (chunk.mon_exit in Buffer#write)
  3. enqueue thread aquires the lock of buffer
    (synchronize in Buffer#enqueue_chunk)
  4. enqueue thread tries to aquire the lock of chunk1
    (chunk.synchronize in Buffer#enqueue_chunk)
  5. input plugin thread processes the second metadata_and_data element
    and tries to aquire the lock of buffer (synchronize in Buffer#write_once)

cf. #1549 (comment)

All chunk classes which inherit Chunk have enqueued! method.
@abicky
Copy link
Contributor Author

abicky commented Oct 20, 2017

@repeatedly Could you review the change? I want to try the latest version (v0.14.22.rc2?) which includes this change in our production environment if it looks good to you.

@repeatedly repeatedly self-requested a review October 20, 2017 07:46
@repeatedly repeatedly self-assigned this Oct 20, 2017
@repeatedly repeatedly added bug Something isn't working v0.14 labels Oct 20, 2017
@repeatedly
Copy link
Member

Thanks! I will review it!

@abicky abicky changed the title Always lock chunks first to avoid deadlock [WIP] Always lock chunks first to avoid deadlock Oct 20, 2017
@abicky
Copy link
Contributor Author

abicky commented Oct 20, 2017

I found there were errors like unexpected error while checking flushed chunks. ignored. error_class=RuntimeError error="can't enqueue buffer file: path = /fluentd/log/..., error = 'closed stream'" and I'll investigate the cause 🙇

This commit resolves fluent#1549.

For example, deadlock occurs by the following steps before this commit:

1. input plugin thread receives multiple events
  (metadata_and_data.size > 1)
2. input plugin thread processes the first metadata_and_data element
  and aquires the lock of chunk1 (chunk.mon_exit in Buffer#write)
3. enqueue thread aquires the lock of buffer
  (synchronize in Buffer#enqueue_chunk)
4. enqueue thread tries to aquire the lock of buffer
  (chunk.synchronize in Buffer#enqueue_chunk)
5. input plugin thread processes the second metadata_and_data element
  and tries to aquire the lock of buffer (synchronize in Buffer#write_once)
@abicky abicky force-pushed the hotfix/lock-chunks-first branch from 93088bf to 723038c Compare October 20, 2017 13:38
In flush threads @buffer could call destructive
methods of chunks in @Queue without chunk lock,
so buffer lock is required.
@abicky abicky force-pushed the hotfix/lock-chunks-first branch from 723038c to 1a6f16b Compare October 21, 2017 06:54
@abicky abicky changed the title [WIP] Always lock chunks first to avoid deadlock Always lock chunks first to avoid deadlock Oct 21, 2017
@abicky
Copy link
Contributor Author

abicky commented Oct 21, 2017

I fixed "closed stream" error in 1a6f16b.

@repeatedly Please review. I've checked the behavior in our development environment and it works fine. I want your review before I try it in our production environment.
Some tests failed in Travis CI, but these tests seem to sometimes fails also in master branch, aren't they?

end
if block_given?
synchronize{ @stage.keys }.each do |metadata|
chunk = @stage[metadata]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need synchronized guard here?

Copy link
Member

@repeatedly repeatedly Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... deleting metadata from @stage happens in only enqueue_chunk so it seems safe in CRuby.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think data race doesn't occur in CRuby thanks to GIL and synchronized is not necessary, but I'll change the code like below for other Ruby implementations:

diff --git a/lib/fluent/plugin/buffer.rb b/lib/fluent/plugin/buffer.rb
index 337b3b79..61a904ed 100644
--- a/lib/fluent/plugin/buffer.rb
+++ b/lib/fluent/plugin/buffer.rb
@@ -351,11 +351,15 @@ module Fluent
       def enqueue_all
         log.trace "enqueueing all chunks in buffer", instance: self.object_id
         if block_given?
-          synchronize{ @stage.keys }.each do |metadata|
-            chunk = @stage[metadata]
-            next unless chunk
-            v = yield metadata, chunk
-            enqueue_chunk(metadata) if v
+          metadata_array = []
+          synchronize do
+            @stage.each do |metadata, chunk|
+              v = yield metadata, chunk
+              metadata_array << metadata if v
+            end
+          end
+          metadata_array.each do |metadata|
+            enqueue_chunk(metadata)
           end
         else
           synchronize{ @stage.keys }.each do |metadata|

Copy link
Member

@repeatedly repeatedly Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and currently fluentd focus on CRuby so I think code change is not needed for now.
But adding comment is better for the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okey, I added a comment 2261428

@repeatedly
Copy link
Member

I re-run failed tests and all tests are passed.
So your patch doesn't break existing expected behaviour.

abicky added a commit to abicky/fluentd that referenced this pull request Oct 23, 2017
@repeatedly repeatedly merged commit 0848224 into fluent:master Oct 23, 2017
@abicky abicky deleted the hotfix/lock-chunks-first branch October 24, 2017 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v0.14
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fluentd 0.14.11 deadlocked in file output
2 participants