-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make stage_size & stage computation thread safe #2734
Conversation
bf1c94c
to
4383919
Compare
Signed-off-by: Harish Nelakurthi <[email protected]>
4383919
to
a1731f1
Compare
Thanks!. We will check issue and patch later. |
Signed-off-by: Harish Nelakurthi <[email protected]>
64a72d5
to
065751f
Compare
@repeatedly Can we get this merged into the upcoming release if it looks good as this is causing a critical issue of data loss. We have tested this in large scale environment and the fix seems to work. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I added a comment.
# but this block **may** run multiple times from write_step_by_step and previous write may be rollbacked | ||
# So we should be counting the stage_size only for the last successful write | ||
# | ||
staged_bytesizes_by_chunk[chunk] = adding_bytesize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following situation can happen. what do you think of this?
staged_bytesizes_by_chunk
stores chunk1- rollback happens in write_step_by_step (ShouldRetry raises)
- another thread euqueues chunk1 before this thread enters into
fluentd/lib/fluent/plugin/buffer.rb
Line 673 in 065751f
chunk = get_next_chunk.call - create new chunk(chunk2) in
fluentd/lib/fluent/plugin/buffer.rb
Line 663 in 065751f
synchronize{ @stage[metadata] ||= generate_chunk(metadata).staged! } staged_bytesizes_by_chunk
stores chunk2 (still remtains chunk1 instaged_bytesizes_by_chunk
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ganmacs Steps (3) can't happen because enqueue_chunk
can't get a lock and waits until the lock is released by the thread which is writing to the chunk.
The lock is acquired here: https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/buffer.rb#L279
And released here:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/buffer.rb#L319
So enqueue_chunk
can happen only after the commit is done.
Hope this helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. You're right. other thread can get the chunk but can't enter
fluentd/lib/fluent/plugin/buffer.rb
Line 674 in 065751f
chunk.synchronize do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
# but this block **may** run multiple times from write_step_by_step and previous write may be rollbacked | ||
# So we should be counting the stage_size only for the last successful write | ||
# | ||
staged_bytesizes_by_chunk[chunk] = adding_bytesize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. You're right. other thread can get the chunk but can't enter
fluentd/lib/fluent/plugin/buffer.rb
Line 674 in 065751f
chunk.synchronize do |
Thanks! |
Which issue(s) this PR fixes:
#2712
Fixes #
What this PR does / why we need it:
When writing chunks to buffer, the
stage_size
computation can be thread-unsafe as mentioned in the bug which can result inBufferOverflowErrors
even though the buffer is not actually full. This happened on a live long running fluentd. The methodbuffer.write
is supposed to write to a chunk atmost once and hencestage_size
should be added for a chunk only once but due towrite_step_by_step
the block which updates the value ofstaged_bytesize
may be called multiple times and the chunk can be rollbacked but the value ofstaged_bytesize
is not reverted causing thestage_size
to be more than the actual value.This PR addresses this issue.
Docs Changes:
Release Note: