-
Notifications
You must be signed in to change notification settings - Fork 4k
pkg/util/log: Implement handling for oversized log messages in bufferedSink #157964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
pkg/util/log: Implement handling for oversized log messages in bufferedSink #157964
Conversation
476e3ad to
75f2048
Compare
pkg/util/log/buffered_sink_test.go
Outdated
| // Give a moment for any async warning to be logged. | ||
| time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended to give time for the mock.DO() function to be called? Instead, can we put a boolean flag into that closure and wait for it to be true?:
var receivedError error
...
mock.
...
DO(func() { ... receivedError = true... }
..
testutils.SucceedsSoon(t, func() error {
if !receivedError {
return errors.New("Waiting for dropped oversized error log")
}
return nil
})
I think this is less likely to cause a flake, compared to using time.Sleep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the test
aa-joshi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, and @dhartunian)
-- commits line 4 at r1:
Lets add details in depth about the issue and the resolution.
pkg/util/log/buffered_sink.go line 246 at r1 (raw file):
// Set flag to log warning after releasing the lock (to avoid recursion/deadlock) logOversizedWarning = true maxSize = bs.mu.buf.maxSizeBytes
Is there any specific reason that we are capturing value as part of maxSize? Wouldn't this value available during logging time?
pkg/util/log/buffered_sink.go line 303 at r1 (raw file):
// Log the oversized message warning after releasing the lock to avoid deadlock. // This warning will be logged through the normal logging system and will appear
What do you mean by normal logging system?
75f2048 to
7cc4cb8
Compare
Abhinav1299
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @dhartunian, and @kyle-a-wong)
Previously, aa-joshi (Akshay Joshi) wrote…
Lets add details in depth about the issue and the resolution.
updated
pkg/util/log/buffered_sink.go line 246 at r1 (raw file):
Previously, aa-joshi (Akshay Joshi) wrote…
Is there any specific reason that we are capturing value as part of maxSize? Wouldn't this value available during logging time?
We can't access bs.mu.buf.maxSizeBytes at logging time in Ops.Warningf() because we're outside the lock, and accessing mutex-protected data without the lock will lead to a race condition. Thats the reason we capture the buffer size in maxSize while we still safely hold the lock.
pkg/util/log/buffered_sink.go line 303 at r1 (raw file):
Previously, aa-joshi (Akshay Joshi) wrote…
What do you mean by normal logging system?
By normal logging I mean this log will route through OPS channel and will appear in log files file.
I'll update the wordings here.
pkg/util/log/buffered_sink_test.go
Outdated
| // Give a moment for any async warning to be logged. | ||
| time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the test
7cc4cb8 to
2360b35
Compare
…edSink Previously, when a single log message exceeded the configured max-buffer-size for a buffered sink with exit-on-error enabled, the error would propagate up and trigger process termination. This was overly aggressive for what amounts to a logging configuration issue - a single oversized SQL query (e.g., with a multi-megabyte string literal) could crash an entire CockroachDB node. This commit modifies bufferedSink.output() to detect the errMsgTooLarge error and handle it gracefully. When an oversized message is encountered, instead of propagating the error, we drop the message and log a warning via Ops.Warningf() indicating that the message exceeded the buffer size limit. This allows the node to continue operating normally while still providing visibility into the issue through logged warnings. The implementation uses a two-phase approach to avoid deadlock: first, while holding the sink's mutex, we detect the oversized message and set a flag with the relevant information; then, after releasing the lock, we emit the warning. This is necessary because calling Ops.Warningf() while holding the mutex would cause the warning message to attempt re-entry into the same sink, resulting in a deadlock when it tries to acquire the already-held lock. Part of: CRDB-53951 Epic: CRDB-56325 Release note: None
2360b35 to
003b9b2
Compare
kyle-a-wong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
IMO we should backport this to all versions, using backport-all label.
Previously, when a single log message exceeded the configured max-buffer-size
for a buffered sink with exit-on-error enabled, the error would propagate up
and trigger process termination. This was overly aggressive for what amounts
to a logging configuration issue - a single oversized SQL query (e.g., with a
multi-megabyte string literal) could crash an entire CockroachDB node.
This commit modifies bufferedSink.output() to detect the errMsgTooLarge error
and handle it gracefully. When an oversized message is encountered, instead of
propagating the error, we drop the message and log a warning via Ops.Warningf()
indicating that the message exceeded the buffer size limit. This allows the
node to continue operating normally while still providing visibility into the
issue through logged warnings.
The implementation uses a two-phase approach to avoid deadlock: first, while
holding the sink's mutex, we detect the oversized message and set a flag with
the relevant information; then, after releasing the lock, we emit the warning.
This is necessary because calling Ops.Warningf() while holding the mutex would
cause the warning message to attempt re-entry into the same sink, resulting in
a deadlock when it tries to acquire the already-held lock.
This resolves #152635
Part of: CRDB-53951
Epic: CRDB-56325
Release note: None