
Conversation

@Abhinav1299
Contributor

@Abhinav1299 Abhinav1299 commented Nov 18, 2025

Previously, when a single log message exceeded the configured max-buffer-size
for a buffered sink with exit-on-error enabled, the error would propagate up
and trigger process termination. This was overly aggressive for what amounts
to a logging configuration issue - a single oversized SQL query (e.g., with a
multi-megabyte string literal) could crash an entire CockroachDB node.

This commit modifies bufferedSink.output() to detect the errMsgTooLarge error
and handle it gracefully. When an oversized message is encountered, instead of
propagating the error, we drop the message and log a warning via Ops.Warningf()
indicating that the message exceeded the buffer size limit. This allows the
node to continue operating normally while still providing visibility into the
issue through logged warnings.

The implementation uses a two-phase approach to avoid deadlock: first, while
holding the sink's mutex, we detect the oversized message and set a flag with
the relevant information; then, after releasing the lock, we emit the warning.
This is necessary because calling Ops.Warningf() while holding the mutex would
cause the warning message to attempt re-entry into the same sink, resulting in
a deadlock when it tries to acquire the already-held lock.
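In outline, the pattern looks roughly like the following. This is a condensed, self-contained sketch, not the actual diff: the sink type, its fields, and the Printf stand-in for Ops.Warningf below are illustrative only.

package main

import (
    "errors"
    "fmt"
    "sync"
)

// errMsgTooLarge mirrors the sentinel error returned when a single message
// exceeds the buffer's size limit.
var errMsgTooLarge = errors.New("message dropped because it is too large")

// sink is a stand-in for bufferedSink: a mutex-protected buffer with a cap.
type sink struct {
    mu struct {
        sync.Mutex
        maxSizeBytes int
    }
}

// output drops oversized messages instead of returning errMsgTooLarge, and
// defers the warning until after the mutex is released.
func (s *sink) output(msg []byte) error {
    var logOversizedWarning bool
    var maxSize int

    s.mu.Lock()
    var err error
    if len(msg) > s.mu.maxSizeBytes {
        err = errMsgTooLarge
    }
    if errors.Is(err, errMsgTooLarge) {
        // Phase 1 (under the lock): swallow the error so it never reaches the
        // exit-on-error path, and copy what the warning needs while the mutex
        // still guards the fields.
        logOversizedWarning = true
        maxSize = s.mu.maxSizeBytes
        err = nil
    }
    s.mu.Unlock()

    // Phase 2 (after unlocking): emit the warning. In the real change this is
    // Ops.Warningf, which routes back through the logging system and could
    // re-enter this sink, so it must not run while the mutex is held.
    if logOversizedWarning {
        fmt.Printf("dropping oversized log message (%d bytes > limit of %d bytes)\n",
            len(msg), maxSize)
    }
    return err
}

func main() {
    s := &sink{}
    s.mu.maxSizeBytes = 8
    _ = s.output([]byte("this message is longer than eight bytes"))
}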

This resolves #152635

Part of: CRDB-53951
Epic: CRDB-56325
Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@Abhinav1299 Abhinav1299 marked this pull request as ready for review November 18, 2025 06:26
@Abhinav1299 Abhinav1299 requested review from a team as code owners November 18, 2025 06:26
@Abhinav1299 Abhinav1299 requested review from aa-joshi, arjunmahishi, dhartunian and kyle-a-wong and removed request for a team November 18, 2025 06:26
@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 476e3ad to 75f2048 on November 18, 2025 06:42
Comment on lines 673 to 674
// Give a moment for any async warning to be logged.
time.Sleep(100 * time.Millisecond)
Contributor


Is this intended to give time for the mock.Do() function to be called? Instead, can we put a boolean flag into that closure and wait for it to be true? For example:

var receivedError bool
...
mock.
   ...
   Do(func() { ... receivedError = true ... })
...
testutils.SucceedsSoon(t, func() error {
  if !receivedError {
    return errors.New("waiting for dropped oversized error log")
  }
  return nil
})

I think this is less likely to cause a flake than using time.Sleep.
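Roughly, as a self-contained sketch of the same polling idea (the succeedsSoon helper below is only a stand-in for testutils.SucceedsSoon, and the goroutine stands in for the mocked sink call; since the warning is emitted asynchronously, an atomic flag avoids a data race on the signal):

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
    "time"
)

// succeedsSoon is a tiny stand-in for testutils.SucceedsSoon: retry fn until
// it returns nil or a deadline expires.
func succeedsSoon(fn func() error) error {
    deadline := time.Now().Add(5 * time.Second)
    for {
        err := fn()
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return err
        }
        time.Sleep(10 * time.Millisecond)
    }
}

func main() {
    // received plays the role of the flag set inside the mock's Do closure.
    var received atomic.Bool
    go func() {
        // Simulates the warning being emitted asynchronously by the sink.
        time.Sleep(50 * time.Millisecond)
        received.Store(true)
    }()

    err := succeedsSoon(func() error {
        if !received.Load() {
            return errors.New("waiting for dropped oversized error log")
        }
        return nil
    })
    fmt.Println("succeedsSoon returned:", err)
}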

Contributor Author


Updated the test

Contributor

@aa-joshi aa-joshi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, and @dhartunian)


-- commits line 4 at r1:
Let's add more in-depth details about the issue and the resolution.


pkg/util/log/buffered_sink.go line 246 at r1 (raw file):

				// Set flag to log warning after releasing the lock (to avoid recursion/deadlock)
				logOversizedWarning = true
				maxSize = bs.mu.buf.maxSizeBytes

Is there any specific reason we are capturing the value in maxSize? Wouldn't this value be available at logging time?


pkg/util/log/buffered_sink.go line 303 at r1 (raw file):

	// Log the oversized message warning after releasing the lock to avoid deadlock.
	// This warning will be logged through the normal logging system and will appear

What do you mean by the normal logging system?

@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 75f2048 to 7cc4cb8 on November 20, 2025 10:22
Contributor Author

@Abhinav1299 Abhinav1299 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @dhartunian, and @kyle-a-wong)


-- commits line 4 at r1:

Previously, aa-joshi (Akshay Joshi) wrote…

Let's add more in-depth details about the issue and the resolution.

Updated.


pkg/util/log/buffered_sink.go line 246 at r1 (raw file):

Previously, aa-joshi (Akshay Joshi) wrote…

Is there any specific reason we are capturing the value in maxSize? Wouldn't this value be available at logging time?

We can't access bs.mu.buf.maxSizeBytes at logging time in Ops.Warningf() because we're outside the lock, and accessing mutex-protected data without holding the lock would be a data race. That's why we capture the buffer size in maxSize while we still safely hold the lock.


pkg/util/log/buffered_sink.go line 303 at r1 (raw file):

Previously, aa-joshi (Akshay Joshi) wrote…

What do you mean by the normal logging system?

By normal logging I mean this log will route through the OPS channel and will appear in the log files.
I'll update the wording here.


@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 7cc4cb8 to 2360b35 on November 21, 2025 05:23
@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 2360b35 to 003b9b2 on November 21, 2025 08:16
@Abhinav1299 Abhinav1299 added the backport-25.4.x and backport-25.4.1-rc labels Nov 21, 2025
Contributor

@kyle-a-wong kyle-a-wong left a comment


LGTM.

IMO we should backport this to all versions, using the backport-all label.

