
Conversation

@Abhinav1299
Contributor

@Abhinav1299 Abhinav1299 commented Nov 18, 2025

Previously, when a single log message exceeded the configured max-buffer-size
for a buffered sink with exit-on-error enabled, the error would propagate up
and trigger process termination. This was overly aggressive for what amounts
to a logging configuration issue - a single oversized SQL query (e.g., with a
multi-megabyte string literal) could crash an entire CockroachDB node.

This commit modifies bufferedSink.output() to detect the errMsgTooLarge error
and handle it gracefully. When an oversized message is encountered, instead of
propagating the error, we drop the message and log a warning via Ops.Warningf()
indicating that the message exceeded the buffer size limit. This allows the
node to continue operating normally while still providing visibility into the
issue through logged warnings.

The implementation uses a two-phase approach to avoid deadlock: first, while
holding the sink's mutex, we detect the oversized message and set a flag with
the relevant information; then, after releasing the lock, we emit the warning.
This is necessary because calling Ops.Warningf() while holding the mutex would
cause the warning message to attempt re-entry into the same sink, resulting in
a deadlock when it tries to acquire the already-held lock.
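In outline, the pattern looks roughly like the following. This is a condensed, self-contained sketch, not the actual diff: the sink type, its fields, and the Printf stand-in for Ops.Warningf below are illustrative only.

package main

import (
    "errors"
    "fmt"
    "sync"
)

// errMsgTooLarge mirrors the sentinel error returned when a single message
// exceeds the buffer's size limit.
var errMsgTooLarge = errors.New("message dropped because it is too large")

// sink is a stand-in for bufferedSink: a mutex-protected buffer with a cap.
type sink struct {
    mu struct {
        sync.Mutex
        maxSizeBytes int
    }
}

// output drops oversized messages instead of returning errMsgTooLarge, and
// defers the warning until after the mutex is released.
func (s *sink) output(msg []byte) error {
    var logOversizedWarning bool
    var maxSize int

    s.mu.Lock()
    var err error
    if len(msg) > s.mu.maxSizeBytes {
        err = errMsgTooLarge
    }
    if errors.Is(err, errMsgTooLarge) {
        // Phase 1 (under the lock): swallow the error so it never reaches the
        // exit-on-error path, and copy what the warning needs while the mutex
        // still guards the fields.
        logOversizedWarning = true
        maxSize = s.mu.maxSizeBytes
        err = nil
    }
    s.mu.Unlock()

    // Phase 2 (after unlocking): emit the warning. In the real change this is
    // Ops.Warningf, which routes back through the logging system and could
    // re-enter this sink, so it must not run while the mutex is held.
    if logOversizedWarning {
        fmt.Printf("dropping oversized log message (%d bytes > limit of %d bytes)\n",
            len(msg), maxSize)
    }
    return err
}

func main() {
    s := &sink{}
    s.mu.maxSizeBytes = 8
    _ = s.output([]byte("this message is longer than eight bytes"))
}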

This resolves #152635

Part of: CRDB-53951
Epic: CRDB-56325
Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@Abhinav1299 Abhinav1299 marked this pull request as ready for review November 18, 2025 06:26
@Abhinav1299 Abhinav1299 requested review from a team as code owners November 18, 2025 06:26
@Abhinav1299 Abhinav1299 requested review from aa-joshi, arjunmahishi, dhartunian and kyle-a-wong and removed request for a team November 18, 2025 06:26
@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 476e3ad to 75f2048 on November 18, 2025 06:42
Comment on lines 673 to 674
// Give a moment for any async warning to be logged.
time.Sleep(100 * time.Millisecond)
Contributor


Is this intended to give time for the mock.Do() function to be called? Instead, can we put a boolean flag into that closure and wait for it to be true? For example:

var receivedError bool
...
mock.
   ...
   Do(func() { ... receivedError = true ... })
...
testutils.SucceedsSoon(t, func() error {
  if !receivedError {
    return errors.New("waiting for dropped oversized error log")
  }
  return nil
})

I think this is less likely to cause a flake than using time.Sleep.
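Roughly, as a self-contained sketch of the same polling idea (the succeedsSoon helper below is only a stand-in for testutils.SucceedsSoon, and the goroutine stands in for the mocked sink call; since the warning is emitted asynchronously, an atomic flag avoids a data race on the signal):

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
    "time"
)

// succeedsSoon is a tiny stand-in for testutils.SucceedsSoon: retry fn until
// it returns nil or a deadline expires.
func succeedsSoon(fn func() error) error {
    deadline := time.Now().Add(5 * time.Second)
    for {
        err := fn()
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return err
        }
        time.Sleep(10 * time.Millisecond)
    }
}

func main() {
    // received plays the role of the flag set inside the mock's Do closure.
    var received atomic.Bool
    go func() {
        // Simulates the warning being emitted asynchronously by the sink.
        time.Sleep(50 * time.Millisecond)
        received.Store(true)
    }()

    err := succeedsSoon(func() error {
        if !received.Load() {
            return errors.New("waiting for dropped oversized error log")
        }
        return nil
    })
    fmt.Println("succeedsSoon returned:", err)
}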

Contributor Author


Updated the test

Contributor

@aa-joshi aa-joshi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, and @dhartunian)


-- commits line 4 at r1:
Let's add more in-depth details about the issue and the resolution.


pkg/util/log/buffered_sink.go line 246 at r1 (raw file):

				// Set flag to log warning after releasing the lock (to avoid recursion/deadlock)
				logOversizedWarning = true
				maxSize = bs.mu.buf.maxSizeBytes

Is there any specific reason we are capturing the value in maxSize? Wouldn't this value be available at logging time?


pkg/util/log/buffered_sink.go line 303 at r1 (raw file):

	// Log the oversized message warning after releasing the lock to avoid deadlock.
	// This warning will be logged through the normal logging system and will appear

What do you mean by the normal logging system?

@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 75f2048 to 7cc4cb8 on November 20, 2025 10:22
Contributor Author

@Abhinav1299 Abhinav1299 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @arjunmahishi, @dhartunian, and @kyle-a-wong)


-- commits line 4 at r1:

Previously, aa-joshi (Akshay Joshi) wrote…

Let's add more in-depth details about the issue and the resolution.

Updated.


pkg/util/log/buffered_sink.go line 246 at r1 (raw file):

Previously, aa-joshi (Akshay Joshi) wrote…

Is there any specific reason we are capturing the value in maxSize? Wouldn't this value be available at logging time?

We can't access bs.mu.buf.maxSizeBytes at logging time in Ops.Warningf() because we're outside the lock, and accessing mutex-protected data without holding the lock would be a data race. That's why we capture the buffer size in maxSize while we still safely hold the lock.


pkg/util/log/buffered_sink.go line 303 at r1 (raw file):

Previously, aa-joshi (Akshay Joshi) wrote…

What do you mean by the normal logging system?

By normal logging I mean this log will route through the OPS channel and will appear in the log files.
I'll update the wording here.


@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 7cc4cb8 to 2360b35 on November 21, 2025 05:23
@Abhinav1299 Abhinav1299 force-pushed the oversized-log-exit-on-error branch from 2360b35 to 003b9b2 on November 21, 2025 08:16
@Abhinav1299 Abhinav1299 added the backport-25.4.x and backport-25.4.1-rc labels Nov 21, 2025
Contributor

@kyle-a-wong kyle-a-wong left a comment


LGTM.

IMO we should backport this to all versions, using the backport-all label.

