log: gracefully handle "message dropped because it is too large" #152635

@kyle-a-wong

Description

Describe the problem

An incident occurred due to multiple nodes exiting with exit code 6: LoggingFileUnavailable. Grepping the logs showed that this error was causing the exits:

122363 1@util/log/event_log.go:43 ⋮ [-]   logging error: message dropped because it is too large

This is an error returned here. Because this function returns an error, the error propagates out of the sink's output function, which ultimately resulted in this line of code being executed because exit-on-error was set to true in the log config.
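
For illustration, here is a simplified Go sketch of this failure path. The names (bufferedSink, maxBufferSize, exitOnError, output) are stand-ins for the real CockroachDB internals, not the actual identifiers:

package main

import (
    "errors"
    "fmt"
    "os"
)

// errMsgTooLarge mirrors the "message dropped because it is too large"
// error; the real one lives in util/log.
var errMsgTooLarge = errors.New("message dropped because it is too large")

// bufferedSink is a stand-in for the buffered log sink.
type bufferedSink struct {
    buf           []byte
    maxBufferSize int
    exitOnError   bool
}

// append rejects any message that would overflow the buffer. Returning
// an error here is what ultimately propagates up to the exit decision.
func (s *bufferedSink) append(msg []byte) error {
    if len(s.buf)+len(msg) > s.maxBufferSize {
        return errMsgTooLarge
    }
    s.buf = append(s.buf, msg...)
    return nil
}

// output mimics the sink output path: on error, it reports the logging
// error and, with exit-on-error set, terminates the process.
func (s *bufferedSink) output(msg []byte) {
    if err := s.append(msg); err != nil {
        fmt.Fprintf(os.Stderr, "logging error: %v\n", err)
        if s.exitOnError {
            os.Exit(6) // analogous to exit code 6: LoggingFileUnavailable
        }
    }
}

func main() {
    s := &bufferedSink{maxBufferSize: 8, exitOnError: true}
    s.output([]byte("this message is larger than the buffer"))
}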

To Reproduce
Using a log config like the following:

sinks:
  file-groups:
    default:
      channels: [ALL]
      dir: logs
      format: crdb-v2-tty
      exit-on-error: true
      buffered-writes: false
      buffering:
        max-staleness: 1s
        flush-trigger-size: 256KiB
        max-buffer-size: 2MiB

  stderr:
    channels: [OPS] 
    format: crdb-v2-tty
  • Start a local cluster.
  • Run a query whose text is more than 2MiB.
    • This can just be something like SELECT <X>;, where <X> is some long string that makes the query text over 2MiB (see the sketch below).

With the above config, the server will crash. If you set exit-on-error: false, it will no longer crash.
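
As a sketch of the oversized-query step, the following Go program sends a >2MiB SELECT to a local cluster. The connection string and the lib/pq driver are assumptions, not part of the original report (any Postgres-compatible client would do):

package main

import (
    "database/sql"
    "log"
    "strings"

    _ "github.com/lib/pq"
)

func main() {
    // Assumes `cockroach start-single-node --insecure` listening on the
    // default port; connection details are illustrative.
    db, err := sql.Open("postgres",
        "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Build a SELECT whose text comfortably exceeds the 2MiB
    // max-buffer-size in the config above.
    payload := strings.Repeat("x", 3*1024*1024)
    if _, err := db.Exec("SELECT '" + payload + "'"); err != nil {
        log.Fatal(err)
    }
}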

Expected behavior
This shouldn't cause the node to crash. We should gracefully handle situations where a log entry is too big.

Some potential solutions may be:

  • In the buffered sink append function, don't return an error if the message is too big; instead, log the error and return.
  • The append function can return some type of "non-critical error" that clog.outputLogEntry checks for, exiting only for critical errors (see the sketch after this list).
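
A sketch of the second option, using a wrapped sentinel error. errNonCritical, msgTooLargeErr, and the simplified outputLogEntry are hypothetical names, not the existing clog API:

package main

import (
    "errors"
    "fmt"
)

// errNonCritical marks logging errors that should be reported but must
// never trigger exit-on-error. The name is hypothetical.
var errNonCritical = errors.New("non-critical logging error")

func msgTooLargeErr(size int) error {
    return fmt.Errorf("%w: message dropped because it is too large (%d bytes)",
        errNonCritical, size)
}

// outputLogEntry stands in for clog.outputLogEntry: it treats an error
// as fatal only when it is not wrapped as non-critical.
func outputLogEntry(appendErr error, exitOnError bool) {
    if appendErr == nil {
        return
    }
    fmt.Println("logging error:", appendErr)
    if exitOnError && !errors.Is(appendErr, errNonCritical) {
        fmt.Println("would exit with code 6: LoggingFileUnavailable")
    }
}

func main() {
    // The oversized-message error is reported but does not exit.
    outputLogEntry(msgTooLargeErr(3<<20), true)
}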

Additional context

  • It seems likely that the customer had the cluster setting sql.log.all_statements.enabled set to true, which results in all executed queries being logged (QueryExecuted events). This includes both queries executed by clients and internal query executions.

  • We aren't able to see which log message actually causes this error. If possible, we should log part of the dropped message to help debug what logs are causing this in the future (see the sketch after this list).

  • The customer had the max buffer size set to 50MiB (I believe this is the default for all cloud clusters). This means the entry being logged had to be bigger than that, which is pretty crazy. Besides QueryExecuted structured events, I'm not sure how a log could be this big.
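
On the debuggability point above, one possible shape for surfacing part of the dropped message. tooLargeErr and maxPreviewBytes are illustrative, not existing code, and in real code the preview would likely need redaction:

package main

import (
    "bytes"
    "fmt"
)

// maxPreviewBytes bounds how much of the dropped message is echoed back
// in the error; the value is an illustrative choice.
const maxPreviewBytes = 256

// tooLargeErr is a hypothetical replacement for the bare error that also
// reports the size and a bounded prefix of the offending message.
func tooLargeErr(msg []byte) error {
    preview := msg
    if len(preview) > maxPreviewBytes {
        preview = preview[:maxPreviewBytes]
    }
    return fmt.Errorf(
        "message dropped because it is too large (%d bytes); prefix: %q",
        len(msg), preview)
}

func main() {
    fmt.Println(tooLargeErr(bytes.Repeat([]byte("x"), 3<<20)))
}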

Jira issue: CRDB-53951

Epic: CRDB-56325

Labels

C-bug (code not up to spec/doc; solution expected to change code/behavior), O-postmortem (originated from a postmortem action item), T-supportability, branch-master (failures and bugs on the master branch)
