log: gracefully handle "message dropped because it is too large" #152635

@kyle-a-wong

Description

Describe the problem

An incident occurred due to multiple nodes exiting with exit code 6: LoggingFileUnavailable. Grepping the logs showed that this error was causing the exits:

122363 1@util/log/event_log.go:43 ⋮ [-]   logging error: message dropped because it is too large

This is an error returned here. Because this function returns an error, the error propagates out of the sink's output function, which ultimately resulted in this line of code being executed because exit-on-error was set to true in the log config.
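
For illustration, here is a simplified Go sketch of this failure path. The names (bufferedSink, maxBufferSize, exitOnError, output) are stand-ins for the real CockroachDB internals, not the actual identifiers:

package main

import (
    "errors"
    "fmt"
    "os"
)

// errMsgTooLarge mirrors the "message dropped because it is too large"
// error; the real one lives in util/log.
var errMsgTooLarge = errors.New("message dropped because it is too large")

// bufferedSink is a stand-in for the buffered log sink.
type bufferedSink struct {
    buf           []byte
    maxBufferSize int
    exitOnError   bool
}

// append rejects any message that would overflow the buffer. Returning
// an error here is what ultimately propagates up to the exit decision.
func (s *bufferedSink) append(msg []byte) error {
    if len(s.buf)+len(msg) > s.maxBufferSize {
        return errMsgTooLarge
    }
    s.buf = append(s.buf, msg...)
    return nil
}

// output mimics the sink output path: on error, it reports the logging
// error and, with exit-on-error set, terminates the process.
func (s *bufferedSink) output(msg []byte) {
    if err := s.append(msg); err != nil {
        fmt.Fprintf(os.Stderr, "logging error: %v\n", err)
        if s.exitOnError {
            os.Exit(6) // analogous to exit code 6: LoggingFileUnavailable
        }
    }
}

func main() {
    s := &bufferedSink{maxBufferSize: 8, exitOnError: true}
    s.output([]byte("this message is larger than the buffer"))
}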

To Reproduce
Using a log config like the following:

sinks:
  file-groups:
    default:
      channels: [ALL]
      dir: logs
      format: crdb-v2-tty
      exit-on-error: true
      buffered-writes: false
      buffering:
        max-staleness: 1s
        flush-trigger-size: 256KiB
        max-buffer-size: 2MiB

  stderr:
    channels: [OPS] 
    format: crdb-v2-tty
  • Start a local cluster.
  • Run a query whose text is more than 2MiB.
    • This can just be something like SELECT <X>;, where <X> is some long string that makes the query text over 2MiB (see the sketch below).

With the above config, the server will crash. If you set exit-on-error: false, it will no longer crash.
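
As a sketch of the oversized-query step, the following Go program sends a >2MiB SELECT to a local cluster. The connection string and the lib/pq driver are assumptions, not part of the original report (any Postgres-compatible client would do):

package main

import (
    "database/sql"
    "log"
    "strings"

    _ "github.com/lib/pq"
)

func main() {
    // Assumes `cockroach start-single-node --insecure` listening on the
    // default port; connection details are illustrative.
    db, err := sql.Open("postgres",
        "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Build a SELECT whose text comfortably exceeds the 2MiB
    // max-buffer-size in the config above.
    payload := strings.Repeat("x", 3*1024*1024)
    if _, err := db.Exec("SELECT '" + payload + "'"); err != nil {
        log.Fatal(err)
    }
}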

Expected behavior
This shouldn't cause the node to crash. We should gracefully handle situations where a log entry is too big.

Some potential solutions may be:

  • In the buffered sink append function, don't return an error if the message is too big; instead, log the error and return.
  • The append function can return some type of "non-critical error" that clog.outputLogEntry checks for, exiting only for critical errors (see the sketch after this list).
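
A sketch of the second option, using a wrapped sentinel error. errNonCritical, msgTooLargeErr, and the simplified outputLogEntry are hypothetical names, not the existing clog API:

package main

import (
    "errors"
    "fmt"
)

// errNonCritical marks logging errors that should be reported but must
// never trigger exit-on-error. The name is hypothetical.
var errNonCritical = errors.New("non-critical logging error")

func msgTooLargeErr(size int) error {
    return fmt.Errorf("%w: message dropped because it is too large (%d bytes)",
        errNonCritical, size)
}

// outputLogEntry stands in for clog.outputLogEntry: it treats an error
// as fatal only when it is not wrapped as non-critical.
func outputLogEntry(appendErr error, exitOnError bool) {
    if appendErr == nil {
        return
    }
    fmt.Println("logging error:", appendErr)
    if exitOnError && !errors.Is(appendErr, errNonCritical) {
        fmt.Println("would exit with code 6: LoggingFileUnavailable")
    }
}

func main() {
    // The oversized-message error is reported but does not exit.
    outputLogEntry(msgTooLargeErr(3<<20), true)
}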

Additional context

  • It seems likely that the customer had the cluster setting sql.log.all_statements.enabled set to true, which results in all executed queries being logged (QueryExecuted events). This includes both queries executed by clients and internal query executions.

  • We aren't able to see which log message actually causes this error. If possible, we should log part of the dropped message to help debug what logs are causing this in the future (see the sketch after this list).

  • The customer had the max buffer size set to 50MiB (I believe this is the default for all cloud clusters). This means the entry being logged had to be bigger than that, which is pretty crazy. Besides QueryExecuted structured events, I'm not sure how a log could be this big.
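
On the debuggability point above, one possible shape for surfacing part of the dropped message. tooLargeErr and maxPreviewBytes are illustrative, not existing code, and in real code the preview would likely need redaction:

package main

import (
    "bytes"
    "fmt"
)

// maxPreviewBytes bounds how much of the dropped message is echoed back
// in the error; the value is an illustrative choice.
const maxPreviewBytes = 256

// tooLargeErr is a hypothetical replacement for the bare error that also
// reports the size and a bounded prefix of the offending message.
func tooLargeErr(msg []byte) error {
    preview := msg
    if len(preview) > maxPreviewBytes {
        preview = preview[:maxPreviewBytes]
    }
    return fmt.Errorf(
        "message dropped because it is too large (%d bytes); prefix: %q",
        len(msg), preview)
}

func main() {
    fmt.Println(tooLargeErr(bytes.Repeat([]byte("x"), 3<<20)))
}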

Jira issue: CRDB-53951

Epic: CRDB-56325

Labels

C-bug (code not up to spec/doc; solution expected to change code/behavior), O-postmortem (originated from a postmortem action item), T-supportability, branch-master (failures and bugs on the master branch)
