Describe the problem
An incident occurred due to multiple nodes exiting with exit code 6: LoggingFileUnavailable. Grepping the logs showed that the following error was causing the exit:
```
122363 1@util/log/event_log.go:43 ⋮ [-] logging error: message dropped because it is too large
```
This is an error returned here. Because this function returns an error, the error propagates out of the sink's output function, which ultimately resulted in this line of code being executed, since `exit-on-error` was set to `true` in the log config.
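To make the failure path concrete, here is a minimal sketch of the control flow described above. `tryAppend`, `outputLogEntry`, and `errMsgTooLarge` are hypothetical stand-ins for the util/log internals, not the actual clog API:

```go
// A sketch of the failure path: an oversized entry can never fit in the
// buffer, so the append returns an error, and with exit-on-error: true the
// caller treats any sink error as fatal.
package main

import (
	"errors"
	"os"
)

var errMsgTooLarge = errors.New("message dropped because it is too large")

// tryAppend stands in for the buffered sink's append: an entry larger than
// max-buffer-size is dropped with an error.
func tryAppend(buf *[]byte, msg []byte, maxBufferSize int) error {
	if len(msg) > maxBufferSize {
		return errMsgTooLarge
	}
	*buf = append(*buf, msg...)
	return nil
}

// outputLogEntry stands in for the caller: with exit-on-error: true, any
// error coming back from the sink's output path terminates the process.
func outputLogEntry(buf *[]byte, msg []byte, maxBufferSize int, exitOnError bool) {
	if err := tryAppend(buf, msg, maxBufferSize); err != nil && exitOnError {
		os.Exit(6) // the exit code 6 (LoggingFileUnavailable) seen in the incident
	}
}

func main() {
	var buf []byte
	huge := make([]byte, 3<<20)             // a 3MiB entry
	outputLogEntry(&buf, huge, 2<<20, true) // exits with code 6
}
```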
To Reproduce
Using a log config along the lines of the following:
```yaml
sinks:
  file-groups:
    default:
      channels: [ALL]
      dir: logs
      format: crdb-v2-tty
      exit-on-error: true
      buffered-writes: false
      buffering:
        max-staleness: 1s
        flush-trigger-size: 256KiB
        max-buffer-size: 2MiB
  stderr:
    channels: [OPS]
    format: crdb-v2-tty
```
- Start a local cluster.
- Run a query whose text is more than 2MiB. This can just be something like `SELECT <X>;`, where `<X>` is some long string that makes the query over 2MiB (a scripted version is sketched below).

With the above config, the server will crash. If you set `exit-on-error: false`, it will no longer crash.
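For a scripted reproduction, a minimal sketch, assuming a local insecure cluster listening on the default port 26257 and the github.com/lib/pq driver; the payload size just needs to exceed the 2MiB `max-buffer-size` above:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Build a statement whose text is ~3MiB, larger than max-buffer-size,
	// so the logged query event cannot fit in the sink's buffer.
	payload := strings.Repeat("a", 3*1024*1024)
	if _, err := db.Exec(fmt.Sprintf("SELECT '%s'", payload)); err != nil {
		log.Fatal(err)
	}
}
```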
Expected behavior
This shouldn't cause the node to crash. We should gracefully handle situations where a log entry is too big.
Some potential solutions may be:
- In the buffered sink's append function, don't return an error if the message is too big; instead we may want to log the error and return.
- The append function can return some type of "non-critical error" that clog.outputLogEntry checks for, exiting only for critical errors (see the sketch after this list).
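A minimal sketch of the second option, using a sentinel error checked with errors.Is; `errMsgTooLarge` and `handleSinkError` are hypothetical stand-ins, not the actual clog names:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errMsgTooLarge = errors.New("message dropped because it is too large")

// handleSinkError classifies errors from the sink's output path: an
// oversized entry is reported and dropped, while other errors still
// honor exit-on-error.
func handleSinkError(err error, exitOnError bool) {
	switch {
	case err == nil:
		return
	case errors.Is(err, errMsgTooLarge):
		// Non-critical: record the drop and keep the node alive.
		fmt.Fprintln(os.Stderr, "logging error:", err)
	case exitOnError:
		os.Exit(6) // only genuinely critical sink failures terminate
	}
}

func main() {
	handleSinkError(errMsgTooLarge, true) // prints the error; the node survives
}
```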
Additional context
- It seems likely that the customer had the cluster setting `sql.log.all_statements.enabled` set to `true`, which results in all executed queries being logged (`QueryExecuted` events). This includes both queries executed from clients as well as internal query executions.
- We aren't able to actually see which log entry causes this error. If possible, we should see if we can log part of the entry to help debug what logs are causing this in the future (a truncation sketch follows this list).
- The customer had the max buffer size set to `50MiB` (I believe this is the default for all cloud clusters). This means that the entry being logged had to be bigger than this, which is pretty crazy. Besides `QueryExecuted` structured events, I'm not sure how a log could be this big.
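For the debugging point above, a minimal sketch of keeping only a prefix of the dropped entry when reporting it; the 1KiB cap and the function name are assumptions for illustration:

```go
package main

import (
	"bytes"
	"fmt"
)

// truncateForDebug returns a bounded prefix of an oversized entry so we can
// tell what kind of message blew past max-buffer-size without re-logging
// the whole thing.
func truncateForDebug(msg []byte) string {
	const keep = 1024
	if len(msg) <= keep {
		return string(msg)
	}
	return fmt.Sprintf("%s... (%d bytes total)", msg[:keep], len(msg))
}

func main() {
	huge := bytes.Repeat([]byte("a"), 3<<20) // a 3MiB entry
	fmt.Println(truncateForDebug(huge))
}
```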
Jira issue: CRDB-53951
Epic: CRDB-56325