.TH MCELOG 8 "Mar 2015" "" "Linux Administrator's Manual"
.SH NAME
mcelog \- Decode kernel machine check log on x86 machines
.SH SYNOPSIS
mcelog [options] [device]
.br
mcelog [options] \-\-daemon
.br
mcelog [options] \-\-client
.br
mcelog [options] \-\-ascii
.br
.\"mcelog [options] \-\-drop-old-memory
.\".br
.\"mcelog [options] \-\-reset-memory locator
.\".br
.\"mcelog [options] \-\-dump-memory[=locator]
.br
mcelog [options] \-\-is\-cpu\-supported
.br
mcelog \-\-version
.SH DESCRIPTION
X86 CPUs report errors detected by the CPU as
.I machine check events (MCEs).
These can be data corruption detected in the CPU caches,
in main memory by an integrated memory controller, data
transfer errors on the front side bus or CPU interconnect or other internal
errors.
Possible causes include cosmic radiation, unstable power supplies,
cooling problems, broken hardware, running systems out of specification,
or bad luck.
Most errors can be corrected by the CPU by internal error correction
mechanisms. Uncorrected errors cause machine check exceptions which
may kill processes or panic the machine. A small number of corrected
errors is usually not a cause for worry, but a large number can indicate
future failure.
When a corrected or recovered error happens, the x86 kernel writes a record describing
the MCE into an internal ring buffer available through the
.I /dev/mcelog
device.
.I mcelog
retrieves errors from
.I /dev/mcelog,
decodes them into a human readable format and prints them
on the standard output or optionally into the system log.
It can also take further actions, such as keeping statistics or
triggering shell scripts on specific events. By default mcelog
supports offlining memory pages with persistent corrected errors,
offlining CPU cores that have developed cache problems,
and otherwise logging specific events to the system log after
they cross a threshold.
The normal operating modes for mcelog are: running
as a regular cron job (traditional way, deprecated),
running as a trigger directly executed by the kernel,
or running as a daemon with the
.I \-\-daemon
option.
When an uncorrected machine check error happens that the kernel
cannot recover from, it will usually panic the system.
If a warm reset follows the panic,
mcelog should pick up the machine check errors after reboot.
This is not possible after a cold reset.
In addition mcelog can be used on the command line to decode the kernel
output for a fatal machine check panic in text format using the
.I \-\-ascii
option. This is typically used to decode the panic console output of a fatal
machine check, if the system was power cycled or mcelog didn't
run immediately after reboot.
When the panic triggers a kdump kexec crash kernel, the crash
kernel boot up script should log the machine checks to disk; otherwise
they might be lost.
Note that after mcelog retrieves an error the kernel no longer
stores it (unlike
.IR dmesg(1) ),
so the output should always be saved somewhere and mcelog
should not be run in uncontrolled ways.
When invoked with the
.I \-\-is\-cpu\-supported
option mcelog exits with code 0 if the current CPU is supported, 1 otherwise.
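
The exit code described above can be checked from a script. The following is a minimal sketch, not part of mcelog; the function name cpu_supported is hypothetical, and treating a missing mcelog binary as "not supported" is an assumption made for illustration.

.nf
```python
import subprocess
import sys


def cpu_supported(command=("mcelog", "--is-cpu-supported")):
    """Return True if the command exits with code 0, False otherwise."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # Assumption: a missing binary is treated like an unsupported CPU.
        return False
    return result.returncode == 0
```
.fi

A boot script could call cpu_supported() and start the daemon only when it returns True.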
.SH OPTIONS
When the
.B \-\-syslog
option is specified, mcelog redirects output to the system log. The
.B \-\-syslog-error
option causes the normal machine checks to be logged as
.I LOG_ERR
(implies
.IR \-\-syslog ).
Normally only fatal errors or high level remarks are logged with error level.
High level one line summaries of specific errors are also logged to the syslog by
default unless mcelog operates in
.I \-\-ascii
mode.
When the
.B \-\-logfile=file
option is specified, mcelog appends log output to the specified file. With the
.B \-\-no-syslog
option mcelog will never log anything to the syslog.
When the
.B \-\-cpu=cputype
option is specified, mcelog decodes events as coming from a CPU of type
.IR cputype .
See
.I mcelog \-\-help
for a list of valid CPUs.
Note that specifying an incorrect CPU can lead to incorrect decoding output.
Default is either the CPU of the machine that reported the machine check (needs
a newer kernel version) or the CPU of the machine mcelog is running on, so normally
this option doesn't have to be used. Older versions of mcelog had separate
options for different CPU types. These are still implemented, but deprecated
and undocumented now.
With the
.B \-\-dmi
option mcelog will look up the DIMMs reported in machine
checks in the
.I SMBIOS/DMI
tables of the BIOS and map the DIMMs to board identifiers.
This only works when the BIOS reports the identifiers correctly.
Unfortunately often the information reported
by the BIOS is either subtly or obviously wrong or useless.
This option requires that mcelog has read access to /dev/mem
(normally requires root) and runs on the same machine
in the same hardware configuration as when the machine check
event happened.
When
.B \-\-ignorenodev
is specified, mcelog will exit silently when the device
cannot be opened. This is useful in virtualized environments
with limited devices.
When
.B \-\-filter
is specified
.I mcelog
will filter out known broken machine check events (default on). When the
.B \-\-no-filter
option is specified mcelog does not filter events.
When
.B \-\-raw
is specified
.I mcelog
will not decode, but just dump the mcelog in a raw hex format. This
can be useful for automatic post processing.
When a device is specified the machine check logs are read from
device instead of the default
.I /dev/mcelog.
With the
.B \-\-ascii
option mcelog decodes a fatal machine check panic generated
by the kernel ("CPU n: Machine Check Exception ...") in ASCII from standard input
and exits afterwards.
Note that when the panic comes from a different machine than
the one mcelog is running on, you might need to specify the correct
cputype on older kernels. On newer kernels which output the
.I PROCESSOR
field this is no longer needed.
When the
.B \-\-file filename
option is specified
.I mcelog \-\-ascii
will read the ASCII machine check record from input file
.I filename
instead of standard input.
With the
.B \-\-config-file file
option mcelog reads the specified config file.
The default is
.IR /etc/mcelog/mcelog.conf .
See also
.I CONFIG FILE
below.
With the
.B \-\-daemon
option mcelog will run in the background. This gives the fastest reaction
time and is the recommended operating mode.
If no output option is selected
.RI ( \-\-logfile ,
.I \-\-syslog
or
.IR \-\-syslog-error ),
this option implies
.IR \-\-logfile=/var/log/mcelog .
Important messages will be logged as one-liner summaries to syslog
unless
.I \-\-no-syslog
is given.
The option
.I \-\-foreground
will prevent mcelog from giving up the terminal in daemon mode. This
is intended for debugging.
With the
.B \-\-client
option mcelog will query a running daemon for accumulated errors.
With the
.B \-\-cpumhz=mhz
option mcelog assumes the CPU runs at
.I mhz
MHz when decoding the time of the event from the CPU time stamp
counter. This also forces decoding. Note that this can be unreliable
on some systems with CPU frequency scaling or deep C states, where
the CPU time stamp counter does not increase linearly.
By default the frequency of the current CPU is used when mcelog
determines it is safe to use. Newer kernels report
the time directly in the event and no longer need this.
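
To illustrate why the frequency matters, the sketch below converts a time stamp counter value into an uptime. It is not mcelog's own decoder. It assumes the counter started at zero at boot and ticks at a constant rate of mhz million cycles per second, which is exactly the assumption that frequency scaling or deep C states can break. The function name tsc_to_uptime is hypothetical.

.nf
```python
def tsc_to_uptime(tsc, mhz):
    """Convert a time stamp counter value to (days, hours, minutes, seconds).

    Assumes a counter that starts at zero and ticks at a constant
    rate of mhz * 1,000,000 cycles per second.
    """
    if mhz <= 0:
        raise ValueError("mhz must be positive")
    total_seconds = int(tsc / (mhz * 1_000_000))
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds
```
.fi

With a wrong mhz value the result scales by the same factor, so a counter decoded at half the real frequency reports twice the real uptime.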
The
.B \-\-pidfile file
option writes the process id of the daemon into file
.I file.
Only valid in daemon mode.
Mcelog will enable extended error reporting from the memory
controller on processors that support it unless you tell it
not to with the
.B \-\-no-imc-log
option. You might need this option when decoding old logs
from a system where this mode was not enabled.
.\".B \-\-database filename
.\"specifies the memory module error database file. Default is
.\"/var/lib/memory-errors. It is only used together with DMI decoding.
.\"
.\"
.\".B \-\-error\-trigger=cmd,thresh
.\"When a memory module accumulates
.\".I thresh
.\"errors in the err database run command
.\".I cmd.
.\"
.\".B \-\-drop-old-memory
.\"Drop old DIMMs in the memory module database that are not plugged in
.\"anymore.
.\"
.\".B \-\-reset\-memory=locator
.\"When the DIMMs have suitable unique serial numbers mcelog
.\"will automatically detect changed DIMMs. When the DIMMs don't
.\"have those the user will have to use this option when changing
.\"a DIMM to reset the error count in the error database.
.\".I Locator
.\"is the memory slot identifier printed on the motherboard.
.\"
.\".B \-\-dump-memory[=locator]
.\"Dump error database information for memory module located
.\"at
.\".I locator.
.\"When no locator is specified dump all.
.B \-\-version
displays the version of mcelog and exits.
.SH CONFIG FILE
mcelog supports a config file to set defaults. Command line options override
the config file. By default the config file is read from
.I /etc/mcelog/mcelog.conf
unless overridden with the
.I \-\-config-file
option.

The general format is
.br
.I optionname = value
.br
White space is currently not allowed in the value, except at the end, where it is dropped.
Comments start with #.

All command line options that are not commands can be specified in the config file.
For example, to enable the
.I \-\-no-syslog
option use
.I no-syslog = yes
(or no to disable). When the option takes an argument, use, for example,
.I logfile = /tmp/logfile

For more information on the config file please see
.BR mcelog.conf (5).
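
The format rules above can be sketched as a small parser. This is an illustration only, not mcelog's own parser, and it handles only the optionname = value lines described here. The function name parse_config is hypothetical, and treating # anywhere on a line as the start of a comment is an assumption.

.nf
```python
def parse_config(text):
    """Parse optionname = value lines into a dict, skipping comments."""
    options = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Assumption: a comment may start anywhere on the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        # strip() also drops the trailing white space after the value.
        name, value = name.strip(), value.strip()
        if not sep or not name or not value:
            raise ValueError(f"line {lineno}: expected optionname = value")
        if any(ch.isspace() for ch in value):
            raise ValueError(f"line {lineno}: white space is not allowed in value")
        options[name] = value
    return options
```
.fi

Given the two example lines from this section, the sketch returns a dict mapping no-syslog to yes and logfile to /tmp/logfile.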
.SH NOTES
The kernel prefers old messages over new. If the log buffer overflows,
only old ones will be kept.

The exact output in the log file depends on the CPU, unless the \-\-raw option is used.

mcelog will report serious errors to the syslog during decoding.
.SH SIGNALS
When
.I mcelog
runs in daemon mode and receives a
.I SIGUSR1
it will close and reopen the log files. This can be used to rotate logs without
restarting the daemon.
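
A log rotation step built on this signal might look like the sketch below. It is not shipped with mcelog. The function name rotate_log and the .1 suffix are hypothetical, and it assumes the daemon wrote its process id to the pid file, which requires the
.I \-\-pidfile
option.

.nf
```python
import os
import signal


def rotate_log(logfile="/var/log/mcelog", pidfile="/var/run/mcelog.pid"):
    """Rename the log file, then ask the daemon to reopen its logs."""
    rotated = logfile + ".1"
    os.rename(logfile, rotated)
    with open(pidfile) as handle:
        pid = int(handle.read().strip())
    # On SIGUSR1 the daemon closes and reopens its log files,
    # so new records go to a fresh file at the original path.
    os.kill(pid, signal.SIGUSR1)
    return rotated
```
.fi

Renaming before signalling means records written between the two steps still land in the renamed file.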
.SH FILES
/dev/mcelog (char 10, minor 227)
.br
/etc/mcelog/mcelog.conf
.br
/var/log/mcelog
.br
/var/run/mcelog.pid
.\"/var/lib/memory-errors
.SH SEE ALSO
.BR mcelog.conf (5),
.BR mcelog.triggers (5)
.br
http://www.mcelog.org
.br
AMD x86-64 Architecture Programmer's Manual, Volume 2, System Programming
.br
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, System Programming Guide,
Chapters 15 and 16. http://www.intel.com/sdm
.br
Datasheet of your CPU.