.\"
.\" Copyright (c) 2018-2021, Red Hat, Inc.
.\"
.\" Permission to use, copy, modify, and/or distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND RED HAT, INC. DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL RED HAT, INC. BE LIABLE
.\" FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
.\" OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
.\" CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.\" Author: Jan Friesse <[email protected]>
.\"
.Dd Jul 15, 2021
.Dt SPAUSEDD 8
.Os
.Sh NAME
.Nm spausedd
.Nd Utility to detect and log scheduler pause
.Sh SYNOPSIS
.Nm
.Op Fl dDfhp
.Op Fl m Ar steal_threshold
.Op Fl P Ar mode
.Op Fl t Ar timeout
.Sh DESCRIPTION
The
.Nm
utility detects and logs scheduler pauses, that is, periods when a process
should have been scheduled but was not. It can also use steal
time (time spent in other operating systems when running in a virtualized
environment), so it is (to some extent) able to tell whether the problem is on the VM
or the host side.
.Nm
can read steal time information either from the kernel or (if compiled in)
from VMGuestLib.
Internally
.Nm
works as described by the following pseudocode:
.Bd -literal -offset indent
repeat:
    store current monotonic time
    store current steal time
    sleep for (timeout / 3)
    set time_diff to (current monotonic time - stored monotonic time)
    if time_diff > timeout:
        display error
        set steal_time_diff to (current steal time - stored steal time)
        if (steal_time_diff / time_diff) * 100 > steal_threshold:
            display steal time error
.Ed
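.Pp
As a rough illustration only, one iteration of the pseudocode above might look
like the following Python sketch. The names read_steal_seconds and poll_once are
hypothetical. Reading the steal counter from the first line of /proc/stat is an
assumption about where a Linux kernel exposes steal time, not a description of
the actual implementation. Times are in seconds here, not the milliseconds
taken by the timeout option.
.Bd -literal -offset indent
```python
import os
import time


def read_steal_seconds(stat_path="/proc/stat"):
    """Return aggregate steal time in seconds, or 0.0 if unavailable.

    Assumption: the first line is "cpu ..." and steal is the 8th counter
    after the label, counted in clock ticks.
    """
    try:
        with open(stat_path) as stat_file:
            fields = stat_file.readline().split()
    except OSError:
        return 0.0
    if len(fields) < 9 or fields[0] != "cpu":
        return 0.0
    return int(fields[8]) / os.sysconf("SC_CLK_TCK")


def poll_once(timeout, steal_threshold, clock=time.monotonic,
              steal=read_steal_seconds, sleep=time.sleep):
    """Run one iteration and return the messages it would log."""
    start_time = clock()
    start_steal = steal()
    sleep(timeout / 3)
    time_diff = clock() - start_time
    messages = []
    if time_diff > timeout:
        # Steal time is only examined once a pause has been detected.
        steal_diff = steal() - start_steal
        steal_pct = steal_diff / time_diff * 100
        messages.append("Not scheduled for %.4fs (threshold is %.4fs), "
                        "steal time is %.4fs (%.2f%%)"
                        % (time_diff, timeout, steal_diff, steal_pct))
        if steal_pct > steal_threshold:
            messages.append("Steal time is > %.1f%%" % steal_threshold)
    return messages
```
.Ed
.Pp
Calling poll_once(0.2, 10) repeatedly in a loop corresponds to the defaults
described below. The clock, steal and sleep parameters exist only so the
sketch can be exercised with fake values.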
.Pp
.Nm
arguments are as follows:
.Bl -tag -width Ds
.It Fl d
Display debug messages (specify twice to also display trace messages).
.It Fl D
Run in the background (daemonize).
.It Fl f
Run in the foreground (do not daemonize; this is the default).
.It Fl h
Show help.
.It Fl p
Do not set the RR (round-robin realtime) scheduler.
.It Fl m Ar steal_threshold
Set the steal threshold in percent (default is 10 if kernel information is used and
100 if VMGuestLib is used).
.It Fl P Ar mode
Set the mode for moving the process to the root cgroup. The default is
.Cm auto ,
which first checks whether setting the RR scheduler is enabled. If so, it tries to set the RR scheduler.
If this fails, the process is moved to the root cgroup and setting the RR scheduler is retried.
The other options are
.Cm on ,
where the process is always moved to the root cgroup, and
.Cm off ,
which makes
.Nm
never try to move the process into the root cgroup.
It's worth noting that currently (May 3 2021) cgroup v2 doesn't yet
support control of realtime processes and, for systems with the CONFIG_RT_GROUP_SCHED
kernel option enabled, the cpu controller can only be
enabled when all RT processes are in the root cgroup. So when moving to the
root cgroup is disabled and
.Nm
is used together with systemd, it may be
impossible to make systemd options like CPUQuota work correctly until
.Nm
is stopped.
Also, when moving to the root cgroup is used together with cgroup v2 and systemd,
it is impossible (most of the time) for journald to add systemd-specific
metadata (most importantly _SYSTEMD_UNIT) properly, because
.Nm
is moved out of the cgroup created by systemd. This means
it is not possible to filter
.Nm
log messages based on this metadata
(for example using the -u option or a _SYSTEMD_UNIT=UNIT pattern), and
systemctl status doesn't display (all)
.Nm
log messages.
The problem is even worse because journald caches the pid for some time
(approx. 5 sec), so the initial
.Nm
messages have correct metadata.
.It Fl t Ar timeout
Set timeout value in milliseconds (default 200).
.El
.Pp
If
.Nm
receives a SIGUSR1 signal, the current statistics are shown.
.Sh EXAMPLES
To generate CPU load,
.Xr yes 1
together with
.Xr chrt 1
is used in the following examples:
.Pp
.Dl chrt -r 99 yes >/dev/null &
.Pp
If chrt fails, it may help to use
.Xr cgexec 1
like:
.Pp
.Dl cgexec -g cpu:/ chrt -r 99 yes >/dev/null &
.Pp
The first example is a physical or virtual machine with 4 CPU threads, so
.Xr yes 1
was executed 4 times. After a while
.Nm
should start logging messages similar to:
.Pp
.Dl Mar 20 15:01:54 spausedd: Running main poll loop with maximum timeout 200 and steal threshold 10%
.Dl Mar 20 15:02:15 spausedd: Not scheduled for 0.2089s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
.Dl Mar 20 15:02:16 spausedd: Not scheduled for 0.2258s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
.Dl ...
.Pp
This means that
.Nm
did not get time to run for longer than the default timeout. It is also visible
that steal time was 0%, so
.Nm
is running either on a physical machine or on a VM whose host machine is not overloaded
(the VM was scheduled on time).
.Pp
The second example is a host machine with 2 CPU threads running one VM. The VM is running
an instance of
.Nm .
Two instances of
.Xr yes 1
were executed on the host machine. After a while
.Nm
should start logging messages similar to:
.Pp
.Dl Mar 20 15:08:20 spausedd: Not scheduled for 0.9598s (threshold is 0.2000s), steal time is 0.7900s (82.31%)
.Dl Mar 20 15:08:20 spausedd: Steal time is > 10.0%, this is usually because of overloaded host machine
.Dl ...
.Pp
This means that
.Nm
did not get time to run for almost one second. Because steal time is
high, this means that
.Nm
was not scheduled because the VM was not scheduled by the host machine.
.Sh DIAGNOSTICS
.Ex -std
.Sh AUTHORS
The
.Nm
utility was written by
.An Jan Friesse Aq [email protected] .
.Sh BUGS
.Bl -dash
.It
The OS does not update steal time as often as the monotonic clock. This means that the steal
time difference can be (and very often is) bigger than the monotonic clock difference,
so the steal time percentage can be bigger than 100%. This happens very often on a
highly overloaded host machine when
.Nm
is called with a small timeout. This problem is even bigger when VMGuestLib is used.
.It
VMGuestLib seems to randomly block.
.El
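.Pp
The first item above can be illustrated with a small Python calculation. This
is a sketch only: steal_percent is a hypothetical helper that repeats the
percentage formula from the DESCRIPTION pseudocode. The first call uses the
values from the second example log line. The second call uses made-up
values in which a late steal counter update covers more than the measured
interval.
.Bd -literal -offset indent
```python
def steal_percent(steal_diff, time_diff):
    """Share of the measured interval reported as steal time, in percent."""
    return steal_diff / time_diff * 100


# Values from the second example: 0.7900s of steal in a 0.9598s pause.
print("%.2f%%" % steal_percent(0.7900, 0.9598))

# Made-up values: the steal counter advanced by 0.30s while only 0.25s
# passed on the monotonic clock, so the result is above 100%.
print("%.2f%%" % steal_percent(0.30, 0.25))
```
.Ed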