Commit 76d6b8c

SONiC System Health Monitor HLD (#624)
Add an HLD doc for SONiC System Health Monitor Service.
1 parent e1fa5a6 commit 76d6b8c

1 file changed

+319
-0
lines changed

@@ -0,0 +1,319 @@
1+
# SONiC System Health Monitor High Level Design #
2+
3+
### Revision ###
4+
5+
| Rev | Date | Author | Change Description |
6+
|:---:|:-----------:|:------------------:|-----------------------------------|
7+
| 0.1 | | Kebo Liu | Initial version |
8+
9+
10+
11+
## 1. Overview of the system health monitor
12+
13+
The system health monitor is intended to monitor both critical services and peripheral device status, and to use the system log, the system status LED, and CLI command output to indicate the system status.
14+
15+
The current SONiC implementation already has Monit, which monitors the status of critical services, and a set of daemons (psud, thermalctld, etc.) inside PMON that collect peripheral device status.
16+
17+
The system health monitoring service will not monitor the critical services or devices directly. It will reuse the results of Monit and the PMON daemons to summarize the current status and decide the color of the system status LED.
18+
19+
### 1.1 Services under Monit monitoring
20+
21+
Monit currently monitors the following services and file systems:
22+
23+
admin@sonic# monit summary -B
24+
Monit 5.20.0 uptime: 1h 6m
25+
Service Name Status Type
26+
sonic Running System
27+
rsyslog Running Process
28+
telemetry Running Process
29+
dialout_client Running Process
30+
syncd Running Process
31+
orchagent Running Process
32+
portsyncd Running Process
33+
neighsyncd Running Process
34+
vrfmgrd Running Process
35+
vlanmgrd Running Process
36+
intfmgrd Running Process
37+
portmgrd Running Process
38+
buffermgrd Running Process
39+
nbrmgrd Running Process
40+
vxlanmgrd Running Process
41+
snmpd Running Process
42+
snmp_subagent Running Process
43+
sflowmgrd Running Process
44+
lldpd_monitor Running Process
45+
lldp_syncd Running Process
46+
lldpmgrd Running Process
47+
redis_server Running Process
48+
zebra Running Process
49+
fpmsyncd Running Process
50+
bgpd Running Process
51+
staticd Running Process
52+
bgpcfgd Running Process
53+
root-overlay Accessible Filesystem
54+
var-log Accessible Filesystem
55+
56+
57+
By default, any of the above services or file systems not being in a good status is considered a fault condition.
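As an illustration of how the summary above could be consumed, here is a minimal Python sketch that parses `monit summary -B` text and lists the entries that are not in a good status. The function names are hypothetical, and treating only "Running" and "Accessible" as good is an assumption taken from the sample output; the sketch also assumes the service name and type are each a single word.

```python
# Statuses treated as healthy. Assumption: taken from the sample output above.
GOOD_STATUSES = {"Running", "Accessible"}


def parse_monit_summary(output):
    """Return {service_name: status} for each row of `monit summary -B` text."""
    statuses = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the "Monit ... uptime" banner and the column header.
        if not line or line.startswith("Monit ") or line.startswith("Service Name"):
            continue
        parts = line.split()
        if len(parts) < 3:
            continue
        # First word is the name, last word is the type, the rest is the status
        # (a status such as "Does not exist" spans several words).
        statuses[parts[0]] = " ".join(parts[1:-1])
    return statuses


def find_faulty_services(output, ignored=()):
    """Return the sorted names whose status is not good and that are not ignored."""
    return sorted(
        name
        for name, status in parse_monit_summary(output).items()
        if status not in GOOD_STATUSES and name not in ignored
    )
```

Passing the ignore list as a parameter keeps the parsing independent of where the configuration comes from.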
58+
59+
### 1.2 Peripheral devices status which could impact the system health status
60+
61+
- Any fan is missing/broken
62+
- Fan speed is below minimal range
63+
- PSU power voltage is out of range
64+
- PSU temperature is too hot
65+
- PSU is in bad status
66+
- ASIC temperature is too hot
67+
68+
### 1.3 Customization of monitored critical services and devices
69+
70+
#### 1.3.1 Ignore some of the monitored critical services and devices
71+
The list of monitored critical services and devices can be customized through a configuration file, in which the user can exclude some services or device sensors from the monitor list. The system health monitor will load this configuration file at its next run and ignore those services or devices during the routine check.
72+
```json
73+
{
74+
"services_to_ignore": ["snmpd","snmp_subagent"],
75+
"devices_to_ignore": ["psu","fan.speed","fan1", "fan2.speed"]
76+
}
77+
```
78+
79+
The filter string is case sensitive. Currently, the following filters are supported:
80+
81+
- `<service_name>`: ignore a specific service, for example, "orchagent", "snmpd", "telemetry"
82+
- `asic`: ignore all ASIC checks
83+
- `fan`: ignore all fan checks
84+
- `fan.speed`: ignore the speed check for all fans
85+
- `<fan_name>`: ignore all checks for a specific fan
86+
- `<fan_name>.speed`: ignore the speed check for a specific fan
87+
- `psu`: ignore all PSU checks
88+
- `psu.temperature`: ignore the temperature check for all PSUs
89+
- `psu.voltage`: ignore the voltage check for all PSUs
90+
- `<psu_name>`: ignore all checks for a specific PSU
91+
- `<psu_name>.temperature`: ignore the temperature check for a specific PSU
92+
- `<psu_name>.voltage`: ignore the voltage check for a specific PSU
93+
94+
The default filter is to filter nothing. Unknown filters are silently ignored. The "services_to_ignore" and "devices_to_ignore" sections must each be a string array; otherwise the default filter is used.
95+
96+
This configuration file is platform specific and shall be added to the platform folder (/usr/share/sonic/device/{platform_name}/system_health_monitoring_config.json).
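The filter rules described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the check name "presence" used in the usage note is invented for the example. Only the filter strings themselves and the fallback to the default filter come from the text above.

```python
import json


def _string_list(value):
    """Return value if it is a list of strings, else the default (empty) filter."""
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    return []


def load_ignore_lists(config_text):
    """Parse the config JSON and return (services_to_ignore, devices_to_ignore)."""
    config = json.loads(config_text)
    if not isinstance(config, dict):
        return [], []
    return (
        _string_list(config.get("services_to_ignore")),
        _string_list(config.get("devices_to_ignore")),
    )


def is_check_ignored(category, name, check, devices_to_ignore):
    """Decide whether one check on one device is filtered out.

    category: device class such as "fan" or "psu"
    name:     device name such as "fan1"
    check:    check name such as "speed" or "voltage"
    Matching is case sensitive; filters that match nothing have no effect.
    """
    candidates = {category, name, f"{category}.{check}", f"{name}.{check}"}
    return any(candidate in devices_to_ignore for candidate in candidates)
```

With the example configuration above, `is_check_ignored("fan", "fan3", "speed", devices)` is true because of the `fan.speed` filter, while a hypothetical `"presence"` check on `fan3` is still performed.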
97+
98+
#### 1.3.2 Extend the monitoring by adding user-specific programs to Monit
99+
Monit supports checking the exit status of a program (script). If users want to monitor something beyond the critical services, or a special device not included in the list above, they can provide a script and add it to the Monit check list; the result can then also be collected by the system health monitor. Adding an external checker requires two steps.
100+
101+
1. Prepare a program whose command line output has the following format:
102+
103+
```
104+
<category_name>
105+
<item_name1>:<item_status1>
106+
<item_name2>:<item_status2>
107+
```
108+
109+
2. Add the command line string to configuration:
110+
111+
```json
112+
{
113+
"external_checkers": ["program_name -option1 value1 -option2 value2"]
114+
}
115+
```
116+
117+
For example, suppose there is a Python script "my_external_checker.py" whose output is:
118+
119+
```
120+
MyCategory
121+
device1:OK
122+
device2:device2 is out of power
123+
```
124+
125+
The configuration shall be:
126+
127+
```json
128+
{
129+
"external_checkers": ["python my_external_checker.py"]
130+
}
131+
```
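The output format required from an external checker can be parsed with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and treating the literal status "OK" as healthy and any other text as the fault description is an assumption inferred from the example output, not something the design states explicitly.

```python
def parse_checker_output(output):
    """Parse external checker output into (category, {item_name: item_status})."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("external checker produced no output")
    category = lines[0]
    items = {}
    for line in lines[1:]:
        # Split on the first colon only, so the status text may contain colons.
        name, separator, status = line.partition(":")
        if not separator:
            raise ValueError("malformed item line: " + line)
        items[name.strip()] = status.strip()
    return category, items


def failed_items(items):
    """Return only the items whose status is not "OK" (assumed healthy marker)."""
    return {name: status for name, status in items.items() if status != "OK"}
```

For the "my_external_checker.py" output above, this yields the category "MyCategory" and a single failed item, "device2".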
132+
133+
### 1.4 System status LED color definition
134+
135+
The default system status LED color definition is:
136+
137+
| Color | Status | Description |
138+
|:----------------:|:-------------:|:-----------------------:|
139+
| Off | off | no power |
140+
| Blinking amber | boot up | switch is booting up |
141+
| Red | fault | in fault status |
142+
| Green | normal | in normal status |
143+
144+
Because different vendors' platforms may have different LED color capabilities, the LED color for each status is also configurable:
145+
146+
```json
147+
{
148+
"led_color": {
149+
"fault": "amber",
150+
"normal": "green",
151+
"booting": "orange_blink"
152+
}
153+
}
154+
```
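One way to apply such overrides is to merge them over a table of defaults, as in the Python sketch below. The default color strings ("red", "green", "amber_blink") and the function name are assumptions made for illustration; the design only gives the default colors in words and does not define their string form.

```python
# Assumed string names for the default colors in the table above.
DEFAULT_LED_COLORS = {
    "fault": "red",
    "normal": "green",
    "booting": "amber_blink",
}


def resolve_led_colors(config):
    """Return the status-to-color mapping after applying "led_color" overrides."""
    colors = dict(DEFAULT_LED_COLORS)
    overrides = config.get("led_color", {})
    if isinstance(overrides, dict):
        for status, color in overrides.items():
            # Only known statuses with string values are taken from the config.
            if status in colors and isinstance(color, str):
                colors[status] = color
    return colors
```

A platform that overrides only "fault" keeps the defaults for the other two statuses.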
155+
156+
157+
## 2. System health monitor service business logic
158+
159+
The system health monitor daemon will run on the host and periodically (every 60s) check the "monit summary" command output and the PSU, fan, and thermal status stored in the state DB. If anything is wrong with the services monitored by Monit or with the peripheral devices, the system status LED will be set to the fault status. When the fault condition is cleared, the system status LED will be set to the normal status.
160+
161+
Until the switch finishes booting, the system health monitoring service shall be able to tell that the switch is in boot up status (see open question 1).
162+
163+
If the Monit service is not available, the system is considered to be in a fault condition.
164+
Unavailable FAN/PSU/ASIC data is also considered a fault condition.
165+
Incomplete data in the DB is also considered a fault condition, e.g., PSU voltage data is present but the threshold data is not available.
166+
167+
Monit, thermalctld and psud raise syslog messages when a fault condition is encountered, so the system health monitor will only generate general syslog messages in these situations to avoid redundancy. For example, "system health status change to fault" can be printed when a fault condition is met, and "system health status change to normal" when it recovers.
168+
169+
This service will be started after system boot up (after database.service and updategraph.service).
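The decision rules in this section can be condensed into a small Python sketch. The function names and the status strings "booting", "fault" and "normal" are assumptions chosen to match the LED configuration keys; the two syslog messages are the ones quoted above.

```python
def decide_led_status(is_booting, monit_available, service_faults, device_faults):
    """Map the collected check results to one of the LED statuses."""
    if is_booting:
        return "booting"
    # An unavailable Monit, or any service or device fault, means fault.
    if not monit_available or service_faults or device_faults:
        return "fault"
    return "normal"


def status_change_message(previous, current):
    """Return the general syslog text for a status transition, or None."""
    if previous == current or current not in ("fault", "normal"):
        return None
    return "system health status change to " + current
```

Logging only on transitions is consistent with the stated goal of avoiding messages that duplicate what Monit, thermalctld and psud already report.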
170+
171+
## 3. System health data in redis database
172+
173+
The system health service will populate system health data to the STATE DB. A new table "SYSTEM_HEALTH_INFO" will be created in the STATE DB.
174+
175+
; Defines information for a system health
176+
key = SYSTEM_HEALTH_INFO ; health information for the switch
177+
; field = value
178+
summary = STRING ; summary status for the switch
179+
<item_name> = STRING ; an entry for a service or device
180+
181+
An item is stored in the DB only if it is abnormal. Here is an example:
182+
183+
```
184+
admin@sonic:~$ redis-cli -n 6 hgetall SYSTEM_HEALTH_INFO
185+
1) "fan1"
186+
2) "fan1 speed is out of range, speed=21.0, range=[24.0,36.0]"
187+
3) "fan3"
188+
4) "fan3 speed is out of range, speed=21.0, range=[24.0,36.0]"
189+
5) "fan5"
190+
6) "fan5 speed is out of range, speed=22.0, range=[24.0,36.0]"
191+
7) "fan7"
192+
8) "fan7 speed is out of range, speed=21.0, range=[24.0,36.0]"
193+
9) "summary"
194+
10) "Not OK"
195+
```
196+
197+
If the system status is good, the data in redis looks like:
198+
199+
```
200+
admin@sonic:~$ redis-cli -n 6 hgetall SYSTEM_HEALTH_INFO
201+
1) "summary"
202+
2) "OK"
203+
```
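The two redis examples above suggest a simple way to build the hash content: copy the abnormal items and add a "summary" field. The Python sketch below is an assumption-laden illustration with hypothetical function names. In particular, deleting fields of items that have recovered is an inference from the rule that only abnormal items are stored; the design does not spell it out.

```python
def build_health_info(abnormal_items):
    """Build the SYSTEM_HEALTH_INFO field/value mapping from abnormal items."""
    info = dict(abnormal_items)
    info["summary"] = "Not OK" if abnormal_items else "OK"
    return info


def stale_fields(previous_info, current_info):
    """Return fields present in the previous mapping but absent from the new one.

    These would have to be removed from the hash so that recovered items
    no longer appear as abnormal.
    """
    return sorted(set(previous_info) - set(current_info))
```

Keeping these helpers free of any database calls makes them easy to exercise without a running redis instance.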
204+
205+
## 4. Platform API and PMON related change to support this new service
206+
207+
So that the system status LED can be set by this new service, a system status LED object needs to be added to the Chassis class. This system status LED object shall be initialized when the platform API is loaded from the host side.
208+
209+
psud needs to collect more PSU data into the DB to satisfy the requirements of this new service. More specifically, psud needs to collect the PSU output voltage, temperature, and their thresholds.
210+
211+
; Defines information for a psu
212+
key = PSU_INFO|psu_name ; information for the psu
213+
; field = value
214+
presence = BOOLEAN ; presence of the psu
215+
model = STRING ; model name of the psu
216+
serial = STRING ; serial number of the psu
217+
status = BOOLEAN ; status of the psu
218+
change_event = STRING ; change event of the psu
219+
fan = STRING ; fan_name of the psu
220+
led_status = STRING ; led status of the psu
221+
temp = INT ; temperature of the PSU
222+
temp_th = INT ; temperature threshold
223+
voltage = INT ; output voltage of the PSU
224+
voltage_max_th = INT ; max threshold of the output voltage
225+
voltage_min_th = INT ; min threshold of the output voltage
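A health check over one such PSU_INFO entry might look like the Python sketch below. It is a hedged illustration: the function names are hypothetical, the boolean fields are assumed to be stored as the strings "true"/"false", and the returned messages reuse the CLI output strings listed in section 5. Missing or unparsable numeric fields are reported as unavailable data, following the rule that incomplete data is a fault condition.

```python
NUMERIC_FIELDS = ("temp", "temp_th", "voltage", "voltage_max_th", "voltage_min_th")


def _to_number(entry, field):
    """Return the field as a float, or None if it is missing or not numeric."""
    try:
        return float(entry[field])
    except (KeyError, TypeError, ValueError):
        return None


def check_psu(name, entry):
    """Return the list of fault messages for one PSU_INFO entry (empty if healthy)."""
    if not entry:
        return ["PSU data is not available"]
    # Assumption: the BOOLEAN status field is stored as the string "true"/"false".
    if str(entry.get("status", "")).lower() != "true":
        return [name + " is broken"]
    values = {field: _to_number(entry, field) for field in NUMERIC_FIELDS}
    if any(value is None for value in values.values()):
        # Incomplete data, e.g. a voltage without its thresholds, is a fault.
        return ["PSU data is not available"]
    faults = []
    if values["temp"] > values["temp_th"]:
        faults.append(name + " is overheated")
    if not values["voltage_min_th"] <= values["voltage"] <= values["voltage_max_th"]:
        faults.append(name + " voltage is out of range")
    return faults
```

The values are parsed as floats even though the schema declares INT, so the check tolerates either representation.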
226+
227+
## 5. System health monitor CLI
228+
229+
Add a new "show system-health" command line to the system:
230+
231+
admin@sonic# show ?
232+
Usage: show [OPTIONS] COMMAND [ARGS]...
233+
234+
SONiC command line - 'show' command
235+
236+
Options:
237+
-?, -h, --help Show this message and exit.
238+
239+
Commands:
240+
...
241+
startupconfiguration Show startup configuration information
242+
subinterfaces Show details of the sub port interfaces
243+
system-memory Show memory information
244+
system-health Show system health status
245+
...
246+
247+
The "show system-health" CLI has three sub commands: "summary", "detail" and "monitor-list". The "summary" command gives a brief output of the system health status, while "detail" is more verbose.
248+
The "monitor-list" command lists all the services and devices under monitoring.
249+
250+
admin@sonic# show system-health ?
251+
Usage: show system-health [OPTIONS] COMMAND...
252+
253+
SONiC command line - 'show system-health' command
254+
255+
Options:
256+
-?, -h, --help Show this message and exit.
257+
258+
Commands:
259+
summary Show system-health summary information
260+
detail Show system-health detail information
261+
monitor-list Show system-health monitored services and devices name list
262+
263+
The output looks like this:
264+
265+
When everything is OK:
266+
267+
admin@sonic# show system-health summary
268+
System status LED green
269+
Services OK
270+
Hardware OK
271+
272+
When something is wrong:
273+
274+
admin@sonic# show system-health summary
275+
System status LED amber
276+
Services Fault
277+
orchagent is not running
278+
Hardware Fault
279+
PSU 1 temp 85C and threshold is 70C
280+
FAN 2 is broken
281+
282+
The "detail" sub command output gives the status of all services and devices under monitoring, and also displays the list of ignored services and devices.
283+
284+
The "monitor-list" sub command gives a name list of the monitored services and devices, excluding the ones in the ignore list.
285+
286+
When the CLI is called, it directly analyzes the "monit summary" output and the state DB entries to present a summary of the system health status. The status analysis logic of the CLI shall be aligned/shared with the logic in the system health service.
287+
288+
Fault condition and CLI output string table:
289+
| Fault condition |CLI output |
290+
|:-----------------------:|:-------------:|
291+
| critical service failure|[service name] is [service status]|
292+
| Any fan is missing/broken |[FAN name] is missing/broken|
293+
| Fan speed is below minimal range|[FAN name] speed is lower than expected|
294+
| PSU power voltage is out of range|[PSU name] voltage is out of range|
295+
| PSU temp is too hot|[PSU name] is overheated|
296+
| PSU is in bad status|[PSU name] is broken|
297+
| ASIC temperature is too hot|[ASIC name] is overheated|
298+
| monit service is not running| monit is not running|
299+
| PSU data is not available in the DB|PSU data is not available|
300+
| FAN data is not available in the DB|FAN data is not available|
301+
| ASIC data is not available in the DB|ASIC data is not available|
302+
303+
See open question 2 for adding configuration CLIs.
304+
305+
## 6. System health monitor test plan
306+
307+
1. If some critical service is missing, check the CLI output; the LED color and error shall be as expected.
308+
2. Simulate PSU/FAN/ASIC and related sensor failures via mock sysfs and check the CLI output; the LED color and error shall be as expected.
309+
3. Change the monitored service/device list, then check whether the system health monitor service works as expected; also check whether the result of "show system-health monitor-list" is aligned.
310+
311+
## 7. Open Questions
312+
313+
1. How to determine whether the SONiC system is in the boot up stage? The current design is to compare the system up time with a "boot_timeout" value. The system up time is obtained from "cat /proc/uptime". The default "boot_timeout" is 300 seconds and can be changed in the configuration file. The system health service will not do any checks until the SONiC system finishes booting.
314+
315+
```json
316+
{
317+
"boot_timeout": 300
318+
}
319+
```
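The boot check described in this open question can be sketched in Python as follows. The function names are hypothetical; the only facts taken from the design are the use of /proc/uptime and the 300 second default for "boot_timeout". On Linux, the first field of /proc/uptime is the system up time in seconds.

```python
DEFAULT_BOOT_TIMEOUT = 300  # seconds, the default stated in the design


def parse_uptime(text):
    """Return the up time in seconds from the content of /proc/uptime."""
    # The file holds two numbers; the first one is the up time.
    return float(text.split()[0])


def read_uptime(path="/proc/uptime"):
    """Read the current system up time in seconds (Linux only)."""
    with open(path) as uptime_file:
        return parse_uptime(uptime_file.read())


def is_booting(uptime_seconds, boot_timeout=DEFAULT_BOOT_TIMEOUT):
    """Treat the system as still booting until the timeout has elapsed."""
    return uptime_seconds < boot_timeout
```

A caller would skip all health checks while `is_booting(read_uptime(), boot_timeout)` is true, where `boot_timeout` comes from the configuration file.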
