-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PSU & system health] Support PSU power threshold checking #1060
[PSU & system health] Support PSU power threshold checking #1060
Conversation
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
@prgeor kinldy reminder to review and provide your feedback |
Conflicts resolved. |
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line. |
@shyam77git can you review from chassis point of view? |
@prgeor the PR is approved and build passed, are you ok to merge? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenxs as discussed in the community can we have an agreement that the "warning threshold" is confusing because no warning is raised at "warning" threshold? How about renaming it to "warning-suppress" level?
- `<power>` is the current power of the PSU | ||
- `<threshold>` is the warning threshold of the PSU | ||
- Otherwise: no action | ||
|
||
## 3. DB schema for PSU | ||
|
||
PSU number is stored in chassis table. Please refer to this [document](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-enhancement-design.md), section 1.5.2. | ||
|
||
PSU information is stored in PSU table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this table for all PSUs in the system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
doc/psud/PSU_daemon_design.md
Outdated
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold | ||
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold | ||
power_critical_threshold = 1*4.3DIGIT ; The power critical threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these thresholds the same for all PSUs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For us, yes. But it's possible for other vendors to have different thresholds among all PSUs.
doc/psud/PSU_daemon_design.md
Outdated
- `Not OK` which can be caused by: | ||
- power is not good, which means the PSU is present but no power (Eg. the power is down or power cable is unplugged) | ||
- `WARNING` which can be caused by: | ||
- power exceeds the PSU's power threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which threshold? critical or warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The critical threshold. Will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
doc/psud/PSU_daemon_design.md
Outdated
|
||
def get_psu_power_warning_threshold(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a set API for user to override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not suggest having an API to override the thresholds. As discussed before, it should be controlled by the platform vendor only because users do not have full knowledge to decide what the threshold should be. For example, in our system, the thresholds depend on temperatures as well.
Signed-off-by: Stephen Sun <[email protected]>
1ac1eff
Signed-off-by: Stephen Sun <[email protected]>
voltage_max_th = INT ; max threshold of the output voltage | ||
voltage_min_th = INT ; min threshold of the output voltage | ||
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold | ||
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming not changed to supress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -275,6 +278,7 @@ When something is wrong | |||
orchagent is not running | |||
Hardware Fault | |||
PSU 1 temp 85C and threshold is 70C | |||
PSU 1 power (66.32w) exceeds thresholds (warning: 60.00w critical: 70.00w) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'thresholds' which threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's hard to say which threshold is crossed because most times it exceeds the critical threshold but sometimes it can be in (warning_suppress_threshold, critical_threshold) because of hysteresis.
I think we can just put the value of warning_suppress_threshold here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
max_power = 1*4.3DIGIT ; power capacity of the psu | ||
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold | ||
power_warning_suppress_threshold = 1*4.3DIGIT ; The power warning threshold | ||
power_critical_threshold = 1*4.3DIGIT ; The power critical threshold | ||
|
||
Now psud only collect and update "presence" and "status" field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need one field in the DB which indicates the critical threshold warning is ACTIVE. this is needed for telemetry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
power_overload is the field to indicate whether it’s in the warning state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenxs its not clear how is the user notified? if syslog how many times syslog is raised?
Signed-off-by: Stephen Sun <[email protected]>
Fix comments
syslog. it is printed only once when the alarm is raised and cleared. |
Signed-off-by: Stephen Sun <[email protected]>
doc/psud/PSU_daemon_design.md
Outdated
4. PSU's power was in range (warning threshold, critical threshold) and is greater than the critical threshold | ||
1. if warning was raised, no action expected | ||
2. if warning was not raised, a warning should be raised | ||
5. PSU's power was less than the warning threshold and is greater than the critical threshold: a warning should be raised |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to test the case that the PSU power jumps from a value below the warning suppress threshold to a value above critical threshold.
voltage = INT ; output voltage of the PSU | ||
voltage_max_th = INT ; max threshold of the output voltage | ||
voltage_min_th = INT ; min threshold of the output voltage | ||
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me check if this should be INT for telemetry
Signed-off-by: Stephen Sun <[email protected]>
doc/psud/PSU_daemon_design.md
Outdated
We use asymmetric thresholds between raising and clearing the alarm for the purpose of creating a hysteresis and avoiding alarm flapping. | ||
|
||
- an alarm will be raised when a PSU's power is rising accross the critical threshold | ||
- an alarm will be cleared when a PSU's power is dropping across the warning threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning threshold -> 'warning suppress' threshold
doc/psud/PSU_daemon_design.md
Outdated
|
||
1. Retrieve the current power | ||
2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold | ||
- If `current power` < `warning threshold` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning-suppress threshold
doc/psud/PSU_daemon_design.md
Outdated
2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold | ||
- If `current power` < `warning threshold` | ||
- Set `PSU power exceeded threshold` to `false` | ||
- Message in NOTICE level should be logged: `PSU <x>: current power <power> is below the warning threshold <threshold>` where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning-suppress threshold
doc/psud/PSU_daemon_design.md
Outdated
`psuutil` fetches the information via calling platform API directly. Both warning and critical thresholds will be exposed in the output of psuutil status. | ||
The "WARNING" state is not exposed because psuutil is a one-time command instead of a daemon, which means it does not store state information. It fetches information via calling platform API so it can not distinguish the following status: | ||
|
||
1. The power exceeded the critical threshold but is in the range between the warning and critical thresholds, which means the alarm should be raised | ||
2. The power didn't exceed the critical threshold and exceeds the warning threshold, which means the alarm should not be raised | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning -> 'warning-suppress'
Signed-off-by: Stephen Sun <[email protected]>
PRs
Support PSU power threshold checking
None
is returned orNotImplemented
is thrown by either API, the PSU power threshold checking will not be performed for the PSUSigned-off-by: Stephen Sun [email protected]