[image_config] add rasdaemon.timer #14300
rasdaemon is a tool to log hardware errors. It takes 100% CPU during boot for a few seconds. It impacts fast/warm boot by delaying control plane restoration for 5 sec on some platforms.
Signed-off-by: Stepan Blyschak <[email protected]>
LGTM. @yxieca, who can approve it?
@@ -437,6 +437,11 @@ sudo cp $IMAGE_CONFIGS/corefile_uploader/core_uploader.py $FILESYSTEM_ROOT/usr/b
sudo cp $IMAGE_CONFIGS/corefile_uploader/core_analyzer.rc.json $FILESYSTEM_ROOT_ETC_SONIC/
sudo chmod og-rw $FILESYSTEM_ROOT_ETC_SONIC/core_analyzer.rc.json

# Rasdaemon service configuration. Use timer to start rasdaemon with a delay for better fast/warm boot performance
Is there a downside here as well? Can the device potentially lose the hardware/memory errors?
I think the answer is yes. In that case, we should do this only where strictly necessary. Is there any merit in doing this for anything other than warmboot?
For the other cases (cold/fast/load-mg/config-reload): memory/hardware errors are more likely to be hit during bootup and may indicate bad hardware. We want such errors to be logged while the system is booting up.
@vaibhavhd
load-mg/config-reload: rasdaemon is not touched.
warm/fast: we must also delay it for fast reboot. It is the same kexec operation, so it is the same from a CPU/memory perspective.
Please note that we don't have a way to apply a delay per boot type; e.g. pmon, snmp, and lldp are delayed regardless of the boot type.
rasdaemon was added as a replacement for mcelog. MCE exceptions are recorded in a kernel ring buffer, so they aren't lost if rasdaemon reads the buffer later (or at least this is how mcelog worked). rasdaemon states that it reads not just /dev/mce but several sources: EDAC, MCE, PCI, ... I don't know how to test whether events are lost, as I don't know a way to generate these exceptions. As far as I understand, at least MCE exceptions should not be missed.
@saiarcot895 FYI
@vaibhavhd could you please help to review?
Cherry-pick PR to 202205: #14692
Cherry-pick PR to 202211: #14762
rasdaemon is a tool to log hardware errors. It takes 100% CPU during boot for a few seconds. It impacts fast/warm boot by delaying control plane restoration for 5 sec on some platforms.
Why I did it
Improve fast/warm boot performance.
How I did it
Added a rasdaemon timer.
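For illustration, a timer of this kind could look roughly like the sketch below, written as a shell snippet in the style of the build script. The unit text, the 90-second delay, and the scratch staging directory are assumptions made for the example, not the values merged in this PR.

```shell
#!/bin/sh
# Sketch only: stage a hypothetical rasdaemon.timer in a scratch directory.
# The unit contents and the 90 s delay are illustrative assumptions.
set -eu

unit_dir="$(mktemp -d)"

cat > "$unit_dir/rasdaemon.timer" <<'EOF'
[Unit]
Description=Delay rasdaemon start to speed up fast/warm boot

[Timer]
# Start rasdaemon.service this long after boot (assumed value).
OnBootSec=90s
Unit=rasdaemon.service

[Install]
WantedBy=timers.target
EOF

echo "staged $unit_dir/rasdaemon.timer"

# In an image build, the service itself would be disabled and the timer
# enabled instead, for example (not run here):
#   systemctl disable rasdaemon.service
#   systemctl enable rasdaemon.timer
```

With a timer like this, rasdaemon no longer competes for CPU during the first seconds after kexec; it starts once the configured delay has elapsed.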
How to verify it
Perform fast reboot control plane measurement and observe 5 sec improvement.
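As a complementary check, one could confirm that rasdaemon really started late by reading the time at which the service became active. This is a sketch, assuming a systemd version that supports `systemctl show --value`; the helper name `usec_to_sec` is hypothetical. It does not replace the control plane measurement itself.

```shell
#!/bin/sh
# Sketch: convert a systemd monotonic timestamp (microseconds) to seconds.
set -eu

# usec_to_sec is a hypothetical helper, not part of any SONiC tooling.
usec_to_sec() {
    awk -v us="$1" 'BEGIN { printf "%.3f\n", us / 1000000 }'
}

# On a booted device one would feed it the real value, e.g. (not run here):
#   usec_to_sec "$(systemctl show rasdaemon.service \
#       --property=ActiveEnterTimestampMonotonic --value)"
usec_to_sec 95250000   # prints 95.250
```

A value close to the configured timer delay suggests the service was held back as intended.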