[image_config] add rasdaemon.timer #14300

Merged 1 commit on Apr 17, 2023

stepanblyschak
Collaborator

rasdaemon is a tool that logs hardware errors. During boot it uses 100% CPU for a few seconds, which delays control plane restoration by 5 seconds on some platforms during fast/warm boot.

Why I did it

Improve fast/warm boot performance.

How I did it

Added a rasdaemon.timer systemd unit so that rasdaemon starts with a delay instead of directly at boot.
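
For readers unfamiliar with systemd timers, the idea can be sketched roughly as below. This is an illustration only, not the unit file from this PR: the staging directory `UNIT_DIR`, the unit description, and the 90-second delay are all assumed values chosen for the example.

```shell
#!/bin/sh
# Hypothetical sketch: write a rasdaemon.timer unit into a staging directory.
# UNIT_DIR and the 90s delay are illustrative assumptions, not values from this PR.
set -eu
UNIT_DIR="${UNIT_DIR:-./staging/systemd}"
mkdir -p "$UNIT_DIR"

cat > "$UNIT_DIR/rasdaemon.timer" <<'EOF'
[Unit]
Description=Delay rasdaemon start until after boot

[Timer]
# Start rasdaemon.service this long after the system booted (assumed value)
OnBootSec=90s
Unit=rasdaemon.service

[Install]
WantedBy=timers.target
EOF

# On the target image the service itself would be disabled and the timer
# enabled, so that systemd starts rasdaemon only when the timer fires:
#   systemctl disable rasdaemon.service
#   systemctl enable rasdaemon.timer
echo "wrote $UNIT_DIR/rasdaemon.timer"
```

With a unit like this, the service no longer competes for CPU during the early boot window; it is started once the timer elapses.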

How to verify it

Perform a fast-reboot control plane measurement and observe a 5-second improvement.
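
Separately from the control plane measurement, one possible sanity check on a running device is to confirm that rasdaemon is started by the timer rather than directly at boot. The sketch below uses only standard `systemctl` subcommands; the function name `show_rasdaemon_units` is made up for this example, and the expected states in the comments assume the timer was installed as described above.

```shell
#!/bin/sh
# Sketch: inspect how rasdaemon is scheduled on a systemd-based device.
# show_rasdaemon_units is a hypothetical helper name used only in this example.
show_rasdaemon_units() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; run this on the target device"
        return 0
    fi
    # Show when the timer last fired and when it will fire next
    systemctl list-timers --all rasdaemon.timer || true
    # Expected to print "disabled" if the service is started only via the timer
    systemctl is-enabled rasdaemon.service || true
    # Expected to print "enabled"
    systemctl is-enabled rasdaemon.timer || true
}

show_rasdaemon_units
```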

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

rasdaemon is a tool to log hardware errors. It takes 100% CPU during
boot for a few seconds. It impacts fast/warm boot by delaying control
plane restoration for 5 sec on some platforms.

Signed-off-by: Stepan Blyschak <[email protected]>
@stepanblyschak stepanblyschak requested a review from lguohan as a code owner March 17, 2023 17:20
@liat-grozovik
Collaborator

LGTM, @yxieca who can approve it?

@qiluo-msft qiluo-msft requested a review from vaibhavhd March 27, 2023 07:39
@@ -437,6 +437,11 @@ sudo cp $IMAGE_CONFIGS/corefile_uploader/core_uploader.py $FILESYSTEM_ROOT/usr/b
sudo cp $IMAGE_CONFIGS/corefile_uploader/core_analyzer.rc.json $FILESYSTEM_ROOT_ETC_SONIC/
sudo chmod og-rw $FILESYSTEM_ROOT_ETC_SONIC/core_analyzer.rc.json

# Rasdaemon service configuration. Use timer to start rasdaemon with a delay for better fast/warm boot performance
Contributor

Is there a downside here as well? Can the device potentially lose the hardware/memory errors?

I think the answer is yes. In that case, we should do this only where it is strictly needed. Is there any merit in doing this for anything other than warm boot?

For the other cases (cold/fast/load-mg/config-reload): memory/hardware errors are more likely to be hit during bootup and may indicate bad hardware. We want such errors to be logged while the system is booting up.

Collaborator Author

@vaibhavhd
load-mg/config-reload: rasdaemon is not touched.
warm/fast: we must also delay it for fast reboot. It is the same kexec operation, so it is the same from a CPU/memory perspective.
Please note that we don't have a way to apply the delay per boot type; e.g. pmon, snmp, and lldp are delayed regardless of the boot type.

rasdaemon was added as a replacement for mcelog. MCE exceptions are recorded in a kernel ring buffer, so they aren't lost if rasdaemon reads it later (or at least this is how mcelog worked). rasdaemon states that it reads not just /dev/mce but several sources: EDAC, MCE, PCI, ... I don't know how to test whether events are lost, because I don't know a way to generate these exceptions. In my understanding, at least MCE exceptions should not be missed.

@liat-grozovik
Collaborator

@saiarcot895 FYI

@liat-grozovik
Collaborator

@vaibhavhd could you please help to review?

@yxieca yxieca merged commit d73c810 into sonic-net:master Apr 17, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 18, 2023
@mssonicbld
Collaborator

Cherry-pick PR to 202205: #14692

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 20, 2023
@mssonicbld
Collaborator

Cherry-pick PR to 202211: #14762
