Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SONiC core dump utility #3499

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

rajendra-dendukuri
Copy link
Contributor

- What I did

Added new way to collect and manage application core files.

- How I did it
Core files generated by kernel are created by the host o/s and also stored
on host o/s. This applies for processes running inside container as well.
Containers do not have access to journal as well as core files. Containers
are supposed to have limited access to host o/s and core files and journal
may contain sensitive information.

To improve debugging of crashes inside a container following changes are made:

  • Install systemd-coredump in base o/s

  • Remove existing simple coredump facility

  • Enable persistent journald to store coredump history

  • Minimal default coredump configuration

  • Added a coredump-config service to generate coredump configuration

  • when INSTALL_DEBUG_TOOLS=y is set in the build. systemd-coredump tool
    is installed in all containers beside gdb

  • When SONIC_DEBUGGING_ON=y is set in the build,
    /var/log/journal and /var/lib/systemd/coredump are mapped inside container

To inspect a core file, from a container shell, issue below commands

docker exec -ti coredumpctl

- How to verify it
kill -ABRT
coredumpctl list
coredumpctl info

- Description for the changelog

Use systemd-coredump for core file management in SONiC

- A picture of a cute animal (not mandatory but encouraged)

- Install systemd-coredump in base o/s
- Remove existing simple coredump facility
- Enable persistent journald to store coredump history
- Minimal default coredump configuration
- Added a coredump-config service to generate coredump configuration

Core files generated by kernel are created by the host o/s and also stored
on host o/s. This applies for processes running inside container as well.
Containers do not have access to journal as well as core files. Containers
are supposed to have limited access to host o/s and core files and journal
may contain sensitive information.

Toimprove debugging of crashes inside a container following changes are made:

- when INSTALL_DEBUG_TOOLS=y is set in the build. systemd-coredump tool
is installed in all containers beside gdb

- When SONIC_DEBUGGING_ON=y is set in the build,
  /var/log/journal and /var/lib/systemd/coredump are mapped inside container

To inspect a core file, from a container shell, issue below commands

docker exec -ti <container-name> /bin/bash
@rajendra-dendukuri
Copy link
Contributor Author

Refer to
sonic-net/SONiC#468

slave.mk Outdated
@@ -637,6 +637,7 @@ $(addprefix $(TARGET_PATH)/, $(SONIC_INSTALLERS)) : $(TARGET_PATH)/% : \
export sonic_asic_platform="$(patsubst %-$(CONFIGURED_ARCH),%,$(CONFIGURED_PLATFORM))"
export enable_organization_extensions="$(ENABLE_ORGANIZATION_EXTENSIONS)"
export enable_dhcp_graph_service="$(ENABLE_DHCP_GRAPH_SERVICE)"
export sonic_debugging_on="$(SONIC_DEBUGGING_ON)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason to separate this from INSTALL_DEBUG_TOOLS? Maybe we can combine them both into a BUILD_DEBUG_IMAGE flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will always be cases where we would want to do something separate for INSTALL_DEBUG_TOOLS compared to SONIC_DEBUGGING_ON. The idea I followed was, INSTALL_DEBUG_TOOLS will install additional debug tools (coredumpctl) installed with in the container. This adds some size to the container. So users may not want to do this bit and do it if they really want to debug using gdb.

SONIC_DEBUGGING_ON, will allow the container to have access to /var/log/journal so that it can find matching core reports.

You are correct we will need both for running the gdb, but wanted to keep them separate to give flexibility to the user.


DISABLE_COREDUMP_CONF="/etc/sysctl.d/50-disable-coredump.conf"

if [ "$(redis-cli -n 4 HGET "COREDUMP|config" "enabled")" = "false" ] ; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If key is present && carry value as false, then we disable. In other words the default behavior is "enabled=true". To disable, one has to create this key explicitly.

Instead, why not require a key for disabling, which would imply default is enabled.

"COREDUMP|disabled" == true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is common place to see configuration knobs with a positive intent. So if we flip this around, it may confuse the user with other usages of true/false kind of configurations.

DISABLE_COREDUMP_CONF="/etc/sysctl.d/50-disable-coredump.conf"

if [ "$(redis-cli -n 4 HGET "COREDUMP|config" "enabled")" = "false" ] ; then
echo "kernel.core_pattern=" > ${DISABLE_COREDUMP_CONF}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean ?

Can you please explain the impact of disable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the coredump admin mode is disabled in config db, core files will not be generated. We are creating a sysctl entry to disable core dump.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me a use case, where you would not want core file dump ?

Core file dump implies that some unexpected error occurred or user explicitly creating one with kill for a reason. In either case, the dump is required for analysis.

If this is the only purpose of coredump-config.service, I don't see a need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. corefiles can be a space hog
  2. Multiple corefiles may be generated if a process is in infinite loop
  3. corefiles may contain sensitive information, so some applications may not want it to be recorded.
  4. We plan to re-use coredump-config.service for enable/disable of kernel coredump as well. Also there may be additional parameters that you would like to configure w.r.t core files (e.g limit on the size of core file). Current mode of operation is we chose some fixed numbers. But future extensions may make them configurable. To start with, we are providing a framework to enable/disable the feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already another initiative to do core file rotation and limit the count of core files at any time per process. Turning off is definitely not the solution.

The limitation on count / size is also from Broadcom only. I need to look for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact you are referring that PR #468, which has the following.

" a. Support per-process core file rotation and archiving to optimize disk space "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am referring to various configurations provided by systemd-coredump. Below is a link to it.
https://www.freedesktop.org/software/systemd/man/coredump.conf.html
User's might want some bits of it be part of ConfigDB.

The PR I was referring to is PR#729 which enables kernel core dump feature. For this feature, it is desirable that users have an enable/disable knob as kdump requires dedicated 512MB of memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add Guohan to this thread. What we need is the "requirement/use case"? I don't see one.

I find it rather risky to have a DB variable to control, as it could get saved and persist across reboots, transparently, which is a big risk as it is likely to block core dumps unintentionally.

I would rather have this ability as a CLI tool, which disables temporarily and all should be back to default/enabled state upon reboot.

What we do need is the ability to limit count of cores per process and overall disk size taken by cores.

@@ -11,7 +11,9 @@ VIM = vim
OPENSSH = openssh-client
SSHPASS = sshpass
STRACE = strace
$(DOCKER_BASE_STRETCH)_DBG_IMAGE_PACKAGES += $(GDB) $(GDBSERVER) $(VIM) $(OPENSSH) $(SSHPASS) $(STRACE)
SYSTEMD_COREDUMP = systemd-coredump
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W/o this package installed, will there be core dumps created?

Plus this is already installed in build_debian.sh unconditionally. Can you please explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W/O this package installed inside the container, core dumps will still be created.

Core files are always generated on host o/s and stored in the /var/lib/systemd/coredump directory.

The application that has crashed may be part of a container. So if you want to run gdb using the coredumpctl gdb command, it will not find the application binary when executed on host o/s.

So, we map the /var/lib/systemd/coredump directory to the containers (see change in docker_image_ctl.j2) and also install the coredumpctl tool here. Now, the corefile and the handy coredumpctl tool are ready for debugging the application inside the container.

@rajendra-dendukuri
Copy link
Contributor Author

retest vsimage please

mssonicbld added a commit that referenced this pull request Feb 13, 2025
…lly (#21664)

#### Why I did it
src/sonic-swss
```
* 5031aadc - (HEAD -> 202411, origin/202411) Capability query for MACSEC ACL attribute (#3511) (31 hours ago) [mssonicbld]
* 4b357e59 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (31 hours ago) [mssonicbld]
* fe98176b - Add a delay between killing teamd processes (#3510) (2 days ago) [mssonicbld]
* e967711e - Remove RIF from m_rifsToAdd before deleting it (#3499) (6 days ago) [mssonicbld]
* 337c9a10 - Optimize counter polling interval by making it more accurate (#3500) (6 days ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit to Junchao-Mellanox/sonic-buildimage that referenced this pull request Mar 7, 2025
…lly (sonic-net#701)

#### Why I did it
src/sonic-swss
```
* bdc4fe0 - (HEAD -> 202412, origin/202412) Merge pull request #42 from r12f/code-sync-202412 (9 hours ago) [Riff]
|\ 
| failure_prs.log skip_prs.log 054f912 - Merge remote-tracking branch 'base/202411' into code-sync-202412 (18 hours ago) [r12f]
| failure_prs.log skip_prs.log e21b5a5 - Link orchagent against jemalloc (sonic-net#3530) (2 days ago) [mssonicbld]
| failure_prs.log skip_prs.log aac84b5 - Merge pull request sonic-net#3531 from mssonicbld/cherry/202411/3426 (2 days ago) [Kumaresh Perumal]
| |\ 
| | failure_prs.log skip_prs.log 272f45a - vlanmgrd not to throw exception for Portchannel ip link add because of race condition with PortChannel removal. (2 days ago) [Sonic Build Admin]
| |/ 
| failure_prs.log skip_prs.log 16f5331 - [vlanmgrd]: Fixing an issue causing mismatch between MAC and link-local IPv6 addresses of VLAN and Bridge interfaces (sonic-net#3527) (5 days ago) [mssonicbld]
| failure_prs.log skip_prs.log 4d3bae4 - Avoid removing a VRF routing table when there are pending creation entries in gRouteBulker (sonic-net#3526) (6 days ago) [mssonicbld]
| failure_prs.log skip_prs.log 9875694 - Added change not to create ECMP Group in SAI and program the route if none of ECMP members are active/link-up (sonic-net#3522) (7 days ago) [mssonicbld]
| failure_prs.log skip_prs.log 60fd735 - [BufferOrch] Use SAI bulk API to configure port, PG and queue (sonic-net#3523) (7 days ago) [mssonicbld]
| failure_prs.log skip_prs.log 5071bec - Merge pull request sonic-net#3519 from stepanblyschak/202411-fc-after-apply-view (7 days ago) [Kumaresh Perumal]
| |\ 
| | failure_prs.log skip_prs.log c3437bd - [202411][FC] process FC after apply view (9 days ago) [Stepan Blyschak]
| |/ 
| failure_prs.log skip_prs.log 5031aad - Capability query for MACSEC ACL attribute (sonic-net#3511) (2 weeks ago) [mssonicbld]
| failure_prs.log skip_prs.log 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (sonic-net#3512) (2 weeks ago) [mssonicbld]
| failure_prs.log skip_prs.log fe98176 - Add a delay between killing teamd processes (sonic-net#3510) (2 weeks ago) [mssonicbld]
| failure_prs.log skip_prs.log e967711 - Remove RIF from m_rifsToAdd before deleting it (sonic-net#3499) (3 weeks ago) [mssonicbld]
| failure_prs.log skip_prs.log 337c9a1 - Optimize counter polling interval by making it more accurate (sonic-net#3500) (3 weeks ago) [mssonicbld]
* b9b4108 - Merge pull request #41 from Azure/revert-40-cherry/msft-202412/3481 (10 hours ago) [Riff]
* 824d021 - (origin/revert-40-cherry/msft-202412/3481) Revert "[hash] add SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL to hash-field map fo…" (30 hours ago) [Riff]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants