monit errors in the logs "ERR memory_checker: [memory_checker] Failed to execute the command" #10088

Closed
nazariig opened this issue Feb 25, 2022 · 1 comment

Comments

@nazariig
Collaborator

Description

The issue is caused by a missing telemetry container.
This can happen at system start, when the docker container has not been created yet.
The root cause is that monit checks service health asynchronously, so the check can run before the container exists.

Steps to reproduce the issue:

root@r-boxer-sw01:/home/admin# systemctl stop telemetry
root@r-boxer-sw01:/home/admin# docker rm -f telemetry

root@r-boxer-sw01:/home/admin# monit restart container_memory_telemetry
root@r-boxer-sw01:/home/admin# monit status container_memory_telemetry
Monit 5.20.0 uptime: 2h 18m

Program 'container_memory_telemetry'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              1
  last output                  -
  data collected               Mon, 14 Feb 2022 14:44:18

root@r-boxer-sw01:/home/admin# docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS       PORTS     NAMES
3f3091a296bf   736fd83ca6a1                         "/usr/local/bin/supe…"   3 hours ago   Up 2 hours             what-just-happened
9260683a3e6c   docker-snmp:latest                   "/usr/local/bin/supe…"   3 hours ago   Up 2 hours             snmp
50bc3a6a8351   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   3 hours ago   Up 2 hours             mgmt-framework
61dda613653e   7c4f66877495                         "/usr/bin/docker_ini…"   3 hours ago   Up 2 hours             dhcp_relay
02f30d3ab380   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   3 hours ago   Up 2 hours             radv
0331e1717cf6   docker-lldp:latest                   "/usr/bin/docker-lld…"   3 hours ago   Up 2 hours             lldp
adb75e6f35b9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   3 hours ago   Up 2 hours             pmon
3b29843ee089   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   3 hours ago   Up 2 hours             syncd
361bceba54e7   docker-teamd:latest                  "/usr/local/bin/supe…"   3 hours ago   Up 2 hours             teamd
2bd90fc2b1a4   docker-orchagent:latest              "/usr/bin/docker-ini…"   3 hours ago   Up 2 hours             swss
7ced7c469af0   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   3 hours ago   Up 2 hours             bgp
b43c5781c33b   docker-database:latest               "/usr/local/bin/dock…"   3 hours ago   Up 3 hours             database

root@r-boxer-sw01:/home/admin# docker stats --no-stream --format {{.MemUsage}} telemetry
Error response from daemon: No such container: telemetry
root@r-boxer-sw01:/home/admin# echo $?
1

Describe the results you received:

root@r-boxer-sw01:/home/admin# tail -F /var/log/syslog | grep memory
Feb 14 14:12:40.904682 r-boxer-sw01 ERR memory_checker: [memory_checker] Failed to execute the command 'docker stats --no-stream --format \{\{.MemUsage\}\} telemetry'. Return code: '1'

Describe the results you expected:

No error messages are expected when the docker container doesn't exist.

Output of show version:

root@r-boxer-sw01:/home/admin# show version

SONiC Software Version: SONiC.202111.10-f08866b66_Internal
Distribution: Debian 11.2
Kernel: 5.10.0-8-2-amd64
Build commit: f08866b66
Build date: Mon Feb  7 08:15:17 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn2010-r0
HwSKU: ACS-MSN2010
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1749X10061
Model Number: MSN2010-CB2F
Hardware Revision: A1
Uptime: 15:52:04 up 12 min,  1 user,  load average: 1.02, 0.84, 0.62

Docker images:
REPOSITORY                                         TAG                            IMAGE ID       SIZE
docker-teamd                                       202111.10-f08866b66_Internal   a6461b4fc1b1   438MB
docker-teamd                                       latest                         a6461b4fc1b1   438MB
docker-sflow                                       202111.10-f08866b66_Internal   502f7aeb5296   439MB
docker-sflow                                       latest                         502f7aeb5296   439MB
docker-orchagent                                   202111.10-f08866b66_Internal   22c1035163ec   457MB
docker-orchagent                                   latest                         22c1035163ec   457MB
docker-nat                                         202111.10-f08866b66_Internal   bfaaeef59e80   441MB
docker-nat                                         latest                         bfaaeef59e80   441MB
docker-macsec                                      202111.10-f08866b66_Internal   19ce40820a13   441MB
docker-macsec                                      latest                         19ce40820a13   441MB
docker-fpm-frr                                     202111.10-f08866b66_Internal   f76fdb1c1625   457MB
docker-fpm-frr                                     latest                         f76fdb1c1625   457MB
docker-syncd-mlnx                                  202111.10-f08866b66_Internal   9187d53ee421   1.01GB
docker-syncd-mlnx                                  latest                         9187d53ee421   1.01GB
docker-platform-monitor                            202111.10-f08866b66_Internal   b9bce6dd4fad   809MB
docker-platform-monitor                            latest                         b9bce6dd4fad   809MB
docker-snmp                                        202111.10-f08866b66_Internal   8a604e09da49   465MB
docker-snmp                                        latest                         8a604e09da49   465MB
docker-dhcp-relay                                  latest                         7c4f66877495   436MB
docker-sonic-mgmt-framework                        202111.10-f08866b66_Internal   8e4823d8d271   578MB
docker-sonic-mgmt-framework                        latest                         8e4823d8d271   578MB
docker-sonic-telemetry                             202111.10-f08866b66_Internal   df6917e0f648   511MB
docker-sonic-telemetry                             latest                         df6917e0f648   511MB
docker-router-advertiser                           202111.10-f08866b66_Internal   ab565ec647e8   423MB
docker-router-advertiser                           latest                         ab565ec647e8   423MB
docker-mux                                         202111.10-f08866b66_Internal   4132719ce52e   475MB
docker-mux                                         latest                         4132719ce52e   475MB
docker-lldp                                        202111.10-f08866b66_Internal   5b758f8ab7b9   463MB
docker-lldp                                        latest                         5b758f8ab7b9   463MB
docker-database                                    202111.10-f08866b66_Internal   3d69079016cc   423MB
docker-database                                    latest                         3d69079016cc   423MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

Monit summary:

root@r-boxer-sw01:/home/admin# monit summary
Monit 5.20.0 uptime: 2h 15m

 Service Name                     Status                      Type          
 r-boxer-sw01                     Running                     System        
 rsyslog                          Running                     Process       
 root-overlay                     Accessible                  Filesystem    
 var-log                          Accessible                  Filesystem    
 routeCheck                       Status ok                   Program       
 diskCheck                        Status ok                   Program       
 container_checker                Status ok                   Program       
 vnetRouteCheck                   Status ok                   Program       
 container_memory_telemetry       Status ok                   Program       

Monit configuration:

root@r-boxer-sw01:/home/admin# cat /etc/monit/conf.d/monit_telemetry
###############################################################################
## Monit configuration for telemetry container
###############################################################################
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
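
The rule above reacts only to exit status 3, while the failure reported in this issue exits with status 1, so it produces a syslog error but no restart. A minimal sketch of that exit-status contract follows. It assumes that memory_checker exits with 3 when usage exceeds the threshold passed on the command line (419430400 bytes here) and that `docker stats` reports usage in the form `45.3MiB / 7.64GiB`. The helper names `mem_usage_to_bytes` and `exit_status_for` are hypothetical and are not taken from the real script.

```python
import re

# Binary unit suffixes as printed by `docker stats` (assumed subset).
UNIT_FACTORS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def mem_usage_to_bytes(mem_usage):
    """Convert a MemUsage string such as '45.3MiB / 7.64GiB' to bytes used."""
    used = mem_usage.split("/")[0].strip()
    match = re.fullmatch(r"([0-9.]+)\s*([A-Za-z]+)", used)
    if match is None or match.group(2) not in UNIT_FACTORS:
        raise ValueError("Unrecognized memory usage: {!r}".format(mem_usage))
    return float(match.group(1)) * UNIT_FACTORS[match.group(2)]


def exit_status_for(mem_usage, threshold_bytes):
    """Return 3 when usage exceeds the threshold (assumed contract), else 0."""
    if mem_usage_to_bytes(mem_usage) > threshold_bytes:
        return 3
    return 0
```

With this contract, a missing container should map to neither 3 nor an error-level log entry, which is what the issue asks for.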

Monit memory checker:

root@r-boxer-sw01:/home/admin# cat /usr/bin/memory_checker
def get_command_result(command):
    """Executes the command and return the resulting output.

    Args:
        command: A string contains the command to be executed.

    Returns:
        A string which contains the output of command.
    """
    command_stdout = ""

    try:
        proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                         shell=True, universal_newlines=True)
        command_stdout, command_stderr = proc_instance.communicate()
        if proc_instance.returncode != 0:
            syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to execute the command '{}'. Return code: '{}'"
                          .format(command, proc_instance.returncode))
            sys.exit(1)
    except (OSError, ValueError) as err:
        syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to execute the command '{}'. Error: '{}'"
                      .format(command, err))
        sys.exit(2)

    return command_stdout.strip()
@nazariig changed the title from "[202111] monit errors in the logs "ERR memory_checker: [memory_checker] Failed to execute the command"" to "monit errors in the logs "ERR memory_checker: [memory_checker] Failed to execute the command"" on May 10, 2022
@qiluo-msft
Collaborator

What is the severity of this issue? Does it have any impact other than the syslog ERR message?

yozhao101 added a commit that referenced this issue Jun 17, 2022
…t created (#11129)

Signed-off-by: Yong Zhao [email protected]

Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.

Specifically, if a container is not created successfully while the device is booting or rebooting, memory_checker does not need to check its memory usage.

How I did it
In the script memory_checker, a function is added to get the names of running containers. If the specified container name is not in the list of currently running containers, the script exits without checking its memory usage.
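
As a rough illustration of the check described here, and not the code merged in the PR, the sketch below lists running containers with `docker ps --format {{.Names}}` and exits early when the requested container is absent. The function names, the log wording, and the choice of exit status 0 are assumptions made for this example.

```python
import subprocess
import sys
import syslog


def parse_container_names(docker_ps_output):
    """Turn the output of `docker ps --format {{.Names}}` into a set of names."""
    return {line.strip() for line in docker_ps_output.splitlines() if line.strip()}


def get_running_container_names():
    """Return the names of running containers (hypothetical helper)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, check=False,
    )
    if result.returncode != 0:
        # Treat a failed listing as "nothing running" in this sketch.
        return set()
    return parse_container_names(result.stdout)


def exit_if_not_running(container_name, running_names):
    """Log at INFO level and exit with 0 when the container is not running."""
    if container_name not in running_names:
        syslog.syslog(
            syslog.LOG_INFO,
            "[memory_checker] Exiting without checking memory usage: "
            "container '{}' is not running".format(container_name),
        )
        sys.exit(0)  # assumed status; anything other than 3 avoids the Monit restart rule
```

A caller would run `exit_if_not_running("telemetry", get_running_container_names())` before issuing `docker stats`, so the `Failed to execute the command` error is never reached for a container that does not exist.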

How to verify it
I tested on a lab device with the following steps:

Stop the telemetry container with the command sudo systemctl stop telemetry.service

Remove the telemetry container with the command docker rm telemetry

Check whether the script memory_checker, run by Monit, generates a syslog message saying it will exit without checking the memory usage of telemetry.
yxieca pushed a commit that referenced this issue Jun 19, 2022
…t created (#11129)

qiluo-msft pushed a commit that referenced this issue Jul 5, 2022
…t created (#11129)
