
native healthchecks are unreliable/inconsistent #5915

Closed
cjeanner opened this issue Apr 21, 2020 · 3 comments · Fixed by #6009
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@cjeanner

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
Using the native podman healthcheck feature is unreliable, since there are inconsistencies between the live state and the state reported by "inspect". For instance:

podman inspect neutron_api | jq '.[0]["State"]["Healthcheck"]'                                                                                                                                              
{                                                                                                                                                                                                                                             
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2020-04-21T11:21:38.122154731Z",
      "End": "2020-04-21T11:21:39.199659552Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:38.917118406Z",
      "End": "2020-04-21T11:21:39.575425172Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:39.251824671Z",
      "End": "2020-04-21T11:21:39.767073492Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:47.97086488Z",
      "End": "2020-04-21T11:21:48.443987962Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:47.987444237Z",
      "End": "2020-04-21T11:21:48.497487617Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}

while:

[CentOS-8.1 - root@undercloud ~]# podman healthcheck run neutron_api
unhealthy

The correct state is the latter one; I broke the healthcheck on purpose in order to run some tests.

Steps to reproduce the issue:

  1. Get a container with a working healthcheck
  2. Break that healthcheck
  3. Check the healthcheck using both "podman healthcheck run" and "podman inspect"
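
For example, with the container above, step 3 boils down to comparing these two commands, which should agree but do not:

# live result: actually runs the healthcheck command right now
podman healthcheck run neutron_api

# stored state: what podman believes based on the recorded healthcheck runs
podman inspect neutron_api | jq '.[0]["State"]["Healthcheck"]["Status"]'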

Describe the results you received:
We don't get the right result in the "inspect" output

Describe the results you expected:
We should get the correct result

Additional information you deem important (e.g. issue happens only occasionally):
Here is the full inspect output, since there are other healthcheck-related entries in the Config block; maybe we configured them wrong?
https://gist.github.com/cjeanner/1b1ba82eae7331c2c688ded4dcf8d111

Output of podman version:

Version:            1.6.4
RemoteAPI Version:  1
Go Version:         go1.13.4
OS/Arch:            linux/amd64

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.13.4
  podman version: 1.6.4
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module_el8.1.0+298+41f9343a.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 2721f230f94894671f141762bd0d1af2fb263239'
  Distribution:
    distribution: '"centos"'
    version: "8"
  MemFree: 868073472
  MemTotal: 29264334848
  OCIRuntime:
    name: runc
    package: runc-1.0.0-64.rc9.module_el8.1.0+298+41f9343a.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: undercloud.localdomain
  kernel: 4.18.0-147.8.1.el8_1.x86_64
  os: linux
  rootless: false
  uptime: 4h 17m 54.55s (Approximately 0.17 days)
registries:
  blocked: null
  insecure:
  - undercloud.ctlplane.localdomain
  - 192.168.24.1
  - 192.168.24.3
  - undercloud.ctlplane
  search:
  - registry.redhat.io
  - registry.access.redhat.com
  - registry.fedoraproject.org
  - registry.centos.org
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 84
  GraphDriverName: overlay
  GraphOptions: {}
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 35
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

podman-1.6.4-4.module_el8.1.0+298+41f9343a.x86_64
@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 21, 2020
@mheon
Member

mheon commented Apr 21, 2020

@baude PTAL

@cjeanner
Author

Hello!
So, I took some more time to build a reproducer, and detected some other inconsistencies.

Reproducer

Dockerfile:

FROM alpine:latest

RUN apk add --no-cache vim
COPY healthcheck /healthcheck
RUN chmod +x /healthcheck

COPY init.sh /init.sh
RUN chmod +x /init.sh

CMD ["/init.sh"]

healthcheck

#!/bin/sh
echo 'This is a success'
exit 0

init.sh

#!/bin/sh
while :; do
  sleep 30
done

Methodology

  1. build the container using podman build -t healthcheck .
  2. run the container using podman run --health-cmd /healthcheck --health-interval 60s --health-retries 3 -d --name healthcheck healthcheck
  3. check the healthcheck state using both podman healthcheck run healthcheck and podman inspect healthcheck | jq '.[0]["State"]'
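
Putting those steps together (a minimal sketch, assuming the Dockerfile, healthcheck and init.sh above are in the current directory):

# 1. build the image
podman build -t healthcheck .

# 2. run it with a health command, a 60s interval and 3 retries
podman run --health-cmd /healthcheck --health-interval 60s --health-retries 3 -d --name healthcheck healthcheck

# 3. compare the live result with the stored state
podman healthcheck run healthcheck
podman inspect healthcheck | jq '.[0]["State"]'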

Break the healthcheck

  1. podman exec -uroot -ti healthcheck sh
  2. vi healthcheck
  3. modify its message and exit code - something like echo "this is a failure" and exit 1 is enough
  4. Check the healthcheck via podman inspect healthcheck | jq '.[0]["State"]'

You can run the following to continuously watch the healthcheck status as reported by podman inspect: watch "podman inspect healthcheck | jq '.[0][\"State\"]'"
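
If you prefer not to edit the file interactively, the same breakage can be scripted (a sketch that simply rewrites /healthcheck inside the running container so that it always fails):

# overwrite the healthcheck script with a failing one (equivalent to steps 1-3 above)
podman exec -uroot healthcheck sh -c 'printf "%s\n" "#!/bin/sh" "echo this is a failure" "exit 1" > /healthcheck'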

Issues

  1. podman inspect doesn't report the right state. From time to time you might see a failure, but you won't get the output, only the exit code set to 1, and the FailingStreak will rise to 1. It later goes back to 0 as the failure entry is dropped and replaced by a bogus success.
  2. Running podman healthcheck run healthcheck manually actually DOES make the failure show up in podman inspect. Running the healthcheck manually more than 3 times will toggle the container to "unhealthy".
  3. Starting a container with a failing healthcheck keeps it in the "starting" state without any failure reported in the inspect output. In fact, it seems the healthcheck results are never refreshed: in my case, there is no record of any new healthcheck run since I started the container a couple of minutes ago. The last entries are from yesterday:
{
  "OciVersion": "1.0.1-dev",
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 70624,
  "ConmonPid": 70612,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2020-04-23T07:50:06.660901138Z",
  "FinishedAt": "0001-01-01T00:00:00Z",
  "Healthcheck": {
    "Status": "starting",
    "FailingStreak": 0,
    "Log": [
      {
        "Start": "2020-04-22T14:46:01.170792682Z",
        "End": "2020-04-22T14:46:01.618066347Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:02.233518944Z",
        "End": "2020-04-22T14:46:02.690192626Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:03.171000823Z",
        "End": "2020-04-22T14:46:03.616105101Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:04.112180387Z",
        "End": "2020-04-22T14:46:04.54862729Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:04.97645628Z",
        "End": "2020-04-22T14:46:05.431782197Z",
        "ExitCode": 0,
        "Output": ""
      }
    ]
  }
}

@cjeanner
Copy link
Author

cjeanner commented Apr 23, 2020

After even more digging, I found a potential reason:
it seems a log is written to /var/log/containers/stdouts/healthcheck.log.
I'm not really sure how this is done/managed, but apparently every healthcheck overwrites that file on its own. It doesn't append, meaning container-1 will overwrite container-2's result.

Since the log doesn't contain any mention of the container, this might explain why "podman inspect" doesn't report the right state.

Now, about the location:
/var/log/containers/stdouts is the base directory for the container logs we've set up for ALL our containers.
For instance:

"LogPath": "/var/log/containers/stdouts/swift_proxy.log"

So, the question is: is the healthcheck.log location built from some "dirname(LogPath)"? That would explain a lot, and a correction would be really simple, something like:
os.path.join(os.path.dirname(LogPath), container_name + "_healthcheck.log")
(of course, this is just pseudocode to show the idea)

Does it make any sense? Did I just find the root cause? :)
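
To illustrate the suspicion (a hypothetical sketch of the path derivation, not the actual podman code): if the healthcheck log location really is built from dirname(LogPath), then every container whose LogPath points into the same directory ends up writing to the same file:

# two containers sharing a log directory (assuming neutron_api's LogPath follows the same pattern as swift_proxy's)...
dirname /var/log/containers/stdouts/swift_proxy.log   # -> /var/log/containers/stdouts
dirname /var/log/containers/stdouts/neutron_api.log   # -> /var/log/containers/stdouts
# ...would both resolve their healthcheck log to the same path,
# /var/log/containers/stdouts/healthcheck.log, and overwrite each other's results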

baude added a commit to baude/podman that referenced this issue Apr 27, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <[email protected]>
snj33v pushed a commit to snj33v/libpod that referenced this issue May 31, 2020
mheon pushed a commit to mheon/libpod that referenced this issue Jun 24, 2020
mheon pushed a commit to mheon/libpod that referenced this issue Sep 16, 2020