Add health check probe for k8s upgrade containers. #15223

Merged
8 commits merged into sonic-net:master on Jul 11, 2023

Contributor

@lixiaoyuner lixiaoyuner commented May 26, 2023

Why I did it

After k8s upgrades a container, it only knows that the container is running; it does not know the status of the services inside it. So we need a probe inside the container that k8s can call to check whether the container is really ready.

Work item tracking
  • Microsoft ADO (number only):
    22453004

How I did it

Add a health check probe inside the config engine container. If the start service exists, the probe checks whether it exited normally. If the Python script is present, the probe calls it to run container-specific checks. This hook script should be implemented by the feature owner if needed.

more details: design doc
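
The flow described above could look roughly like the following. This is a minimal sketch, not the merged script: the function names `start_service_state` and `readiness_check` are hypothetical, the status text and hook path are passed in as arguments so the logic can be exercised without supervisord, and treating the `EXITED` state as "exited normally" is an assumption.

```shell
#!/bin/sh
# Sketch only: names and structure here are assumptions, not the merged script.

# Read `supervisorctl status` output on stdin and print the state of the
# "start" service: "absent" if it is not listed, otherwise its state column.
start_service_state() {
    awk '$1 == "start" { state = $2 } END { print (state == "" ? "absent" : state) }'
}

# readiness_check STATUS_TEXT HOOK_PATH
# Returns 0 when ready, 2 when start exists but is not in the EXITED state,
# or the hook's own non-zero exit code when the hook fails.
readiness_check() {
    state=$(printf '%s\n' "$1" | start_service_state)
    if [ "$state" != "absent" ] && [ "$state" != "EXITED" ]; then
        return 2
    fi
    # The hook is optional: run it only if it exists and is executable.
    if [ -x "$2" ]; then
        "$2" || return $?
    fi
    return 0
}
```

Inside a container this might be invoked as `readiness_check "$(supervisorctl status)" /usr/bin/readiness_probe_hook`, where the hook path is the name suggested during review and may differ from the final one.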

How to verify it

Check that /usr/bin/readiness_probe.sh exists inside the container.
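
One way to script this check is sketched below. The helper name `probe_installed` and its optional root-prefix argument are hypothetical and only there so the check can also be pointed at an unpacked image directory.

```shell
# probe_installed [ROOT]
# Succeeds when the probe script exists and is executable.
# ROOT is a hypothetical optional prefix; leave it empty inside a container.
probe_installed() {
    [ -x "${1:-}/usr/bin/readiness_probe.sh" ]
}
```

Inside the container, `probe_installed && echo "probe present"` reports the result. From the host, the equivalent is running `test -x /usr/bin/readiness_probe.sh` through `docker exec` against the container in question.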

Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [ ] 202006
  • [ ] 202012
  • [ ] 202106
  • [ ] 202111
  • [x] 202205
  • [x] 202211

Tested branch (Please provide the tested image version)

  • 20220531.28

@losha228 losha228 left a comment

LGTM


#### exit code contract: k8s only cares whether the code is zero or non-zero, but we use distinct non-zero codes to indicate different errors
# 0: readiness
# 1: python script crach exit code

Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

crach

Typo: "crach" should be "crash". #Closed

Contributor Author

Fixed
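
The exit code contract discussed in this thread can be summarized with a small helper. This is an illustrative sketch: the function name `describe_probe_exit` is hypothetical, codes 0 and 1 come from the contract comment above, code 2 comes from the start-service comment quoted elsewhere in this conversation, and any other code is assumed to be passed through from the hook script.

```shell
# describe_probe_exit CODE
# Prints a short description of a readiness probe exit code.
describe_probe_exit() {
    case "$1" in
        0) echo "ready" ;;
        1) echo "hook script crashed" ;;
        2) echo "start service has not exited normally" ;;
        *) echo "hook-specific failure (code $1)" ;;
    esac
}
```

Kubernetes itself only distinguishes zero from non-zero, so a mapping like this is mainly useful to a person reading probe logs.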

# if the start service exists, check if it exits normally
# if the start service doesn't exit normally, exit with code 2
pre_check_service_name="start"
supervisorctl status |awk '{print $1}' |grep -w $pre_check_service_name > /dev/null

Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

supervisorctl status

You can use one command:

```
supervisorctl status start
```
#Closed

Contributor Author

If we only run "supervisorctl status start", we can't decide by exit code, because the exit code is the same when start does not exist as it is for some failed states. We would have to decide from the output instead, for example "start: ERROR (no such process)" versus "start EXITED Jun 21 05:28 PM". I check whether start exists in advance because I think that makes the code logic easier to understand.

Collaborator

This is an example

root@sonic:/# supervisorctl status start
start                            EXITED    Jul 04 12:38 AM
root@sonic:/# supervisorctl status
dependent-startup                EXITED    Jul 04 12:38 AM
lldp-syncd                       RUNNING   pid 26, uptime 0:03:54
lldpd                            RUNNING   pid 20, uptime 0:03:57
lldpmgrd                         RUNNING   pid 30, uptime 0:03:52
rsyslogd                         RUNNING   pid 11, uptime 0:04:02
start                            EXITED    Jul 04 12:38 AM
supervisor-proc-exit-listener    RUNNING   pid 10, uptime 0:04:04
waitfor_lldp_ready               EXITED    Jul 04 12:38 AM

Contributor Author

Fixed
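
Deciding from the output of the single command, as discussed in this thread, might look like the sketch below. The function name `classify_start_status` and the labels it prints are hypothetical; the two matched strings are the output forms quoted above, and anything else is assumed to mean the service has not exited yet.

```shell
# classify_start_status OUTPUT
# OUTPUT is the text printed by `supervisorctl status start`.
classify_start_status() {
    case "$1" in
        *"ERROR (no such process)"*) echo "absent" ;;
        *EXITED*)                    echo "exited" ;;
        *)                           echo "pending" ;;
    esac
}
```

A probe built on this would treat "absent" and "exited" as passing the pre-check and exit with code 2 on "pending".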

# check if the post_check_script exists
# if the post_check_script exists, run it
# if the post_check_script exits with non-zero code, exit with the code
post_check_script="/usr/bin/readiness_probe.py"

Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

/usr/bin/readiness_probe.py

Do not assume python3.
How about /usr/bin/readiness_probe_hook?

if [ -x "$post_check_script" ]; then
    "$post_check_script"
fi

#Closed

Contributor Author

@lixiaoyuner lixiaoyuner Jun 22, 2023

Fixed

# check if the start service exists
# if the start service exists, check if it exits normally
# if the start service doesn't exit normally, exit with code 2
pre_check_service_name="start"

Collaborator

@qiluo-msft qiluo-msft Jun 22, 2023

start

Will you check all the critical processes? #Closed

Contributor Author

Unexpected exits of critical processes are currently handled by the supervisord exit listener, which kills the container, so I don't think we need to check them here. Is this correct?

@lguohan
Collaborator

lguohan commented Jun 30, 2023

Where is the public design doc for this health check probe?

@lixiaoyuner
Contributor Author

lixiaoyuner commented Jul 3, 2023

> Where is the public design doc for this health check probe?

We have a OneNote page; I previously put the link in the ADO discussion related to this PR (ADO number: 22453004).
I have also added the design doc to the SONiC repo.

qiluo-msft
qiluo-msft previously approved these changes Jul 7, 2023
@qiluo-msft qiluo-msft merged commit c470b7d into sonic-net:master Jul 11, 2023
lixiaoyuner added a commit to lixiaoyuner/sonic-buildimage that referenced this pull request Jul 11, 2023
#### Why I did it
After k8s upgrades a container, it only knows that the container is running; it does not know the status of the services inside it. So we need a probe inside the container that k8s can call to check whether the container is really ready.
##### Work item tracking
- Microsoft ADO **(number only)**: 22453004
#### How I did it
Add a health check probe inside the config engine container. If the start service exists, the probe checks whether it exited normally. If the Python script is present, the probe calls it to run container-specific checks. The Python script should be implemented by the feature owner if needed.

more details: [design doc](https://github.com/sonic-net/SONiC/blob/master/doc/kubernetes/health-check.md)
#### How to verify it
Check that /usr/bin/readiness_probe.sh exists inside the container.

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [x] 202205
- [x] 202211

#### Tested branch (Please provide the tested image version)
- [x] 20220531.28
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 13, 2023
@mssonicbld
Collaborator

Cherry-pick PR to 202205: #15823

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 13, 2023
@mssonicbld
Collaborator

Cherry-pick PR to 202211: #15824

mssonicbld pushed a commit that referenced this pull request Jul 13, 2023
mssonicbld pushed a commit that referenced this pull request Jul 14, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 17, 2023
@mssonicbld
Collaborator

Cherry-pick PR to 202305: #15867

StormLiangMS pushed a commit that referenced this pull request Jul 19, 2023

Co-authored-by: lixiaoyuner <[email protected]>
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
@lixiaoyuner lixiaoyuner deleted the add-health-check-probe branch February 7, 2024 06:10