
WIP: Do not merge: Test periodic reaper with the router#111

Closed
mrunalp wants to merge 1 commit into openshift:master from
mrunalp:test_periodic_router

Conversation

@mrunalp
Member

@mrunalp mrunalp commented Apr 6, 2020

Signed-off-by: Mrunal Patel mrunalp@gmail.com

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2020

// parseProcForZombies parses the current procfs mounted at /proc
// to find processes in the zombie state.
func parseProcForZombies() ([]int, error) {
Contributor

Does the direct use of procfs couple this to Linux?

Contributor

We don't support containers on non-Linux right now; the router wouldn't work on Windows.
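For context, a procfs scan like the excerpted parseProcForZombies might look roughly like the sketch below. This is an illustration of the technique, not the PR's exact code; stateFromStat is a hypothetical helper that pulls the state field out of a /proc/&lt;pid&gt;/stat line.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// stateFromStat extracts the process state (the field after the comm
// field) from a /proc/<pid>/stat line. The comm field is parenthesized
// and may contain spaces, so scan back from the last ')'.
func stateFromStat(stat string) byte {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 || i+2 >= len(stat) {
		return 0
	}
	return stat[i+2]
}

// parseProcForZombies walks /proc and returns the pids of all
// processes currently in the zombie ('Z') state.
func parseProcForZombies() ([]int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var zombies []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a pid directory (e.g. /proc/meminfo)
		}
		data, err := os.ReadFile("/proc/" + e.Name() + "/stat")
		if err != nil {
			continue // process may have exited mid-scan
		}
		if stateFromStat(string(data)) == 'Z' {
			zombies = append(zombies, pid)
		}
	}
	return zombies, nil
}

func main() {
	fmt.Printf("state=%c\n", stateFromStat("17 (haproxy) Z 1 17")) // prints: state=Z
}
```

As the review thread notes, this does couple the code to Linux: it depends on procfs being mounted at /proc.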

@ironcladlou
Contributor

Interesting... can you explain the theory of operation? My assumption based on prior R&D is that switching from signal handling to a polling approach shortens (but does not eliminate) the window of time the reaper can race with explicit wait calls.

// from a pid 1 process.
// The zombie processes are reaped at the beginning of next cycle, so os.Exec calls
// have an opportunity to reap their children within `period` seconds.
func StartPeriodicReaper(period int64) {
Contributor

Instead of forcing the caller to a certain time unit, how about instead receiving a time.Duration?

@frobware
Contributor

frobware commented Apr 7, 2020

All of this seems like a band-aid to me. In calls where we lose/race when calling Wait() we're likely to log an error. Customers then report that, and we say "not really an error", which is perhaps OK for the case we have now. It just seems like this will mask problems.

Why don't we have some form of real init where we won't race?

See also: #78

@ironcladlou
Contributor

@frobware agreed; not sure we need to re-litigate it here

@smarterclayton
Contributor

Mrunal's change is fundamentally different. The old reaper was completely wrong: you don't wait for all children, you wait for the unmanaged defunct ones. And you only need to reap defunct processes. So this logic is correct now: only zombie, defunct, non-parented children get reaped. There's no race because a defunct process stays defunct.

@smarterclayton
Contributor

smarterclayton commented Apr 13, 2020

Bringing in a new init system to manage defunct processes vs. 20 lines of Go that will be used in every image that performs the required function. There is no real init responsibility except reaping defunct processes.

@frobware
Contributor

Mrunal's change is fundamentally different. The old reaper was completely wrong: you don't wait for all children, you wait for the unmanaged defunct ones.

Ugh. I clearly didn't spend enough time looking at this because it's clearly waiting on zombie pids only.

I did some validation of the change here https://github.com/frobware/haproxy-hacks/blob/master/reaper/main.go#L21; this is based on our previous experiments.

In the following output we see the router's reload script being invoked every second:

2020/04/15 13:01:18 Launching periodic reaper
2020/04/15 13:01:18 Sleeping for 6 seconds
cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
   30 pts/0    S+     0:00 grep defunct
#defunct 1

cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 17
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
   52 pts/0    S+     0:00 grep defunct
#defunct 1

cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 39 17
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
   17 ?        Zs     0:00 [haproxy] <defunct>
   73 pts/0    S+     0:00 grep defunct
#defunct 2

cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 60 39 17
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
   17 ?        Zs     0:00 [haproxy] <defunct>
   39 ?        Zs     0:00 [haproxy] <defunct>
   94 pts/0    S+     0:00 grep defunct
#defunct 3

And later we see the reaper picking up some zombies:

cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 1699 1678 1657 1636 1615 1594 1573 1552 1531 1510 1489 1468 1447
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
 1447 ?        Zs     0:00 [haproxy] <defunct>
 1468 ?        Zs     0:00 [haproxy] <defunct>
 1489 ?        Zs     0:00 [haproxy] <defunct>
 1510 ?        Zs     0:00 [haproxy] <defunct>
 1531 ?        Zs     0:00 [haproxy] <defunct>
 1552 ?        Zs     0:00 [haproxy] <defunct>
 1573 ?        Zs     0:00 [haproxy] <defunct>
 1594 ?        Zs     0:00 [haproxy] <defunct>
 1615 ?        Zs     0:00 [haproxy] <defunct>
 1636 ?        Zs     0:00 [haproxy] <defunct>
 1657 ?        Zs     0:00 [haproxy] <defunct>
 1678 ?        Zs     0:00 [haproxy] <defunct>
 1733 pts/0    S+     0:00 grep defunct
#defunct 13

2020/04/15 13:02:42 Zombie reaped: 1447
2020/04/15 13:02:42 Zombie reaped: 1468
2020/04/15 13:02:42 Zombie reaped: 1489
2020/04/15 13:02:42 Zombie reaped: 1510
2020/04/15 13:02:42 Zombie reaped: 1531
2020/04/15 13:02:42 Zombie reaped: 1552
2020/04/15 13:02:42 Sleeping for 6 seconds

And based on timing, we see the number of defunct drop:

cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 1741 1720 1699 1678 1657 1636 1615 1594 1573
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
 1573 ?        Zs     0:00 [haproxy] <defunct>
 1594 ?        Zs     0:00 [haproxy] <defunct>
 1615 ?        Zs     0:00 [haproxy] <defunct>
 1636 ?        Zs     0:00 [haproxy] <defunct>
 1657 ?        Zs     0:00 [haproxy] <defunct>
 1678 ?        Zs     0:00 [haproxy] <defunct>
 1699 ?        Zs     0:00 [haproxy] <defunct>
 1720 ?        Zs     0:00 [haproxy] <defunct>
 1775 pts/0    S+     0:00 grep defunct
#defunct 9

2020/04/15 13:02:48 Zombie reaped: 1573
2020/04/15 13:05:36 Zombie reaped: 1594
2020/04/15 13:05:36 Zombie reaped: 1615
2020/04/15 13:05:36 Zombie reaped: 1636
2020/04/15 13:05:36 Zombie reaped: 1657
2020/04/15 13:05:36 Zombie reaped: 1678
2020/04/15 13:05:36 Sleeping for 6 seconds
cmd.CombinedOutput() err=exit status 1, out=OLD_PIDS: 1762 1741 1720 1699
 - Checking http://localhost:80 ...
 - Exceeded max wait time (0) in health check - 0 retry attempt(s).
 1699 ?        Zs     0:00 [haproxy] <defunct>
 1720 ?        Zs     0:00 [haproxy] <defunct>
 1741 ?        Zs     0:00 [haproxy] <defunct>
 1796 pts/0    S+     0:00 grep defunct
#defunct 4

@frobware
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2020
var zs []int
var err error
for {
for _, z := range zs {
Contributor

@frobware frobware Apr 15, 2020

Curious as to why we don't pull the call to:

zs, err = parseProcForZombies()

before this loop?


// StartPeriodicReaper starts a goroutine to reap processes periodically if called
// from a pid 1 process.
// The zombie processes are reaped at the beginning of next cycle, so os.Exec calls
Contributor

This comment isn't necessary, since os.Exec doesn't care about zombies (it will never itself have a zombie because it never quits, so the child isn't reparented).

@smarterclayton
Contributor

Also testing in openshift/ci-search#73

// from a pid 1 process.
// The zombie processes are reaped at the beginning of next cycle, so os.Exec calls
// have an opportunity to reap their children within `period` seconds.
func StartPeriodicReaper(period int64) {
Contributor

Take a time.Duration, yeah.

// have an oppurtunity to reap their children within `period` seconds.
func StartPeriodicReaper(period int64) {
if os.Getpid() == 1 {
klog.V(4).Infof("Launching periodic reaper")
Contributor

Once you've tested strip out all logging.

@@ -3,13 +3,87 @@
package proc
Contributor

Rename this file proc_linux.go and then add a proc.go with // +build !linux that exposes StartReaper
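The suggested split might look like the following stub for the non-Linux side (a sketch of the reviewer's proposal; the real file and function names live in the PR and in library-go). The Linux implementation would move to proc_linux.go unchanged.

```go
// +build !linux

// Package proc provides process-reaping helpers. On non-Linux
// platforms there is no procfs and no pid-1 reaping to do, so the
// entry point is a no-op.
package proc

// StartReaper does nothing off Linux; containers (and the router)
// are only supported on Linux anyway.
func StartReaper() {}
```

With this build constraint in place, callers can invoke proc.StartReaper unconditionally and the platform split is handled at compile time.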

}

proc.StartReaper()
proc.StartPeriodicReaper(6)
Contributor

Prefer StartReaper since the function hasn't changed really.

@mrunalp
Member Author

mrunalp commented Apr 22, 2020

I will update the PR with the comments here at openshift/library-go#767 tomorrow. Thanks!

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
@mrunalp mrunalp force-pushed the test_periodic_router branch from 4b02e46 to d0207d8 Compare April 23, 2020 02:21
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2020
@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2020
@openshift-ci-robot
Contributor

@mrunalp: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/verify d0207d8 link /test verify
ci/prow/e2e d0207d8 link /test e2e
ci/prow/e2e-metal-ipi d0207d8 link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sgreene570
Contributor

Closing in favor of #190
/close

@openshift-ci-robot
Contributor

@sgreene570: Closed this PR.


In response to this:

Closing in favor of #190
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
