Skip to content

Conversation

@frobware
Copy link
Contributor

@frobware frobware commented Jan 8, 2020

DO NOT MERGE.

We were getting races when reloading haproxy via the reload-haproxy script because we had our own process reaper (StartReaper). Occasionally the reload would report no child processes and this happened when StartReaper had already reaped the reload script that we were independently waiting on elsewhere.

This summarises the situation we currently have: krallin/tini#8 (comment)

It seems better to separate these concerns so using catatonit[1] to do that, so:

  • This changes the entrypoint of the haproxy container to use catatonit
  • This binary is the new PID 1 and is responsible for reaping processes

TODO: decide how/where we get catatonit from, options are:

  • build locally from source in base image (currently done in this PR)
  • from brew/RPM

[1] https://github.com/openSUSE/catatonit

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 8, 2020
@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch 2 times, most recently from 4223c25 to 2b7d19c Compare January 8, 2020 13:25
@frobware frobware changed the title template/router: explicitly handle "no child processes" error [WIP] template/router: explicitly handle "no child processes" error Jan 8, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 8, 2020
@frobware
Copy link
Contributor Author

frobware commented Jan 8, 2020

/hold

Until I'm utterly convinced by my assertions and reality.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2020
@frobware
Copy link
Contributor Author

frobware commented Jan 8, 2020

/test e2e-aws

@frobware
Copy link
Contributor Author

frobware commented Jan 8, 2020

/test images

@frobware
Copy link
Contributor Author

frobware commented Jan 8, 2020

Latest hack looks like:

diff --git a/Makefile b/Makefile
index e866dff..173e4f2 100644
--- a/Makefile
+++ b/Makefile
@@ -20,7 +20,7 @@ define version-ldflags
 -X $(1).buildDate="$(shell date -u +'%Y-%m-%dT%H:%M:%SZ')"
 endef
 GO_LD_EXTRAFLAGS ?=
-GO_LDFLAGS ?=-ldflags "-s -w $(call version-ldflags,$(PACKAGE)/pkg/version) $(GO_LD_EXTRAFLAGS)"
+GO_LDFLAGS ?=-ldflags "$(call version-ldflags,$(PACKAGE)/pkg/version) $(GO_LD_EXTRAFLAGS)"
 
 GO=GO111MODULE=on GOFLAGS=-mod=vendor go
 GO_BUILD_RECIPE=CGO_ENABLED=0 $(GO) build -o $(BIN) $(GO_GCFLAGS) $(GO_LDFLAGS) $(MAIN_PACKAGE)
@@ -30,7 +30,7 @@ all: build
 build:
 	$(GO_BUILD_RECIPE)
 
-images/router/*/Dockerfile: images/router/base/Dockerfile
+images/router/*/Dockerfile:
 	imagebuilder -t registry.svc.ci.openshift.org/openshift/origin-v4.0:`basename $(@D)`-router -f images/router/`basename $(@D)`/Dockerfile .
 
 images/router/*/Dockerfile.rhel: images/router/base/Dockerfile.rhel
diff --git a/images/router/haproxy/Dockerfile b/images/router/haproxy/Dockerfile
index 1a646b1..4cb8328 100644
--- a/images/router/haproxy/Dockerfile
+++ b/images/router/haproxy/Dockerfile
@@ -1,5 +1,11 @@
-FROM registry.svc.ci.openshift.org/openshift/origin-v4.0:base-router
-RUN INSTALL_PKGS="haproxy20 rsyslog sysvinit-tools" && \
+# FROM registry.access.redhat.com/ubi8/ubi-init
+# RUN yum -y install strace
+
+FROM centos:centos7
+RUN yum -y install strace wget
+RUN rpm -ivh http://spicy.frobware.com/~aim/x86_64/haproxy21-2.1.2-1.el7.x86_64.rpm
+RUN /usr/sbin/haproxy -vv
+RUN INSTALL_PKGS="strace procps-ng socat rsyslog sysvinit-tools" && \
     yum install -y $INSTALL_PKGS && \
     rpm -V $INSTALL_PKGS && \
     yum clean all && \
@@ -9,7 +15,10 @@ RUN INSTALL_PKGS="haproxy20 rsyslog sysvinit-tools" && \
     setcap 'cap_net_bind_service=ep' /usr/sbin/haproxy && \
     chown -R :0 /var/lib/haproxy && \
     chmod -R g+w /var/lib/haproxy
+RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.2/dumb-init_1.2.2_amd64
+RUN chmod +x /usr/local/bin/dumb-init
 COPY images/router/haproxy/* /var/lib/haproxy/
+COPY openshift-router /usr/bin/openshift-router
 LABEL io.k8s.display-name="OpenShift HAProxy Router" \
       io.k8s.description="This component offers ingress to an OpenShift cluster via Ingress and Route rules." \
       io.openshift.tags="openshift,router,haproxy"
@@ -18,4 +27,5 @@ EXPOSE 80 443
 WORKDIR /var/lib/haproxy/conf
 ENV TEMPLATE_FILE=/var/lib/haproxy/conf/haproxy-config.template \
     RELOAD_SCRIPT=/var/lib/haproxy/reload-haproxy
-ENTRYPOINT ["/usr/bin/openshift-router"]
+ENTRYPOINT ["/usr/local/bin/dumb-init",  "--"]
+CMD ["/usr/bin/openshift-router"]
diff --git a/pkg/router/template/router.go b/pkg/router/template/router.go
index 7122cee..6e3788f 100644
--- a/pkg/router/template/router.go
+++ b/pkg/router/template/router.go
@@ -549,7 +549,7 @@ func (r *templateRouter) reloadRouter() error {
 	// CombinedOutput(). The logic there calls Start(), then
 	// Wait() and that could be racy if there is a GC pause (or
 	// other scheduling activity).
-	if err != nil && !noChildProcessesRegExp.MatchString(err.Error()) {
+	if err != nil {
 		return fmt.Errorf("error reloading router: %v\n%s", err, string(out))
 	}
 	log.V(0).Info("router reloaded", "output", string(out))

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 8, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: frobware
To complete the pull request process, please assign ironcladlou
You can assign the PR to them by writing /assign @ironcladlou in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch from 5a98727 to 70f6441 Compare January 8, 2020 19:51
@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch from b6d2cdb to a4ada1d Compare January 10, 2020 12:20
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 10, 2020
@frobware
Copy link
Contributor Author

@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch from 863c3ea to 72db938 Compare January 10, 2020 14:47
@smarterclayton
Copy link
Contributor

I don’t think we should do this.

If this is a “reap happens too fast and impacts the app” bug, let’s just fix reap.

Pid 1 isn’t hard.

@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch from 72db938 to 5e4c64b Compare January 10, 2020 15:47
@frobware frobware force-pushed the handle-race-in-cmd-CombinedOutput branch from 5e4c64b to dbdbb64 Compare January 10, 2020 15:50
@frobware
Copy link
Contributor Author

If this is a “reap happens too fast and impacts the app” bug, let’s just fix reap.

But doesn't this slow it down for all consumers now? Is this desirable? IIRC, this library was being used by, or was developed for, oc observe, so wouldn't this impact interactive use?

@frobware
Copy link
Contributor Author

Pid 1 isn’t hard.

The issue we have is that sometimes the StartReaper() Go routine is scheduled and reaps the PID that the cmd is waiting on. I put a reproducer here which is based on our haproxy RPM and our reload script: https://github.com/frobware/haproxy-hacks/tree/master/reaper.

@openshift-ci-robot
Copy link
Contributor

@frobware: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws dbdbb64 link /test e2e-aws
ci/prow/verify dbdbb64 link /test verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@frobware
Copy link
Contributor Author

Can be closed by #111

@frobware frobware closed this Apr 21, 2020
@frobware frobware deleted the handle-race-in-cmd-CombinedOutput branch May 1, 2024 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants