Validate webhook fails open #1634

Closed
anyasabo opened this issue Aug 26, 2019 · 1 comment · Fixed by #2524

Labels
>bug Something isn't working

Comments

@anyasabo
Contributor

From a report on the discuss forums:
https://discuss.elastic.co/t/error-from-server-notfound-services-quickstart-es-http-not-found/193658/15

It appears that if GKE has network policies enforced but no network policy allows the API server to communicate with the webhook, admission of new Elastic resources times out. The goal of setting the failure policy to Ignore (see #1386) was to let creation proceed even if the webhook was unavailable, but we may be hitting a combination of timeouts that complicates that. We should investigate and ensure that our webhook does indeed fail open in this environment.

It may also be worth updating the docs with an example network policy that allows the webhook to function in GKE when network policies are enforced.
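
A minimal sketch of what such a policy might look like follows. It is an illustration under stated assumptions, not a tested recommendation: the policy name, the `control-plane: elastic-operator` pod label, port 9443, and the CIDR are all placeholders that would need to be checked against the actual operator deployment and the cluster's control plane address range.

```yaml
# Hypothetical example: allow the Kubernetes API server to reach the
# webhook served by the operator pods in the elastic-system namespace.
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-apiserver-to-webhook    # assumed name
  namespace: elastic-system
spec:
  podSelector:
    matchLabels:
      control-plane: elastic-operator # assumed label on the operator pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.16.0.0/28       # placeholder: control plane CIDR of the cluster
      ports:
        - protocol: TCP
          port: 9443                  # assumed port the webhook server listens on
```

Because network policies are additive, a policy like this should still admit webhook traffic when a default-deny ingress policy exists in the same namespace.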

Similar k/k issue:
kubernetes/kubernetes#71508

anyasabo added the >bug Something isn't working label Aug 26, 2019
charith-elastic self-assigned this Feb 5, 2020
@charith-elastic
Contributor

Simply enabling network policy enforcement in GKE has no effect on the webhook because all pods are non-isolated by default. To interfere with the correct operation of the webhook, users have to deliberately block traffic to the operator pods. I managed to do so by creating the following network policy, which selects every pod in the elastic-system namespace and allows no ingress:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-deny-all
  namespace: elastic-system
spec:
  podSelector: {}
  ingress: []

With the above policy in place, attempting to create an Elasticsearch resource times out with the following error message:

Error from server (Timeout): error when creating "es.yaml": Timeout: request did not complete within requested timeout 30s

As noted in the linked upstream issue, this is simply a case of the client timing out before the server does. By setting the webhook failure policy to Fail and increasing the client timeout (kubectl --request-timeout=1m apply -f es.yaml), the following error can be observed:

Error from server (InternalError): error when creating "es.yaml": Internal error occurred: failed calling webhook "elastic-es-validation-v1.k8s.elastic.co": Post https://elastic-webhook-server.elastic-system.svc:443/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch?timeout=30s: context deadline exceeded

When the webhook failure policy is set to Ignore and kubectl is invoked with the increased client timeout value, the resource gets created after waiting for about 30 seconds (server-side request timeout to the webhook). This is the intended behaviour and it seems our implementation is working as expected.
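
For reference, the two settings that drive this behaviour live on the webhook entry of the ValidatingWebhookConfiguration. The fragment below is a hedged sketch rather than the manifest ECK ships: the webhook name, service, and path are taken from the error message above, while the configuration name, the rules, and the `sideEffects` and `admissionReviewVersions` values are assumptions, and the `caBundle` is left out. `timeoutSeconds` is the standard Kubernetes field (valid values are 1 to 30); setting it lower than the client's request timeout is one possible way to let the API server give up on the webhook before the client gives up on the API server.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: elastic-webhook.k8s.elastic.co         # assumed configuration name
webhooks:
  - name: elastic-es-validation-v1.k8s.elastic.co
    failurePolicy: Ignore                       # fail open if the webhook cannot be reached
    timeoutSeconds: 10                          # illustrative value, shorter than the 30s seen above
    sideEffects: None                           # assumed
    admissionReviewVersions: ["v1", "v1beta1"]  # assumed
    clientConfig:
      service:
        name: elastic-webhook-server
        namespace: elastic-system
        path: /validate-elasticsearch-k8s-elastic-co-v1-elasticsearch
      # caBundle omitted; a real configuration needs the CA for the webhook certificate
    rules:                                      # assumed rule set
      - apiGroups: ["elasticsearch.k8s.elastic.co"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["elasticsearches"]
```

With a shorter webhook timeout, a blocked webhook would be expected to delay resource creation by roughly that many seconds instead of the full 30.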
