Validate webhook fails open #1634

anyasabo · 2019-08-26T13:44:39Z

From a report on the discuss forums:
https://discuss.elastic.co/t/error-from-server-notfound-services-quickstart-es-http-not-found/193658/15

It appears that if GKE has network policies enforced but no network policy allows the API server to communicate with the webhook, admission of new Elastic resources times out. The goal of setting the failure policy to Ignore (see #1386) was to let creation proceed even when the webhook is unavailable, but we may be hitting a combination of timeouts that complicates that. We should investigate and ensure that our webhook does indeed fail open in this environment.

It may also be worth updating the docs with an example network policy that allows the webhook to function in GKE with network policies enforced.
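
A minimal sketch of what such a policy could look like. It assumes the operator runs in the elastic-system namespace, that its pods carry the label control-plane: elastic-operator, and that the webhook server listens on port 9443; the policy name is made up for illustration. All of these should be checked against the actual deployment. Leaving out the "from" field admits traffic from any source, which sidesteps having to identify the API server's address, at the cost of opening that one port to every client.

```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-webhook-access-from-apiserver   # hypothetical name
  namespace: elastic-system
spec:
  podSelector:
    matchLabels:
      control-plane: elastic-operator          # assumed operator pod label
  ingress:
    # No "from" clause: the rule matches all sources, including the API server.
    - ports:
        - protocol: TCP
          port: 9443                           # assumed webhook server port
```

Network policies are additive, so a policy like this can coexist with a default-deny policy in the same namespace: the selected pods accept traffic on the listed port and nothing else.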

Similar k/k issue:
kubernetes/kubernetes#71508

charith-elastic · 2020-02-05T15:16:59Z

Simply enabling network policy enforcement in GKE has no effect on the webhook because all pods are non-isolated by default. To interfere with the correct operation of the webhook, users have to make a conscious decision to block traffic to the operator pods. I managed to do so by creating the following network policy, which denies all ingress to every pod in the elastic-system namespace:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-deny-all
  namespace: elastic-system
spec:
  podSelector: {}
  ingress: []

With the above policy in place, attempting to create an Elasticsearch resource times out with the following error message:

Error from server (Timeout): error when creating "es.yaml": Timeout: request did not complete within requested timeout 30s

As noted in the linked upstream issue, this is simply a case of the client timing out before the server does. By setting the webhook failure policy to Fail and increasing the client timeout (kubectl --request-timeout=1m apply -f es.yaml), the following error can be observed:

Error from server (InternalError): error when creating "es.yaml": Internal error occurred: failed calling webhook "elastic-es-validation-v1.k8s.elastic.co": Post https://elastic-webhook-server.elastic-system.svc:443/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch?timeout=30s: context deadline exceeded

When the webhook failure policy is set to Ignore and kubectl is invoked with the increased client timeout value, the resource gets created after waiting for about 30 seconds (server-side request timeout to the webhook). This is the intended behaviour and it seems our implementation is working as expected.
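
For reference, the two settings that interact here live on the webhook configuration object. The fragment below is a sketch rather than the manifest the operator ships: the webhook name, service name, namespace, and path are taken from the error message above, the 30-second timeout mirrors the observed delay, and the object name, rules, sideEffects, and admissionReviewVersions values are assumptions added only to make the object complete.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: elastic-webhook.k8s.elastic.co          # assumed object name
webhooks:
  - name: elastic-es-validation-v1.k8s.elastic.co
    # Ignore: if the call errors or times out, the API server admits the request.
    # Fail: the same condition rejects the request with an InternalError.
    failurePolicy: Ignore
    # How long the API server waits for the webhook (allowed range: 1 to 30).
    timeoutSeconds: 30
    clientConfig:
      service:
        name: elastic-webhook-server
        namespace: elastic-system
        path: /validate-elasticsearch-k8s-elastic-co-v1-elasticsearch
      # caBundle omitted from this sketch
    rules:                                      # assumed scope
      - apiGroups: ["elasticsearch.k8s.elastic.co"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["elasticsearches"]
    sideEffects: None                           # assumed
    admissionReviewVersions: ["v1", "v1beta1"]  # assumed
```

With failurePolicy set to Ignore, a blocked webhook delays admission by up to timeoutSeconds instead of rejecting it, so any client-side timeout has to be longer than that value for the request to succeed.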

anyasabo added the >bug label Aug 26, 2019

charith-elastic self-assigned this Feb 5, 2020

charith-elastic mentioned this issue Feb 5, 2020

Add webhook network policy troubleshooting information #2524

Merged

charith-elastic closed this as completed in #2524 Feb 6, 2020

anyasabo mentioned this issue Feb 13, 2020

Consider reducing webhook timeout #2563

Open
