
Long time-to-first-byte problem #245

Closed
djensen47 opened this issue May 1, 2018 · 9 comments
@djensen47

I've been experiencing long wait times for time-to-first-byte (TTFB) using the ingress-gce on GKE.

I compared going through ingress-gce versus connecting directly to a pod. Going directly to the pod via a port-forward, TTFB times are in the 300 ms range.

Via the ingress I have noticed:

  • TTFB times between 1 and 5 s
  • Happens a lot on GET and OPTIONS calls, but not always
  • Occurs randomly on other calls
  • These are all fetch calls (AJAX-style), and within a single browser reload it happens only once

We have two rules in our configuration (three when the echoserver is up), plus TLS.

I also tried this against the "echo server" and I see long (>300 ms) TTFB on GET /favicon.ico.

My best guess at reproduction is to:

  • Set up a cluster
  • Deploy the "echo server" gcr.io/google_containers/echoserver:1.4
  • Deploy another webserver that the ingress can communicate with
  • Create an ingress that has two backends and tls
  • Open Chrome with developer tools open to the Network tab
  • Hit the echo server
  • Try this several times
  • Notice that favicon.ico will vary between acceptable TTFB times of <100 ms and >300 ms, possibly even as high as 1 s
@djensen47
Author

djensen47 commented May 1, 2018

Here is our ingress config:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: brewd-ingress
spec:
  tls:
  - hosts:
    - stage-api2.example.com
    - stage-app2.example.com
    - stage-echo.example.com
    secretName: redacted
  rules:
  - host: stage-api2.example.com
    http:
      paths:
      - backend:
          serviceName: gateway-service
          servicePort: 7000
  - host: stage-app2.example.com
    http:
      paths:
      - backend:
          serviceName: web-service
          servicePort: 8080
  - host: stage-echo.example.com
    http:
      paths:
      - backend:
          serviceName: echoserver
          servicePort: 8080

And the gateway service config (the web service is similar):

apiVersion: v1
kind: Service
metadata:
  name: gateway-service
  labels: 
    app: gateway
spec:
  type: NodePort
  ports:
  - port: 7000
  selector:
    app: gateway
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: gateway-deployment
spec:
  selector:
    matchLabels:
      app: gateway
  replicas: 1
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
      - name: gateway
        image: us.gcr.io/redacted/gateway:1.3.0
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
        env:
        - name: REDACTED_ENV
          value: stage

@nicksardo
Contributor

As Ahmet mentioned in https://groups.google.com/forum/#!topic/kubernetes-users/omg-b8_FcBM, this is better answered by GCP support.

Could you try simplifying the repro by testing outside the context of GKE/Kubernetes? Spin up an instance running echoserver and create an L7 LB through the GCP Console.

@djensen47
Author

I now have a GCP support ticket open.

However, at least two others have now chimed in that they are experiencing the same problem with a similar setup.

@nicksardo
Contributor

Copying my response to kubernetes-users, which djensen47 indicates worked for him.

I created an HTTP LB setup on GCP using a Go HTTP server without Kubernetes and was able to see rare long-tail latencies of >1 second. After I set IdleTimeout to longer than ten minutes, I stopped seeing those slow responses. The echoheaders image uses nginx and doesn't set keepalive_timeout (sent a PR to update this).
This expected timeout behavior is explained in the GCP documentation at https://cloud.google.com/compute/docs/load-balancing/http/#timeouts_and_retries

@djensen47
Author

djensen47 commented May 18, 2018 via email

@djensen47
Author

Hi, this problem has returned. I'm not sure why I didn't experience the issue immediately after setting the IdleTimeout, but a few days later the problem came back. Not only that, a few others are experiencing the same problem.

I've tried to bring this up with paid support, but all they're doing is deflecting the ticket.

@b99andla

b99andla commented Jan 8, 2020

@djensen47 Any news on this? We have the same problem: a WordPress deployment that intermittently gets 5-6 second latencies when using GCP Ingress...

@djensen47
Author

I did what @nicksardo suggested and it eventually worked; I think that, plus an upgrade to the cluster, is when it started working. Not sure how to fix it for WordPress. (My recommendation: don't use WordPress 😉.)

@fsjones

fsjones commented Feb 11, 2022

> @djensen47 Any news on this? We have the same problem, wordpress deployment that gets 5-6 second latencies intermittently when using GCP Ingress...

Did you find a solution for this for wordpress? I may be having a similar issue.
