Grafana loki deploys sometimes with a bad read pod #15191

Open
someStrangerFromTheAbyss opened this issue Nov 29, 2024 · 3 comments

someStrangerFromTheAbyss (Contributor) commented Nov 29, 2024

When deploying Grafana Loki in Simple Scalable mode with multiple read pods in a Kubernetes cluster, you sometimes end up with a Loki read pod that cannot execute any queries. The problem shows up in Grafana as a 504 Gateway Timeout similar to this issue, and is also linked to the 499 nginx issue found here.

Expected behavior
Grafana Loki should deploy without problems and should not end up with a "tainted" read pod for no reason.

Environment:

  • Deployed using the official Loki Helm chart, version 6.10.0
  • Deploying Grafana Loki version 3.2.1
  • Deployed on an internal cloud, using Cilium version 1.16.4
  • Storage is Azure Blob Storage

How to replicate:

  • Deploy using helm upgrade or helm install. Here is my final Loki config file:
auth_enabled: false
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 1
  storage:
    azure:
      account_key: CREDENTIALS
      account_name: CREDENTIALS
      container_name: loki
      request_timeout: 30s
      use_federated_token: false
      use_managed_identity: false
compactor:
  compaction_interval: 10m
  delete_request_cancel_period: 24h
  delete_request_store: azure
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  working_directory: /tmp
frontend:
  scheduler_address: ""
  tail_proxy_url: ""
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: simple
ingester:
  chunk_idle_period: 30m
  chunk_target_size: 1572864
  flush_check_period: 15s
  wal:
    replay_memory_ceiling: 1024MB
limits_config:
  allow_structured_metadata: true
  ingestion_burst_size_mb: 30
  ingestion_rate_mb: 30
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 100
  max_concurrent_tail_requests: 100
  max_entries_limit_per_query: 1000
  max_global_streams_per_user: 50000
  max_label_names_per_series: 17
  max_line_size_truncate: false
  max_query_parallelism: 128
  max_query_series: 50
  max_streams_matchers_per_query: 100
  per_stream_rate_limit: 5Mb
  per_stream_rate_limit_burst: 20Mb
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 14d
  shard_streams:
    desired_rate: 3000000
    enabled: true
  split_queries_by_interval: 1h
  tsdb_max_bytes_per_shard: 2GB
  tsdb_max_query_parallelism: 2048
  volume_enabled: true
memberlist:
  join_members:
  - loki-memberlist
pattern_ingester:
  enabled: false
query_range:
  align_queries_with_step: true
ruler:
  alertmanager_url: SECRET
  enable_alertmanager_v2: true
  enable_api: true
  storage:
    azure:
      account_key: CREDENTIAL
      account_name: CREDENTIAL
      container_name: loki
      request_timeout: 30s
      use_federated_token: false
      use_managed_identity: false
    type: azure
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
  - from: "2024-07-29"
    index:
      period: 24h
      prefix: index_
    object_store: azure
    schema: v13
    store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 60000000
  grpc_server_max_send_msg_size: 60000000
  http_listen_port: 3100
  http_server_idle_timeout: 600s
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  log_level: debug
storage_config:
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      log_gateway_requests: true
      server_address: dns+loki-backend-headless.6723a512e7641cd9c37269ed.svc.cluster.local:9095
tracing:
  enabled: true
  • Deploy with 3 read pods, 1 backend pod, 1 write pod and 1 gateway pod. Here are my Helm values:
backend:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-backend
  podLabels:
    app: logs
    team: mops
  replicas: 1
  resources:
    limits:
      memory: 1280Mi
    requests:
      cpu: 1
      memory: 1280Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
chunksCache:
  enabled: false
distributor:
  receivers:
    otlp:
      grpc:
        max_recv_msg_size_mib: 60000000
enterprise:
  enabled: false
frontend:
  max_outstanding_per_tenant: 1000
  scheduler_worker_concurrency: 15
fullnameOverride: loki
gateway:
  enabled: true
  ingress:
    enabled: false
  nodeSelector:
    dedicated: logs-instance
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
index:
  in-memory-sorted-index:
    retention_period: 24h
  period: 168h
  prefix: index_
ingress:
  enabled: false
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  compactor:
    compaction_interval: 10m
    delete_request_cancel_period: 24h
    delete_request_store: azure
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    retention_enabled: true
    working_directory: /tmp
  configStorageType: Secret
  image:
    pullPolicy: IfNotPresent
    tag: 3.2.1
  ingester:
    chunk_idle_period: 30m
    chunk_target_size: 1572864
    flush_check_period: 15s
    wal:
      replay_memory_ceiling: 1024MB
  limits_config:
    allow_structured_metadata: true
    ingestion_burst_size_mb: 30
    ingestion_rate_mb: 30
    ingestion_rate_strategy: local
    max_cache_freshness_per_query: 10m
    max_chunks_per_query: 100
    max_concurrent_tail_requests: 100
    max_entries_limit_per_query: 1000
    max_global_streams_per_user: 50000
    max_label_names_per_series: 17
    max_line_size_truncate: false
    max_query_parallelism: 128
    max_query_series: 50
    max_streams_matchers_per_query: 100
    per_stream_rate_limit: 5Mb
    per_stream_rate_limit_burst: 20Mb
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    retention_period: 14d
    shard_streams:
      desired_rate: 3000000
      enabled: true
    split_queries_by_interval: 1h
    tsdb_max_bytes_per_shard: 2GB
    tsdb_max_query_parallelism: 2048
  rulerConfig:
    alertmanager_url: SECRET
    enable_alertmanager_v2: true
    enable_api: true
  schemaConfig:
    configs:
    - from: "2024-07-29"
      index:
        period: 24h
        prefix: index_
      object_store: azure
      schema: v13
      store: tsdb
  server:
    grpc_listen_port: 9095
    grpc_server_max_recv_msg_size: 60000000
    grpc_server_max_send_msg_size: 60000000
    http_listen_port: 3100
    http_server_idle_timeout: 600s
    log_level: debug
  storage:
    azure:
      accountKey: CREDENTIALS
      accountName: CREDENTIALS
      requestTimeout: 30s
      useManagedIdentity: false
    bucketNames:
      admin: loki
      chunks: loki
      ruler: loki
    type: azure
  storage_config:
    boltdb_shipper: null
    tsdb_shipper:
      index_gateway_client:
        grpc_client_config:
          connect_timeout: 1s
        log_gateway_requests: true
  tracing:
    enabled: true
lokiCanary:
  enabled: false
minio:
  enabled: false
monitoring:
  dashboards:
    enabled: false
  lokiCanary:
    enabled: false
  rules:
    alerting: false
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  serviceMonitor:
    enabled: false
    metricsInstance:
      enabled: false
nameOverride: loki
querier:
  extra_query_delay: 500ms
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 60000000
  max_concurrent: 6
query_range:
  max_concurrent: 6
  parallelise_shardable_queries: true
  results_cache:
    cache_results: true
    cache_validity: 5m
  split_queries_by_interval: 1h
query_scheduler:
  max_outstanding_requests_per_tenant: 1000
rbac:
  namespaced: true
read:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-read
  podLabels:
    name: dev-test-multiple-change
    scrape: "true"
    projectId: 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
    app: logs
    team: mops
  replicas: 3
  resources:
    limits:
      cpu: 3
      memory: 2560Mi
    requests:
      cpu: 100m
      memory: 1280Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
resultsCache:
  enabled: false
sidecar:
  rules:
    enabled: false
test:
  enabled: false
write:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  autoscaling:
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 1800
          type: Pods
          value: 1
        stabilizationWindowSeconds: 3600
      scaleUp:
        policies:
        - periodSeconds: 900
          type: Pods
          value: 1
    enabled: true
    maxReplicas: 25
    minReplicas: 1
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-write
  podLabels:
    scrape: "true"
    projectId: 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
    app: logs
    team: mops
  replicas: 1
  resources:
    limits:
      memory: 2474Mi
    requests:
      cpu: 625m
      memory: 2474Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance

  • If you have no query problems, redeploy using helm upgrade. Repeat the process multiple times (it can take 1 attempt or 20). A probe you can run after each upgrade is sketched below.
  • At some point, you will get a "bad deployment" and one of the read pods will no longer be able to execute queries.
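
A rough probe that can be run after each upgrade to spot a bad pod. This is just a sketch, not part of my setup: it assumes the read pods carry the chart's app.kubernetes.io/component=read label, listen on port 3100 as configured above, and live in a namespace called loki.

# Sketch: port-forward to each read pod in turn and probe the labels endpoint.
# A healthy pod answers quickly; a "tainted" one accepts the request and never responds.
NAMESPACE=loki   # assumption: adjust to your namespace
for pod in $(kubectl -n "$NAMESPACE" get pods -l app.kubernetes.io/component=read -o name); do
  kubectl -n "$NAMESPACE" port-forward "$pod" 3100:3100 >/dev/null 2>&1 &
  PF_PID=$!
  sleep 2
  if curl -sf --max-time 15 "http://127.0.0.1:3100/loki/api/v1/labels" >/dev/null; then
    echo "$pod: OK"
  else
    echo "$pod: no response within 15s (possible bad read pod)"
  fi
  kill "$PF_PID"
done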

Screenshots, Promtail config, or terminal output

At first, I thought my problem was linked to the previously mentioned issue and to what some other users are facing, since this problem does end up with a 504 Gateway Timeout when it happens. It also produces this error in the nginx pods linked to our ingress:

10.171.106.139 - - [28/Nov/2024:20:59:22 +0000] "GET /loki/api/v1/labels?start=1732826612721000000&end=1732827512721000000 HTTP/1.1" 499 0 "-" "Grafana/11.1.3" 3985 49.972 [6723a512e7641cd9c37269ed-loki-read-3100] [] 172.16.6.5:3100 0 49.971 - 8ee94e8c2e23c4261ecbd3e8c037d4bb

In this case, the bad pod is 172.16.6.5 among my 3 read pods. Seeing this error, I tried:

  • Upgrading to the latest version of Cilium, now 1.16.4. It did not help.
  • Looking at some threads around GitHub, I tried to set:
    socketLB:
      enabled: true
      terminatePodConnections: true

on my Cilium setup. It did not help.

  • I removed the old index config attached to my Grafana Loki (boltdb-shipper) and kept only the tsdb_shipper. It did not help.

I then tried to execute the query directly against the pod. Using port forwarding, I sent the following request to the "tainted" read pod:

http://localhost:50697/loki/api/v1/query_range?direction=backward&end=1732889832897000000&limit=1000&query=%7Bubi_project_id%3D%2227edfc6c-eb78-4790-a6c8-ed82a0478f7c%22%7D+%7C%3D+%60%60&start=1732886232897000000&step=2000ms

[screenshot]

This resulted in a request that just returned... nothing. I waited 30 minutes in Postman and only got this log line confirming the pod received the query:

level=info ts=2024-11-29T15:58:49.425174467Z caller=roundtrip.go:364 org_id=fake traceID=761b6d90af6ae5c8 msg="executing query" type=range query="{ubi_project_id=\"27edfc6c-eb78-4790-a6c8-ed82a0478f7c\"} |= ``" start=2024-11-29T13:17:12.897Z end=2024-11-29T14:17:12.897Z start_delta=2h41m36.528171657s end_delta=1h41m36.528171836s length=1h0m0s step=2000 query_hash=2028412130
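
For reference, the same request expressed as curl through the port-forward (50697 is just my local port-forward port; the parameters are decoded from the URL above):

# Sketch: -m 60 caps the wait at 60s so a hanging pod shows up as a curl timeout
# instead of a silent stall.
curl -m 60 -G "http://localhost:50697/loki/api/v1/query_range" \
  --data-urlencode 'query={ubi_project_id="27edfc6c-eb78-4790-a6c8-ed82a0478f7c"} |= ``' \
  --data-urlencode 'limit=1000' \
  --data-urlencode 'direction=backward' \
  --data-urlencode 'start=1732886232897000000' \
  --data-urlencode 'end=1732889832897000000' \
  --data-urlencode 'step=2000ms'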

When trying the same request against a different pod from the same Helm release, I get the correct HTTP response:
[screenshot]

And I can see the Grafana Loki logs telling me the query was a success.

If you query through Grafana, at some point you will hit the bad pod and get either an EOF error or a 504 Gateway Timeout. I already posted the nginx error log, see above.

How to fix

It is only a temporary fix, but the only known workaround is to simply restart the read pods (a sketch is shown below). That's it. If you redo a helm upgrade, there is a chance the same scenario repeats itself. This should not happen, and at this point I am almost certain it is a Loki problem and not a networking problem. Still, if a Grafana Loki dev could help me find a way to check what my read pod is missing or why it is in a bad state, I'll take any suggestion.
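
The restart itself is just a rollout restart of the read workload. A sketch, assuming the release is named loki (fullnameOverride: loki) in a namespace called loki; use whichever of Deployment or StatefulSet the chart rendered for the read component:

# Restart the read pods (temporary workaround only; the bad pod can come back on the next upgrade).
kubectl -n loki rollout restart deployment/loki-read \
  || kubectl -n loki rollout restart statefulset/loki-read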

JStickler (Contributor) commented Dec 2, 2024

Deploy with 3 read pods, 1 backend pod, 1 write pod and 1 gateway pod.

Curious why you aren't following the recommendations from the documentation, which is 3 read, 3 write, and 3 backend components? My suspicion is that having only a single Query Scheduler and a single Index Gateway may be part of the problem here, as those support the read components during querying.

someStrangerFromTheAbyss (Contributor, Author) commented Dec 2, 2024

Yeah, I just added that last Friday to our DEV/QA env, and the bad read pod problem seems to disappear. I haven't updated the issue since I want client confirmation that the fix works.

However, shouldn't it work even with one backend pod? Also, it seems many users also have the 499 bug mentioned in the other issues; I assume they have the same problem. If it is necessary for Loki to have more than 1 backend pod, shouldn't the app stop working if it has only 1 backend instance?
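
Roughly what that sizing change looks like on top of the values above (a sketch, not my exact commands; it assumes the grafana/loki chart repo alias and a release named loki in namespace loki). read is already at 3 replicas in my values, backend goes from 1 to 3, and since write is autoscaled here the HPA floor is raised instead of write.replicas:

# Sketch only: bump backend to 3 and raise the write autoscaler's minimum to 3.
helm upgrade loki grafana/loki -n loki --reuse-values \
  --set backend.replicas=3 \
  --set write.autoscaling.minReplicas=3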

JStickler (Contributor) commented:

If it is necessary for Loki to have more than 1 backend pod, shouldn't the app stop working if it has only 1 backend instance?

Not necessarily. From my understanding of Kubernetes (I'm a technical writer, not a developer or sysadmin), you might be scaling up or down, upgrading, or pods might be restarting. The idea of having more than one pod is to allow for fluctuations in the number of running pods, so that the system stays up if one pod goes down for some reason. It also helps balance the load if there is more work than one pod can handle.
