[BUG] [Loki] Loki hangs on labels request #7496

Open
zs-dima opened this issue Oct 22, 2022 · 31 comments
Labels
component/querier need-investigation type/bug Something is not working as expected

Comments

@zs-dima

zs-dima commented Oct 22, 2022

Loki often hangs on the labels request. However, sometimes after a restart it works correctly.

I tried the latest, 2.5.0 and main grafana/loki versions - the issue is still there.

Server Oracle Cloud A1 ARM with Ubuntu 22.04.1 LTS
Docker version 20.10.20, build 9fdeb9c

$ curl http://127.0.0.1:3100/ready
ready
$ curl http://127.0.0.1:3100/loki/api/v1/labels

<- hangs indefinitely

loki.yml
https://gist.github.com/zs-dima/e23237f7c7e1ff6011588fb77670412f

The system is not overloaded either:

top - 11:00:50 up  1:18,  1 user,  load average: 0.07, 0.04, 0.01
Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.7 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  11940.3 total,   9997.1 free,    554.8 used,   1388.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  11185.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1090 root      20   0 1499876 155628  43832 S   0.7   1.3   0:23.61 dockerd
   1170 root      20   0  769844  49120  38308 S   0.7   0.4   0:13.55 docker-driver
   2296 10001     20   0  758176  59764  40592 S   0.3   0.5   0:06.19 loki
   2613 root      20   0 1792028  77404  46788 S   0.3   0.6   0:16.96 promtail
      1 root      20   0  102456  11660   7724 S   0.0   0.1   0:03.22 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.01 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp

Loki logs have no errors or warnings;
however, rarely an "error contacting scheduler" message appears:
caller=scheduler_processor.go:87 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.10.0.4:9095: i/o timeout\"" addr=10.10.0.4:9095

@DylanGuedes
Contributor

Hi, thanks for reporting it.

It is really hard to tell if this is a problem with your setup or a bug. What I can say is that a lot of users have been running Loki without experiencing the same thing, so it might just be a misconfiguration.
Suggestions:

  • Try running with log.level=debug and see if anything interesting is logged (see the sketch after this list)
  • In case you're using Tempo, check the traces of the labels request
  • Check whether or not the same is happening for range requests
  • Try extracting a dump and see if it isn't getting locked anywhere
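
For the debug-logging and dump suggestions, something like this works with the single-binary Docker image (a sketch; the image tag, config mount and port mapping are assumptions based on the setup described above):

docker run -d --name loki -p 3100:3100 \
  -v "$(pwd)/loki.yml:/etc/loki/loki.yml" \
  grafana/loki:2.6.1 \
  -config.file=/etc/loki/loki.yml \
  -log.level=debug

# While a labels request is hanging, capture a goroutine dump to attach to the issue.
curl -s "http://127.0.0.1:3100/debug/pprof/goroutine?debug=2" > goroutine-dump.txt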

@zs-dima
Author

zs-dima commented Oct 22, 2022

@DylanGuedes, thanks a lot; with log.level=debug Loki started without hanging - I will collect more information on the next restart, when Loki hangs again.

a lot of users have been running Loki without experiencing the same thing

Most users do not use ARM servers, I guess, keeping in mind these issues are still open:
#5682
#4723

@DylanGuedes
Contributor

Most users do not use ARM servers, I guess, keeping in mind these issues are still open: #5682 #4723

Good point.

Btw, I'm curious that it ran fine for you just by changing the log level. No idea what happened there 😄

@zs-dima
Author

zs-dima commented Oct 23, 2022

@DylanGuedes Reproduced with the next restart, as expected.
I could provide access to the VM if somebody from the team or contributors is interested in investigating.

Check whether or not the same is happening for range requests

Range requests run without hangs, same as
$ curl http://127.0.0.1:3100/loki/api/v1/status/buildinfo
It looks like only $ curl http://127.0.0.1:3100/loki/api/v1/labels hangs,
so it looks like a Loki issue.

Try running with log.level=debug and see if anything interesting is logged

monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.843092033Z caller=main.go:103 msg="Starting Loki" version="(version=2.6.1, branch=HEAD, revision=6bd05c9a4)"
monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.843424473Z caller=modules.go:736 msg="RulerStorage is not configured in single binary mode and will not be started."
monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.843842273Z caller=server.go:288 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.845078833Z caller=modules.go:548 msg="initializing query frontend tripperware"
monitoring_loki.1.zef6wzaeuszu@sky    | level=warn ts=2022-10-23T20:49:42.845531513Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksfifocache"
monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.851438192Z caller=modules.go:761 msg="RulerStorage is nil.  Not starting the ruler."
monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.851657312Z caller=modules.go:863 msg="Not using boltdb-shipper index, not starting compactor"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.851875392Z caller=modules.go:604 msg="initializing query frontend" config="{Handler:{LogQueriesLongerThan:0s MaxBodySize:10485760 QueryStatsEnabled:false} FrontendV1:{MaxOutstandingPerTenant:2048 QuerierForgetDelay:0s} FrontendV2:{SchedulerAddress: DNSLookupPeriod:10s WorkerConcurrency:5 GRPCClientConfig:{MaxRecvMsgSize:104857600 MaxSendMsgSize:104857600 GRPCCompression: RateLimit:0 RateLimitBurst:0 BackoffOnRatelimits:false BackoffConfig:{MinBackoff:100ms MaxBackoff:10s MaxRetries:10} TLSEnabled:false TLS:{CertPath: KeyPath: CAPath: ServerName: InsecureSkipVerify:false}} InfNames:[eth0 eth3 eth1 eth2 lo] Addr: Port:0} CompressResponses:false DownstreamURL: TailProxyURL: TLS:{CertPath: KeyPath: CAPath: ServerName: InsecureSkipVerify:false}}"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.852200832Z caller=modules.go:630 msg="using query frontend" version=v2
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.852892272Z caller=worker_service.go:147 msg="determining if querier is running as standalone target" runningStandalone=false queryFrontendEnabled=false queryScheduleEnabled=false readEnabled=false allEnabled=true
monitoring_loki.1.zef6wzaeuszu@sky    | level=info ts=2022-10-23T20:49:42.853087272Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.854338792Z caller=netutil.go:56 msg="found network interfaces with private IP addresses assigned" interfaces="eth0 eth3 eth1 eth2"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.854653152Z caller=netutil.go:56 msg="found network interfaces with private IP addresses assigned" interfaces="eth0 eth3 eth1 eth2"
monitoring_loki.1.zef6wzaeuszu@sky    | level=debug ts=2022-10-23T20:49:42.854926072Z caller=netutil.go:56 msg="found network interfaces with private IP addresses assigned" interfaces="eth0 eth3 eth1 eth2"

@chaudum
Contributor

chaudum commented Oct 24, 2022

Could you get a goroutine dump when a labels request hangs and provide its output?

$ curl "http://localhost:3100/debug/pprof/goroutine?debug=2"
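
Redirecting the output to a file makes it easier to attach, and two dumps taken a few seconds apart make it easier to spot goroutines that are stuck in the same place (a sketch):

curl -s "http://localhost:3100/debug/pprof/goroutine?debug=2" > goroutine-1.txt
sleep 15
curl -s "http://localhost:3100/debug/pprof/goroutine?debug=2" > goroutine-2.txt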

@zs-dima
Author

zs-dima commented Oct 24, 2022

@chaudum thanks a lot for looking into it.
I attached the dump:
loki.txt

@zs-dima
Author

zs-dima commented Oct 24, 2022

Some error logs from the latest hang:

monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:20.453793617Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksfifocache"
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:26.48688915Z caller=logging.go:86 traceID=3d7325d81bdc5489 orgID=fake msg="POST /loki/api/v1/push (500) 5.001662809s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 396; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:27.941027728Z caller=logging.go:86 traceID=7223df8338dd6720 orgID=fake msg="POST /loki/api/v1/push (500) 5.000668575s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 330; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:30.47834382Z caller=logging.go:86 traceID=289e9e49501dbe49 orgID=fake msg="POST /loki/api/v1/push (500) 5.001132302s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 587; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:32.081421476Z caller=logging.go:86 traceID=0c932a306e915e93 orgID=fake msg="POST /loki/api/v1/push (500) 5.001032948s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 396; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:33.171961743Z caller=logging.go:86 traceID=78439cf9f7bfc629 orgID=fake msg="POST /loki/api/v1/push (500) 5.001352191s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 516; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:33.861565217Z caller=logging.go:86 traceID=66c1d433b891f1e0 orgID=fake msg="POST /loki/api/v1/push (500) 5.000404995s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 330; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:34.636603836Z caller=logging.go:86 traceID=41c0bf9b8e3ccfe5 orgID=fake msg="POST /loki/api/v1/push (500) 5.020914175s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 248097; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:36.291686438Z caller=logging.go:86 traceID=227bcce97528b6cc orgID=fake msg="POST /loki/api/v1/push (500) 5.001326522s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 587; Content-Type: application/x-protobuf; User-Agent: promtail/; "
monitoring_loki.1.vo1fze03140n@sky    | level=warn ts=2022-10-24T20:15:36.476341193Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=10.10.0.4:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@chaudum
Contributor

chaudum commented Nov 28, 2022

Hey @zs-dima, the latest logs you posted suggest that the issue is not only related to the labels query, but also affects push requests. There may be a deadlock, since all of the push requests time out after 5s.
I also had a look at your goroutine dump, but I could not find anything suspicious :(
I still need to try to reproduce it on an ARM based installation, though.

@zs-dima
Author

zs-dima commented Nov 28, 2022

@chaudum thanks a lot for looking into it. It is strange that sometimes after a restart Loki works well and sometimes it hangs.

@zs-dima
Author

zs-dima commented Nov 28, 2022

@chaudum you could use a free Oracle A1 instance, or I could provide access to the VM I am testing on.

@mizutani256

mizutani256 commented Jan 18, 2023

Hello! Not sure if it's the same problem since I'm not using ARM, but the symptoms are exactly the same: the ready query works, labels hang. In my case it happens all the time. Here's more information.

I'm running Loki 2.6.1 using a docker swarm stack. The stack is set up using Ansible over a GitLab CI pipeline. Here's the docker setup:

  loki:
    image: grafana/loki:2.6.1
    command: "-config.file=/loki/loki.yml"
    ports:
    - "3100:3100"
    volumes:
    - "{{ loki_dir }}:/loki"

And the loki configuration:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1m
  chunk_retain_period: 30s
  wal:
    enabled: true
    dir: "/loki/wal"

schema_config:
  configs:
  - from: 2023-01-01
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 8760h # 1 year

storage_config:
  boltdb:
    directory: "/loki/index"

  filesystem:
    directory: "/loki/chunks"

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Here's Loki's output when I try to query the labels then CTRL+C after a while:

ts=2023-01-18T17:40:03.60990286Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=10.057671ms status=200 label= throughput=0B total_bytes=0B total_entries=0
level=warn ts=2023-01-18T17:40:06.735971157Z caller=pool.go:184 msg="removing frontend failing healthcheck" addr=10.0.0.68:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-01-18T17:40:06.736846343Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing" frontend=10.0.0.68:9095
level=error ts=2023-01-18T17:41:07.425319419Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-01-18T17:41:07.425533933Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "

The environment is Ubuntu 22.04 on an amd64 architecture (as output by dpkg --print-architecture).

I'll try the dump and update here; hope it helps to further narrow down the cause.

UPDATE 1: Hangs for query_range too. The output is

level=warn ts=2023-01-18T18:09:06.736831084Z caller=pool.go:184 msg="removing frontend failing healthcheck" addr=10.0.0.68:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-01-18T18:09:06.737111932Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing" frontend=10.0.0.68:9095

UPDATE 2: Attaching dump (run during a query to labels that hung)

dump.txt

@hrvatskibogmars

hrvatskibogmars commented Jan 23, 2023

Facing the same issue (Loki 2.7.0)

Everything works for a few days then suddenly stops (after redeploying it works again). This call just gets stuck and never finishes.

I enabled debug logs but I can't see anything relevant.

loki-query-frontend:3100/loki/api/v1/labels?start=1674463671803000000&end=1674464271803000000

If needed I can share more context about this issue.
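
For reference, the start and end parameters in the URL above are Unix timestamps in nanoseconds; a recent window can be generated like this (a sketch assuming GNU date and the frontend host used above):

start=$(date -d '-10 minutes' +%s%N)
end=$(date +%s%N)
curl "http://loki-query-frontend:3100/loki/api/v1/labels?start=${start}&end=${end}"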

@zs-dima
Copy link
Author

zs-dima commented Feb 6, 2023

Reproduced this issue with x64 Debian as well, so it is not only an ARM issue.

@NeonSludge

NeonSludge commented Feb 8, 2023

Getting this with Loki 2.7.3 deployed in Kubernetes in microservices mode via this Helm chart from Bitnami on an AlmaLinux node (amd64):

  1. Deploy Grafana and Loki.
  2. Add a Loki datasource to Grafana, click "Save and test".
  3. The /labels request times out.
  4. Restart the frontend or the query-scheduler microservice (see the command sketch after this list), don't touch anything else.
  5. The /labels request works fine.
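
For step 4, with a Helm-based install this is just a rollout restart of the affected workload (a sketch; the namespace and workload names are assumptions and will differ between releases):

kubectl -n observability rollout restart deployment/grafana-loki-query-frontend
# or, for the scheduler:
kubectl -n observability rollout restart deployment/grafana-loki-query-scheduler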

I've recorded a couple of sysdig traces for the frontend microservice and reproduced this issue. It turns out that it doesn't even try to contact the query scheduler when the /labels request times out.

Before restarting the frontend

10.233.121.76 = Grafana
10.233.121.75 = Loki frontend

1. Receive /labels request from client

2428 19:50:07.042588270 1 loki (15088.15146) > read fd=12(<4t>10.233.121.76:55144->10.233.121.75:3100) size=4096
2429 19:50:07.042590518 1 loki (15088.15146) < read res=240 data=GET /loki/api/v1/labels?start=1675874407046000000&end=1675875007046000000 HTTP/1.0..Host: grafana-loki-query-frontend.observability.svc.cluster.local:3100..Connection: close..User-Agent: Grafana/9.3.6..Accept-Encoding: gzip, deflate, br....

2. Nothing of note happens

3. Client timeout, HTTP 499

14829 19:50:22.041519281 2 loki (15088.15088) > read fd=12(<4t>10.233.121.76:55144->10.233.121.75:3100) size=1
14830 19:50:22.041522946 2 loki (15088.15088) < read res=0 data=NULL
14842 19:50:22.041756806 2 loki (15088.15088) > write fd=12(<4t>10.233.121.76:55144->10.233.121.75:3100) size=175
14846 19:50:22.041814671 2 loki (15088.15088) < write res=175 data=HTTP/1.0 499 status code 499..Content-Type: application/json; charset=UTF-8..Vary: Accept-Encoding..Date: Wed, 08 Feb 2023 16:50:22 GMT..Content-Length: 16....context canceled
14893 19:50:22.111986584 3 loki (15088.15146) > write fd=2(<p>pipe:[101603]) size=137
14898 19:50:22.111992978 3 loki (15088.15146) < write res=137 data=level=error ts=2023-02-08T16:50:22.041566775Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled".

After restarting the frontend

10.233.121.76 = Grafana
10.233.121.84 = Loki frontend
10.233.11.69 = Loki query scheduler
10.233.121.78 = Loki querier

1. Receive /labels request from client

3416 20:10:30.716592660 2 loki (32974.32988) > read fd=12(<4t>10.233.121.76:56822->10.233.121.84:3100) size=4096
3417 20:10:30.716594249 2 loki (32974.32988) < read res=240 data=GET /loki/api/v1/labels?start=1675875630720000000&end=1675876230720000000 HTTP/1.0..Host: grafana-loki-query-frontend.observability.svc.cluster.local:3100..Connection: close..User-Agent: Grafana/9.3.6..Accept-Encoding: gzip, deflate, br....

2. Talk to scheduler

3431 20:10:30.716819667 2 loki (32974.32988) > write fd=10(<4t>10.233.121.84:36826->10.233.11.69:9095) size=225
3432 20:10:30.716848286 2 loki (32974.32988) < write res=225 data=..................10.233.121.84:9095.........z".fake*....GET.E/loki/api/v1/labels?end=1675876230720000000&start=1675875630720000000....X-Scope-Orgid..fake.E..uber-trace-id.42af8e03f6cc76cc4:2b5f5db41eb117d8:29e3828608416c25:0
<...>
3439 20:10:30.717173937 2 loki (32974.32987) > read fd=10(<4t>10.233.121.84:36826->10.233.11.69:9095) size=32768
3440 20:10:30.717175801 2 loki (32974.32987) < read res=44 data=............................................

3. Receive results from querier

3493 20:10:30.720294040 1 loki (32974.32991) < accept fd=13(<4t>10.233.121.78:34578->10.233.121.84:9095) tuple=10.233.121.78:34578->10.233.121.84:9095 queuepct=0 queuelen=0 queuemax=128
<...>
3526 20:10:30.720445881 0 loki (32974.32991) > read fd=13(<4t>10.233.121.78:34578->10.233.121.84:9095) size=24
3528 20:10:30.720447567 0 loki (32974.32991) < read res=24 data=PRI * HTTP/2.0....SM....
<...>
3550 20:10:30.720622463 1 loki (32974.32987) > read fd=13(<4t>10.233.121.78:34578->10.233.121.84:9095) size=32768
3555 20:10:30.720631414 1 loki (32974.32987) < read res=167 data=.......................z......./..Content-Type..application/json; charset=UTF-8.V{"status":"success","data":["cluster","container","namespace","node","pod","stream"]}.

4. Send HTTP 200 response to client

3652 20:10:30.721202346 1 loki (32974.32974) > write fd=12(<4t>10.233.121.76:56822->10.233.121.84:3100) size=217
3655 20:10:30.721244046 1 loki (32974.32974) < write res=217 data=HTTP/1.0 200 OK..Content-Type: application/json..Vary: Accept-Encoding..Date: Wed, 08 Feb 2023 17:10:30 GMT..Content-Length: 86....{"status":"success","data":["cluster","container","namespace","node","pod","stream"]}.

No errors are being logged by the Loki frontend during all of this except for the context cancelled error being produced when the /labels request times out. This definitely looks like a deadlock somewhere inside the query frontend.

Here is a goroutine dump from the query frontend process:
loki-frontend-labels-hang.txt

UPDATE: making the querier and scheduler Kubernetes services headless (ClusterIP: None) resolves this issue (frontend works immediately after going up, no restarts necessary).

UPDATE 2: spoke too soon, the same deadlock still happens sometimes.
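
For anyone trying to capture a similar dump, the frontend can be port-forwarded and the standard Go pprof endpoint queried directly (a sketch; the service and namespace names are taken from the trace above and will differ in other installs):

kubectl -n observability port-forward svc/grafana-loki-query-frontend 3100:3100 &
curl -s "http://localhost:3100/debug/pprof/goroutine?debug=2" > loki-frontend-goroutines.txt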

@joeky888

joeky888 commented Mar 14, 2023

Same issue. Deleting the S3 bucket and DynamoDB table, then doing a clean install of Loki and Promtail with the Helm chart, solved the issue.

Edit:
No wait, it doesn't solve the problem, see below.

@joeky888

joeky888 commented Mar 14, 2023

I found the problem: I had a terrible label in the Promtail config:

config:
  snippets:
    pipelineStages:
      - cri: {} # "docker" For k8s <= 1.23, "cri" for k8s >= 1.24
      - multiline:
          firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<cri>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\])\s(?P<message>(?s:.*))$'
      - labels:
          cri: # Previous regex <cri> 👈 THIS IS A TERRIBLE IDEA! REMOVE THIS LINE!
      - output:
          source: message # Previous regex <message>

The problem is that the cri label above has a massive number of unique values; you can verify it by using this command:

curl --max-time 600 http://loki:3100/loki/api/v1/label/<potential_terrible_label>/values

For this example, it would be:

curl --max-time 600 http://loki:3100/loki/api/v1/label/cri/values

And this is what I got, very bad:

{"status":"success","data":["[2023-03-14 11:50:25]","[2023-03-14 11:50:28]","[2023-03-14 11:50:30]","[2023-03-14 11:50:31]","[2023
-03-14 11:50:32]","[2023-03-14 11:50:33]","[2023-03-14 11:50:34]","[2023-03-14 11:50:36]","[2023-03-14 11:50:40]","[2023-03-14 11:
50:45]","[2023-03-14 11:50:48]","[2023-03-14 11:50:50]","[2023-03-14 11:50:51]","[2023-03-14 11:50:52]","[2023-03-14 11:50:53]","[
2023-03-14 11:50:54]","[2023-03-14 11:50:56]","[2023-03-14 11:51:00]","[2023-03-14 11:51:05]","[2023-03-14 11:51:08]","[2023-03-14
 11:51:10]","[2023-03-14 11:51:11]","[2023-03-14 11:51:12]","[2023-03-14 11:51:13]","[2023-03-14 11:51:14]","[2023-03-14 11:51:16]
","[2023-03-14 11:51:20]","[2023-03-14 11:51:25]","[2023-03-14 11:51:28]","[2023-03-14 11:51:30]","[2023-03-14 11:51:31]","[2023-0
3-14 11:51:32]","[2023-03-14 11:51:33]","[2023-03-14 11:51:34]","[2023-03-14 11:51:36]","[2023-03-14 11:51:40]","[2023-03-14 11:51
:45]","[2023-03-14 11:51:48]","[2023-03-14 11:51:50]","[2023-03-14 11:51:51]","[2023-03-14 11:51:52]","[2023-03-14 11:51:53]","[20
23-03-14 11:51:54]","[2023-03-14 11:51:56]","[2023-03-14 11:52:00]","[2023-03-14 11:52:05]","[2023-03-14 11:52:08]","[2023-03-14 1
1:52:10]","[2023-03-14 11:52:11]","[2023-03-14 11:52:12]","[2023-03-14 11:52:13]","[2023-03-14 11:52:14]","[2023-03-14 11:52:16]",
"[2023-03-14 11:52:19]","[2023-03-14 11:52:25]","[2023-03-14 11:52:27]","[2023-03-14 11:52:30]","[2023-03-14 11:52:31]","[2023-03-
14 11:52:32]","[2023-03-14 11:52:33]","[2023-03-14 11:52:34]","[2023-03-14 11:52:36]","[2023-03-14 11:52:39]","[2023-03-14 11:52:4
5]","[2023-03-14 11:52:47]","[2023-03-14 11:52:50]","[2023-03-14 11:52:51]","[2023-03-14 11:52:52]","[2023-03-14 11:52:53]","[2023
-03-14 11:52:54]","[2023-03-14 11:52:56]","[2023-03-14 11:52:59]","[2023-03-14 11:53:05]","[2023-03-14 11:53:07]","[2023-03-14 11:
53:10]","[2023-03-14 11:53:12]","[2023-03-14 11:53:13]","[2023-03-14 11:53:14]","[2023-03-14 11:53:16]","[2023-03-14 11:53:19]","[
2023-03-14 11:53:25]","[2023-03-14 11:53:27]","[2023-03-14 11:53:30]","[2023-03-14 11:53:31]","[2023-03-14 11:53:32]","[2023-03-14
 11:53:33]","[2023-03-14 11:53:34]","[2023-03-14 11:53:36]","[2023-03-14 11:53:39]","[2023-03-14 11:53:45]","[2023-03-14 11:53:47]
","[2023-03-14 11:53:50]","[2023-03-14 11:53:51]","[2023-03-14 11:53:52]","[2023-03-14 11:53:53]","[2023-03-14 11:53:54]","[2023-0
3-14 11:53:56]","[2023-03-14 11:53:59]","[2023-03-14 11:54:05]","[2023-03-14 11:54:07]","[2023-03-14 11:54:10]","[2023-03-14 11:54
:11]","[2023-03-14 11:54:12]","[2023-03-14 11:54:13]","[2023-03-14 11:54:14]","[2023-03-14 11:54:16]","[2023-03-14 11:54:19]","[20
23-03-14 11:54:25]","[2023-03-14 11:54:26]","[2023-03-14 11:54:30]","[2023-03-14 11:54:31]","[2023-03-14 11:54:32]","[2023-03-14 1
1:54:33]","[2023-03-14 11:54:34]","[2023-03-14 11:54:36]","[2023-03-14 11:54:39]","[2023-03-14 11:54:45]","[2023-03-14 11:54:46]",
"[2023-03-14 11:54:48]","[2023-03-14 11:54:50]","[2023-03-14 11:54:51]","[2023-03-14 11:54:52]","[2023-03-14 11:54:53]","[2023-03-
14 11:54:54]","[2023-03-14 11:54:56]","[2023-03-14 11:54:59]","[2023-03-14 11:55:05]","[2023-03-14 11:55:08]","[2023-03-14 11:55:1
0]","[2023-03-14 11:55:11]","[2023-03-14 11:55:12]","[2023-03-14 11:55:13]","[2023-03-14 11:55:14]","[2023-03-14 11:55:16]","[2023
-03-14 11:55:19]","[2023-03-14 11:55:25]","[2023-03-14 11:55:28]","[2023-03-14 11:55:30]","[2023-03-14 11:55:31]","[2023-03-14 11:
55:32]","[2023-03-14 11:55:33]","[2023-03-14 11:55:34]","[2023-03-14 11:55:36]","[2023-03-14 11:55:39]","[2023-03-14 11:55:45]","[
2023-03-14 11:55:48]","[2023-03-14 11:55:49]","[2023-03-14 11:55:50]","[2023-03-14 11:55:51]","[2023-03-14 11:55:52]","[2023-03-14
 11:55:53]","[2023-03-14 11:55:54]","[2023-03-14 11:55:56]","[2023-03-14 11:55:59]","[2023-03-14 11:56:05]","[2023-03-14 11:56:08]
","[2023-03-14 11:56:09]","[2023-03-14 11:56:10]","[2023-03-14 11:56:11]","[2023-03-14 11:56:12]"
....
....
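
To check the cardinality of every label in one pass, a small loop over the labels endpoint works (a sketch; adjust the host, and note that jq is assumed to be available):

for label in $(curl -s --max-time 600 http://loki:3100/loki/api/v1/labels | jq -r '.data[]'); do
  count=$(curl -s --max-time 600 "http://loki:3100/loki/api/v1/label/${label}/values" | jq '.data | length')
  echo "${label}: ${count} values"
done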

Removing the label from the Promtail config and doing a clean install solved the issue:

config:
  snippets:
    pipelineStages:
      - cri: {} # "docker" For k8s <= 1.23, "cri" for k8s >= 1.24
      - multiline:
          firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<cri>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\])\s(?P<message>(?s:.*))$'
      #- labels:
      #    cri: # Previous regex <cri> 👈 THIS IS A TERRIBLE IDEA! REMOVE THIS LINE!
      - output:
          source: message # Previous regex <message>

And now /loki/api/v1/labels is smooth. 😳

@DylanGuedes
Contributor

Fantastic find, thank you.

@DylanGuedes
Contributor

@zs-dima @hrvatskibogmars you're probably not using CRI, but the trick shared by @joeky888 can be pretty helpful for debugging your environment. See if your labels look good; I remember you mentioning that things work well for some days until something bad happens. Feel free to share your Promtail config too.

@joeky888

joeky888 commented Mar 14, 2023

@DylanGuedes Still, if Loki could detect the problem (e.g. warn when a label has more than 500 values), it would save devs a lot of time. Anyway, thanks for delivering a fantastic alternative to the huge, monstrous ELK stack. 😊

@zs-dima
Author

zs-dima commented Mar 15, 2023

@DylanGuedes I am not sure Promtail is the cause:

promtail.yml

server:
  disable: true

client:
  url: http://loki:3100/loki/api/v1/push

positions:
  filename: /tmp/promtail/positions.yaml

scrape_configs:
  - job_name: varlogs
    static_configs:
    - targets:
      - localhost
      labels:
        job: varlogs
        host: ${HOSTNAME:-default_value} # Add hostname label
        __path__: /var/log/*log
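
With only static labels like these, stream cardinality should be tiny; that can be confirmed via the series endpoint (a sketch; replace the host with wherever Loki is reachable):

curl -s -G http://127.0.0.1:3100/loki/api/v1/series --data-urlencode 'match[]={job="varlogs"}'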

@DylanGuedes
Contributor

Indeed, it doesn't seem like your Promtail config has anything that would cause problems with labels. In case you have some Go experience, I'd suggest attaching a debugger to the running query frontend; the fact that it works sometimes indicates that it might be failing due to some specific input somewhere.

@joeky888

@zs-dima What's the output of these commands? Do they look fine? How long do they take?

curl --max-time 600 http://loki:3100/loki/api/v1/label/job/values
curl --max-time 600 http://loki:3100/loki/api/v1/label/host/values
curl --max-time 600 http://loki:3100/loki/api/v1/label/__path__/values

@zs-dima
Author

zs-dima commented Mar 16, 2023

@joeky888

> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/job/values
{"status":"success","data":["varlogs"]}

> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/host/values
{"status":"success","data":["domain.com","hostname"]}

> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/__path__/values
{"status":"success"}

@DylanGuedes
Contributor

hmm wait, your api/v1/labels hangs but not api/v1/label? 🤔

@joeky888

joeky888 commented Mar 16, 2023

How about setting a smaller timeout like curl --max-time 10?

@zs-dima
Author

zs-dima commented Mar 16, 2023

@DylanGuedes currently labels are available, but they will hang after one or two Loki restarts.

@mizutani256

mizutani256 commented Mar 23, 2023

One pattern I noticed here, when using Loki (2.7.x) in a docker swarm, is that doing docker service rm or docker stack rm to stop the service causes Loki to hang on the label and value queries in subsequent deploys. Erasing all Loki data (index, chunks and wal) seems to solve this (evidently not a sustainable solution). It sounds like shutting down the service this way perhaps corrupts something in the log data? Using docker compose instead of stack never had this problem.

Edit: it also happens if I simply re-deploy the stack. Shutting down the service in any way while using docker stack seems to cause the problem.
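
The cycle that triggers it looks roughly like this (a sketch; the stack name and compose file are placeholders):

docker stack rm monitoring
docker stack deploy -c docker-compose.yml monitoring
# after the re-deploy, label and value queries hang until the Loki data (index, chunks, wal) is erased, per the observation above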

@Etran-H
Contributor

Etran-H commented Jun 5, 2023

I am using Loki on Windows; after a restart, I am unable to execute any queries.

level=warn ts=2023-06-05T07:54:58.1695052Z caller=pool.go:193 msg="removing frontend failing healthcheck" addr=192.168.222.201:9096 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-06-05T07:54:58.1695052Z caller=scheduler_processor.go:208 org_id=fake frontend=192.168.222.201:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=error ts=2023-06-05T07:54:58.1695052Z caller=scheduler_processor.go:252 org_id=fake frontend=192.168.222.201:9096 msg="error health checking" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=warn ts=2023-06-05T07:55:03.1722212Z caller=pool.go:193 msg="removing frontend failing healthcheck" addr=192.168.222.201:9096 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-06-05T07:55:03.1722212Z caller=scheduler_processor.go:208 org_id=fake frontend=192.168.222.201:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=error ts=2023-06-05T07:55:03.1722212Z caller=scheduler_processor.go:252 org_id=fake frontend=192.168.222.201:9096 msg="error health checking" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=warn ts=2023-06-05T07:55:08.1698863Z caller=pool.go:193 msg="removing frontend failing healthcheck" addr=192.168.222.201:9096 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-06-05T07:55:08.1698863Z caller=scheduler_processor.go:208 org_id=fake frontend=192.168.222.201:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=error ts=2023-06-05T07:55:08.1698863Z caller=scheduler_processor.go:252 org_id=fake frontend=192.168.222.201:9096 msg="error health checking" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=warn ts=2023-06-05T07:55:13.170649Z caller=pool.go:193 msg="removing frontend failing healthcheck" addr=192.168.222.201:9096 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-06-05T07:55:13.170649Z caller=scheduler_processor.go:208 org_id=fake frontend=192.168.222.201:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=error ts=2023-06-05T07:55:13.170649Z caller=scheduler_processor.go:252 org_id=fake frontend=192.168.222.201:9096 msg="error health checking" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=info ts=2023-06-05T07:55:16.8764023Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-06-05T07:55:16.8783998Z caller=table_manager.go:166 msg="handing over indexes to shipper"
level=warn ts=2023-06-05T07:55:18.1713666Z caller=pool.go:193 msg="removing frontend failing healthcheck" addr=192.168.222.201:9096 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2023-06-05T07:55:18.1713666Z caller=scheduler_processor.go:208 org_id=fake frontend=192.168.222.201:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=error ts=2023-06-05T07:55:18.1713666Z caller=scheduler_processor.go:252 org_id=fake frontend=192.168.222.201:9096 msg="error health checking" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
level=info ts=2023-06-05T07:55:22.3184449Z caller=marker.go:202 msg="no marks file found"

@chaudum chaudum added type/bug Something is not working as expected and removed kind/bug labels Jun 14, 2023
@sourcehawk

Having this issue right now after upgrading from the v5 to the v6 Helm chart. Instant responses for huge queries, but no response when curling labels.

@rorynickolls-skyral

rorynickolls-skyral commented Nov 22, 2024

I'm seeing this same issue intermittently on multiple GKE environments.

Right now, running the curl commands above I get:

❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/job/values
{"status":"success"}

❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/host/values
{"status":"success"}

❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/__path__/values
curl: (28) Operation timed out after 10006 milliseconds with 0 bytes received

This is with the loki-gateway k8s service port-forwarded to my local machine, and it implies a problem with that third path.

But the error I'm seeing in Grafana's logs when the labels fail to load is:

logger=context userId=0 orgId=1 uname= t=2024-11-22T09:35:15.097856651Z level=error msg="Request Completed" method=GET path=/api/datasources/uid/loki/resources/labels status=500 remote_addr=10.64.0.32 time_ms=30004 duration=30.004237513s size=116 referer= handler=/api/datasources/uid/:uid/resources/* status_source=downstream
logger=tsdb.loki endpoint=callResource pluginId=loki dsName=loki dsUID=loki uname= t=2024-11-22T09:35:34.346118797Z level=error msg="Error received from Loki" error="Get \"http://loki-gateway/loki/api/v1/index/stats?query=%7Bapp%3D%22%22%7D&start=1732264504201000000&end=1732268104202000000\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" resourcePath="/loki/api/v1/index/stats?query=%7Bapp%3D%22%22%7D&start=1732264504201000000&end=1732268104202000000" status=error duration=30.000364402s stage=databaseRequest
logger=tsdb.loki endpoint=callResource pluginId=loki dsName=loki dsUID=loki uname= t=2024-11-22T09:35:34.34616088Z level=error msg="Failed resource call from loki" err="Get \"http://loki-gateway/loki/api/v1/index/stats?query=%7Bapp%3D%22%22%7D&start=1732264504201000000&end=1732268104202000000\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" url="/loki/api/v1/index/stats?query=%7Bapp%3D%22%22%7D&start=1732264504201000000&end=1732268104202000000"
logger=context userId=0 orgId=1 uname= t=2024-11-22T09:35:34.346340417Z level=error msg="Internal server error" error="[plugin.downstreamError] client: failed to call resources: Get \"http://loki-gateway/loki/api/v1/index/stats?query=%7Bapp%3D%22%22%7D&start=1732264504201000000&end=1732268104202000000\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" remote_addr=10.64.0.32 traceID=
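
The same index/stats call can be issued directly against the port-forwarded gateway to take Grafana out of the loop (a sketch; the query and time range are copied from the log above):

curl --max-time 60 -G http://localhost:8080/loki/api/v1/index/stats \
  --data-urlencode 'query={app=""}' \
  --data-urlencode 'start=1732264504201000000' \
  --data-urlencode 'end=1732268104202000000'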

Edit: I thought it might be a problem with specific large labels, as suggested further up the thread, but that doesn't seem to be the case. Running the same curl twice for a very small label demonstrates the issue:

❯ curl --max-time 600 http://localhost:8080/loki/api/v1/label/exporter/values
{"status":"success","data":["OTLP"]}

❯ curl --max-time 600 http://localhost:8080/loki/api/v1/label/exporter/values
curl: (28) Operation timed out after 600005 milliseconds with 0 bytes received
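
One way to make the intermittent behaviour visible is to repeat the same small request and record only the status and timing (a sketch; assumes the same port-forwarded gateway):

for i in $(seq 1 20); do
  curl -o /dev/null -s --max-time 60 -w '%{http_code} %{time_total}s\n' \
    http://localhost:8080/loki/api/v1/label/exporter/values
done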

@yongzhang

Facing the same issue on Loki 3.2.1 (deployed with the read, write and backend targets).

Label value queries sometimes hang:

curl 'http://loki-reader/loki/api/v1/label/instance/values?start=1733769525983000000&end=1733791125983000000'

But a restart can fix the issue; it is not always reproducible. I only have 3 labels across all streams.
