[BUG] [Loki] Loki hangs on labels request #7496
Comments
Hi, thanks for reporting it. It is really hard to tell if this is a problem with your setup or a bug. What I can say is that a lot of users have been running Loki without experiencing the same thing, so it might just be a misconfiguration.
|
@DylanGuedes, thanks a lot. With log.level=debug Loki started without hanging - I will collect more information on the next restart, when Loki hangs again.
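For reference, a minimal sketch of how the debug level can be passed on the command line (the -log.level flag is the standard Loki option mentioned above; the config path is just illustrative):

```bash
# -log.level accepts debug, info, warn, error
loki -config.file=/loki/loki.yml -log.level=debug
```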
most users do not use ARM servers, I guess, keeping in mind these issues are still open: |
@DylanGuedes reproduced with the next restart as expected.
range requests run without hanging, same as
|
Could you get a goroutine dump when a labels request hangs and provide its output?
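For anyone else hitting this: a goroutine dump can usually be fetched from Loki's built-in pprof handler on the HTTP port while the request is hanging (a sketch assuming the default port 3100):

```bash
# debug=2 returns full stack traces for every goroutine;
# capture it while the labels request is stuck.
curl -s "http://127.0.0.1:3100/debug/pprof/goroutine?debug=2" > goroutines.txt
```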
|
Some error logs from the latest hang:
|
Hey @zs-dima, the latest logs you posted suggest that the issue is not only related to the labels query, but also affects push requests. There may be a deadlock, since all of the push requests time out after 5s. |
@chaudum thanks a lot for looking into it. Strange that sometimes Loki works well after a restart and sometimes it hangs. |
@chaudum you could use a free Oracle A1, or I could provide access to the VM I am testing it on. |
Hello! Not sure if it's the same problem since I'm not using ARM, but the symptoms are exactly the same: the ready query works, labels hang. In my case it happens all the time. Here's more information. I'm running Loki 2.6.1 using a docker swarm stack. The stack is set up using Ansible over a GitLab CI pipeline. Here's the docker setup:

```yaml
loki:
  image: grafana/loki:2.6.1
  command: "-config.file=/loki/loki.yml"
  ports:
    - "3100:3100"
  volumes:
    - "{{ loki_dir }}:/loki"
```

And the loki configuration:
```yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1m
  chunk_retain_period: 30s
  wal:
    enabled: true
    dir: "/loki/wal"

schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 8760h # 1 year

storage_config:
  boltdb:
    directory: "/loki/index"
  filesystem:
    directory: "/loki/chunks"

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```

Here's Loki's output when I try to query the labels then CTRL+C after a while:
The environment is Ubuntu 22.04 on an amd64 architecture (as output by

I'll try the dump thing and update here, hope it helps to further narrow down the cause.

UPDATE 1: Hangs for
UPDATE 2: Attaching dump (run during a query to labels that hung) |
Facing the same issue (Loki 2.7.0). Everything works for a few days, then suddenly stops (after redeploying it works again). This call just gets stuck and never finishes. I enabled debug logs but I can't see anything relevant.
If needed I can share more context about this issue. |
Reproduced this issue with x64 Debian as well. So it is not only an ARM issue. |
Getting this with Loki 2.7.3 deployed in Kubernetes in microservices mode via this Helm chart from Bitnami on an Almalinux node (amd64):
I've recorded a couple of sysdig traces for the frontend microservice and reproduced this issue. It turns out that it doesn't even try to contact the query scheduler when the

Before restarting the frontend:
1. Receive
|
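As an aside on the sysdig approach above, a hedged sketch of how such a trace can be captured (the proc.name filter is an assumption and must be adjusted to however the query-frontend process is actually named in your deployment):

```bash
# Record the syscalls of the Loki process to a capture file while the
# labels request is hanging, then replay the capture for inspection.
sudo sysdig -w frontend-trace.scap proc.name=loki
sudo sysdig -r frontend-trace.scap
```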
Same issue. Deleting the S3 bucket and DynamoDB table, then doing a clean install of Loki and Promtail with the Helm chart, solved the issue. Edit: |
I found the problem: I had a terrible label in the Promtail config:

```yaml
config:
  snippets:
    pipelineStages:
      - cri: {} # "docker" for k8s <= 1.23, "cri" for k8s >= 1.24
      - multiline:
          firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<cri>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\])\s(?P<message>(?s:.*))$'
      - labels:
          cri: # Previous regex <cri> 👈 THIS IS A TERRIBLE IDEA! REMOVE THIS LINE!
      - output:
          source: message # Previous regex <message>
```

The problem is that the `cri` label gets an almost unique value for every log line (the regex captures a timestamp), so its value set grows enormously. You can check any potentially terrible label with:

```bash
curl --max-time 600 http://loki:3100/loki/api/v1/label/<potential_terrible_label>/values
```

For this example, it would be:

```bash
curl --max-time 600 http://loki:3100/loki/api/v1/label/cri/values
```

And this is what I got, very bad:
Removing the label from the Promtail config and doing a clean install solved the issue:

```yaml
config:
  snippets:
    pipelineStages:
      - cri: {} # "docker" for k8s <= 1.23, "cri" for k8s >= 1.24
      - multiline:
          firstline: '^\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\]'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<cri>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}\])\s(?P<message>(?s:.*))$'
      #- labels:
      #    cri: # Previous regex <cri> 👈 THIS IS A TERRIBLE IDEA! REMOVE THIS LINE!
      - output:
          source: message # Previous regex <message>
```

And now everything works. |
Fantastic find, thank you. |
@zs-dima @hrvatskibogmars you're probably not using CRI, but the trick shared by @joeky888 can be pretty helpful to debug your environment. See if your labels look good; I remember you mentioning that things work well for some days until something bad happens. Feel free to share your Promtail cfg too. |
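For anyone who wants to follow this advice, here is a rough sketch of how to eyeball label cardinality via the HTTP API and jq (the host and port are assumptions; adjust them to your deployment):

```bash
# List every label, then print how many distinct values each one has.
# A label with thousands of values is a likely culprit.
LOKI=http://127.0.0.1:3100
for label in $(curl -s "$LOKI/loki/api/v1/labels" | jq -r '.data[]'); do
  count=$(curl -s --max-time 60 "$LOKI/loki/api/v1/label/$label/values" | jq '.data | length')
  echo "$label: $count"
done
```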
@DylanGuedes Still, if Loki could detect the problem (e.g. treat a label with > 500 values as worth a warning), it could save devs a lot of time. Anyway, thanks for delivering a fantastic alternative to the huge, monstrous ELK stack. 😊 |
@DylanGuedes I am not sure Promtail is the cause here: promtail.yml
|
Indeed, it doesn't seem like your Promtail config has anything wrong that would cause problems with labels. In case you have some Go experience: I'd suggest attaching a debugger to the running QF (query frontend); the fact that it works sometimes indicates that it might be failing due to some specific input somewhere. |
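For anyone wanting to try that, a rough sketch with Delve (the pgrep pattern is an assumption about how the query-frontend process is named; it also assumes a single matching process):

```bash
# Attach the Go debugger to the running query frontend and inspect goroutines.
dlv attach "$(pgrep -f 'loki.*query-frontend')"
# At the (dlv) prompt:
#   goroutines          # list all goroutines
#   goroutine <id> bt   # backtrace of one goroutine, e.g. one blocked on a mutex
```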
@zs-dima What's the output of these commands? Do they look fine? How long do they take?
|
```bash
> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/job/values
{"status":"success","data":["varlogs"]}

> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/host/values
{"status":"success","data":["domain.com","hostname"]}

> curl --max-time 600 http://127.0.0.1:3100/loki/api/v1/label/__path__/values
{"status":"success"}
```
|
hmm wait, your |
How about setting a smaller timeout like |
@DylanGuedes currently labels are available, but they will hang after one or two Loki restarts |
One pattern I noticed here, when using Loki (2.7.x) in a docker swarm, is that doing

Edit: it also happens if I simply re-deploy the stack. Shutting down the service in any way while using docker stack seems to cause the problem. |
I am using Loki on Windows, but after a restart I am unable to execute any queries.
|
Having this issue right now after upgrading from the V5 to the V6 Helm chart. Instant response for huge queries, but no response when curling labels. |
I'm seeing this same issue intermittently on multiple GKE environments. Right now, running the following queries gives:

```bash
❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/job/values
{"status":"success"}

❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/host/values
{"status":"success"}

❯ curl --max-time 10 http://localhost:8080/loki/api/v1/label/__path__/values
curl: (28) Operation timed out after 10006 milliseconds with 0 bytes received
```

This is with the loki-gateway k8s service port-forwarded to my local machine, and it implies a problem with that third path. But the error I'm seeing in Grafana's logs when the labels fail to load is:
Edit: I thought it might be a problem with specific large labels, as suggested further up the thread, but that doesn't seem to be the case. Running the same query twice in a row gives inconsistent results:

```bash
❯ curl --max-time 600 http://localhost:8080/loki/api/v1/label/exporter/values
{"status":"success","data":["OTLP"]}

❯ curl --max-time 600 http://localhost:8080/loki/api/v1/label/exporter/values
curl: (28) Operation timed out after 600005 milliseconds with 0 bytes received
```
|
Facing the same issue on Loki 3.2.1 (deployed in targets

Label value queries sometimes hang:
But a restart sometimes fixes the issue; it is not always reproducible. I only have 3 labels across all streams. |
Loki often hangs on a labels request. However, sometimes it works correctly after a restart.
I tried the latest, 2.5.0, and main grafana/loki versions - the issue is still there.
Server Oracle Cloud A1 ARM with Ubuntu 22.04.1 LTS
Docker version 20.10.20, build 9fdeb9c
```bash
$ curl http://127.0.0.1:3100/ready
ready

$ curl http://127.0.0.1:3100/loki/api/v1/labels    # <- hangs indefinitely
```
loki.yml
https://gist.github.com/zs-dima/e23237f7c7e1ff6011588fb77670412f
The system is not overloaded either.
Loki logs have no errors or warnings.
However, rarely an "error contacting scheduler" message appears:
caller=scheduler_processor.go:87 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.10.0.4:9095: i/o timeout\"" addr=10.10.0.4:9095
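A quick way to rule out plain network trouble behind that message is to check whether the scheduler's gRPC port is reachable from the component logging the error (a sketch; the address is taken from the log line above):

```bash
# Succeeds quickly if the port is reachable; otherwise times out,
# matching the "i/o timeout" in the error above.
nc -vz -w 5 10.10.0.4 9095
```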