Loki 3.0 Feedback and Issues #12506
Comments
Please update grafana.com/docs/loki before releasing a major update; it still shows the 2.9 documentation. :) |
I tried upgrading the Helm chart ( 5.47.2 → 6.0.0 ) but encountered these errors:
Pretty sure I adjusted all the breaking changes described in the release notes but maybe some of the custom config I have is not compatible? My Helm values are located here, any help? |
You are setting shared store in compactor. delete_request_store is now required |
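For readers hitting the same error, a minimal sketch of the values change the reply above describes. It assumes the Helm chart passes `loki.compactor` through to Loki's `compactor` block (as the values posted later in this thread suggest) and that `s3` is the object store in use; both are assumptions, not something confirmed in this thread.

```yaml
loki:
  compactor:
    # shared_store: s3          # the old setting the reply refers to; delete this line
    delete_request_store: s3    # now required, per the reply above ("s3" is an assumed store name)
```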
So I should just be able to rename |
Adding |
Since the upgrade everything looks good in our environments although the backend pods seem to be outputting a lot of: I suspect that's because blooms aren't enabled although when I do enable blooms we get a nil pointer:
|
When upgrading, the pod from the new stateful set 'loki-chunks-cache' couldn't be scheduled, because none of our nodes offer the requested 9830 MiB of memory. |
Very sorry about this. We are working on a new release process and also had problems with our documentation updates. I think there are still a few things we are working out, but hopefully most of it is correct now. |
You could disable this external memcached entirely by setting enabled: false or you can make it smaller by reducing
|
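A rough sketch of the two options mentioned above. It assumes the chunk cache is configured under a top-level `chunksCache` key and that its size is controlled by an `allocatedMemory` value in MB; both key names are assumptions here and should be checked against the chart's values.yaml before use.

```yaml
# Option 1: turn the external memcached chunk cache off entirely
chunksCache:
  enabled: false

# Option 2 (alternative): keep it, but request less memory
# chunksCache:
#   enabled: true
#   allocatedMemory: 2048   # illustrative value only, not a recommendation
```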
Awesome with the new bloom filter, for unique IDs etc! 🎉 I'm looking forward to closing issue #91 (from 2018) when the experimental bloom filters are stable. 😄 Regarding docs, some feedback:
Source: |
Trying to update Helm chart 5.43.2 to 6.1.0, but I am getting
|
For the loki helm chart: #12067 changed the port name for the gateway service from The gateway responds with a 404 on the |
For the loki chart we unfortunately had to face some downtime. This change 79b876b#diff-89f4fd98934eb0f277b921d45e4c223e168490c44604e454a2192d28dab1c3e2R4 forced the recreation of all the gateway resources: This is problematic for 2 reasons:
|
Two issues so far with my existing Helm values:
Then the Trying to render the helm chart locally with "helm --debug template" results in
I am trying to understand the nested template structure in the helm chart to see what is happening. A short set of helm chart values (which worked fine with 5.x) that triggers the phenomenon:
values.yaml
serviceAccount:
  create: false
  name: loki
test:
  enabled: false
monitoring:
  dashboards:
    enable: false
  lokiCanary:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
loki:
  auth_enabled: false
  limits_config:
    max_streams_per_user: 10000
    max_global_streams_per_user: 10000
  storage_config:
    aws:
      s3: s3://eu-central-1
      bucketnames: my-bucket-name
  schemaConfig:
    configs:
      - from: 2024-01-19
        store: tsdb
        object_store: aws
        schema: v11
        index:
          prefix: "some-prefix_"
          period: 24h
  query_range:
    split_queries_by_interval: 0
  query_scheduler:
    max_outstanding_requests_per_tenant: 8192
  analytics:
    reporting_enabled: false
  compactor:
    shared_store: s3
gateway:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3
compactor:
  enable: true
|
I thought I recognized that github picture!!!
2018!!! Thanks for the great feedback on the docs, very helpful. One note regarding SSD mode, honestly the original idea of SSD was to make Loki a lot more friendly outside of k8s environments, the problem we found ourselves in though is that we have had no good ability to support customers attempting to run Loki this way and as such we largely require folks to use kubernetes for our commercial offering. This is why the docs are so k8s specific. It continues to be a struggle to build an open source project which is extremely flexible for folks to run in many ways, but also a product that we have to provide support for. I'd love to know though how many folks are successfully running SSD mode outside of kubernetes. I'm still a bit bullish on the idea but over time I kind of feel like it hasn't played out as well as we hoped. |
oh interesting, we'll take a look at this, not sure what happened here, thanks! |
@MartinEmrich thank you, I will update the upgrade guide around schemaConfig, sorry about that. And thank you for the sample test values file! very helpful! |
Congratulations on the release! 🎉 :) Is there any way to verify that bloom filters are active and working? I cannot seem to find any metrics or log entries that might give a hint. There are also no bloom services listed on the curl -s -k https://localhost:3100/services
ruler => Running
compactor => Running
store => Running
ingester-querier => Running
query-scheduler => Running
ingester => Running
query-frontend => Running
distributor => Running
server => Running
ring => Running
query-frontend-tripperware => Running
analytics => Running
query-scheduler-ring => Running
querier => Running
cache-generation-loader => Running
memberlist-kv => Running
I tried deploying it on a single instance in monolithic mode via Docker by adding the following options:
limits_config:
  bloom_gateway_enable_filtering: true
  bloom_compactor_enable_compaction: true
bloom_compactor:
  enabled: true
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
bloom_gateway:
  enabled: true
  client:
    addresses: dns+localhost.localdomain:9095
Edit: My bad, it seems that the bloom components are not available when using |
Not sure if this is intended, but in _helpers.tpl there is an if check which might be wrong:
{{- if "loki.deployment.isDistributed "}}
A similar check is done here, which looks like this:
{{- $isDistributed := eq (include "loki.deployment.isDistributed" .) "true" -}}
{{- if $isDistributed -}}
The first form tests a non-empty string literal, so the if check is always true and frontend.tail_proxy_url gets set in the Loki config. But the configured tail_proxy_url does not point to an existing service (I used SSD deployment mode). Not sure if this has any impact. |
We encountered a bug in the rendering of the Loki config with the helm chart v6.0.0 that may be similar to what @MartinEmrich encountered above. These simple values will cause the rendering to fail:
loki:
  query_range:
    parallelise_shardable_queries: false
  useTestSchema: true
This causes
query_range:
align_queries_with_step: true
parallelise_shardable_queries: false
cache_results: true
I believe anything under
EDIT: I've added a PR to solve the above, but in general we've had trouble upgrading to Helm chart v6, as there are now two fields which are seemingly necessary where before they were not, and they're not listed in the upgrade guide:
In general I would personally prefer that I can always install a Helm chart with no values and get some kind of sensible default, even if only for testing out the chart. Later, when I want to go production-ready, I can tweak those parameters to something more appropriate. |
On the upgrade attempt using Simple Scalable mode
frontend:
  scheduler_address: ""
  tail_proxy_url: http://loki-querier.grafana.svc.gke-main-a.us-east1:3100
frontend_worker:
  scheduler_address: ""
It looks like |
Very helpful feedback, thank you! The The forced requirement for a schemaConfig is an interesting problem: if we default it in the chart, then people end up using it, which means we can't change it without breaking their clusters, because schemas can't be changed, only new ones added. I do suppose we could just add new ones, but that feels a bit like forcing an upgrade on someone... I'm not sure, this is a hard problem that I don't have great answers to. We decided that this time around we'd force people to define a schema, and provide the test schema config value, which should be spit out in an error message, if you want to just try the chart with data you plan on throwing away. It does seem like we need to update this error or that flag to also provide values for the storage defaults, however. |
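As an illustration of the two routes described above, a hedged sketch. `loki.useTestSchema` is the flag that appears in values posted earlier in this thread; the explicit schema is an example only, and its date, object store, and index prefix are placeholders rather than recommended defaults.

```yaml
# Throwaway test install: let the chart supply a test schema
loki:
  useTestSchema: true

# Real install (alternative): define the schema explicitly instead
# loki:
#   schemaConfig:
#     configs:
#       - from: "2024-04-01"      # placeholder date
#         store: tsdb
#         object_store: s3        # placeholder store
#         schema: v13
#         index:
#           prefix: index_        # placeholder prefix
#           period: 24h
```

As the reply notes, a schema period cannot be changed once data has been written under it, so the explicit form is the one to settle on before ingesting anything that needs to be kept.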
Hey folks sorry for being slow to respond to some of these issues. Appreciate your feedback and help finding and fixing problems! I've tried to make sure there are at least issues open for things folks are struggling with:
If I've missed anything please let me know! |
A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:
I apologize, as I know for some folks this is disruptive and not making your lives any better, but it's already extremely time consuming to maintain this chart, so simplifying it is a huge advantage for us. The new chart should come with options for just installing Grafana and dashboards as well as various methods for monitoring, although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of Mimir or Tempo, so their installs are quite large). I would also recommend folks try out using the monitoring chart with the free tier of Grafana Cloud as the backend. We can provision the dashboards you need via integrations, and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyone's lives easier. |
Hi @slim-bean, first of all, thank you for the feedback! Right now I'm using the monitoring part without any Grafana operator, with Loki canary scraped by Promtail and then sent to Loki. I don't see a reason for dropping the monitoring section in general, as the only thing it should do is deploy Loki canary, service monitors, and Grafana dashboards. I don't think such a stack will in any way confuse people or create the issues in the parent helm chart you mentioned. If this is not the case, then I would have to use my own helm chart, with all these resources created by myself and the loki chart as a dependency, which is not the best option though. Also, as I understand it, Promtail will also become obsolete, which is not the best option in my view. A quick look at Alloy gives me the feeling that its config structure is much more complicated compared to Promtail, and that it lacks a web interface to inspect targets, so label handling has to be guessed instead of checked. Also, having a daemonset that is responsible for multiple things that go unused, with a bunch of metrics that are not needed either, seems like overhead. |
Hi, when can we expect the 3.X.X release? I am interested in a couple of bugfixes and do not want to use an untagged image. |
We are getting multiple errors like these:
caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app="loki"} | logfmt | level="warn" or level="error"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"
Can you please help? |
Hey @slim-bean, can you please also have a look at my issue with the different S3 buckets and different access and secret keys? Not completely sure, but I think @JBodkin-Amphora has my issue as well. Thank you :) |
level=error ts=2024-05-16T09:04:08.131652605Z caller=flush.go:152 component=ingester org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: -> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42\n===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====\nDescription=The specified block list is invalid.\nRequestId:13f410b4-901e-007f-4770-a7b251000000\nTime:2024-05-16T09:04:08.0437568Z, Details: \n Code: InvalidBlockList\n PUT https://testinglokiprd.blob.core.windows.net/chunks/fake/a663ab7e36edbebb/18f807ba885-18f80897cbf-1d839c2?comp=blocklist&timeout=31\n Authorization: REDACTED\n Content-Length: [128]\n Content-Type: [application/xml]\n User-Agent: [Azure-Storage/0.14 (go1.21.9; linux)]\n X-Ms-Blob-Cache-Control: []\n X-Ms-Blob-Content-Disposition: []\n X-Ms-Blob-Content-Encoding: []\n X-Ms-Blob-Content-Language: []\n X-Ms-Blob-Content-Type: []\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Date: [Thu, 16 May 2024 09:04:08 GMT]\n X-Ms-Version: [2020-04-08]\n --------------------------------------------------------------------------------\n RESPONSE Status: 400 The specified block list is invalid.\n Content-Length: [221]\n Content-Type: [application/xml]\n Date: [Thu, 16 May 2024 09:04:08 GMT]\n Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Error-Code: [InvalidBlockList]\n X-Ms-Request-Id: [13f410b4-901e-007f-4770-a7b251000000]\n X-Ms-Version: [2020-04-08]\n\n\n, num_chunks: 1, labels: {app="parquet-2grvk", container="main", filename="/var/log/pods/argo-workflows_parquet-2grvk-parquet-29307887_8b782254-47c5-4449-b4c8-0de438c02206/main/0.log", job="argo-workflows/parquet-2grvk", namespace="argo-workflows", node_name="aks-defaultgreen-11165910-vmss0000oy", pod="parquet-2grvk-parquet-29307887", stream="stderr"}"
What does this error mean? I started getting it after upgrading to Loki 3.0.0. |
Hi @kunalmehta-eve - I'm probably not the right person to ask about this as I'm a consumer of Loki, not one of the maintainers. All I can recommend is checking the block list that it's flagging as invalid and comparing it to the requirements as defined in the 3.0 docs. |
Is bloom gateway supposed to work in simple scalable mode? Because documentation on how to enable it is non-existent https://grafana.com/docs/loki/latest/get-started/deployment-modes/ and in the helm chart. Also, the current bloom gateway and compactor charts are made to work only with the distributed mode of Loki
|
|
Is this issue fixed? I am trying to migrate Loki to Helm chart version 6.X.X and I am getting the error below
|
We are seeing very high memory usage / memory leaks when ingesting logs with structured metadata. See https://community.grafana.com/t/memory-leaks-in-ingester-with-structured-metadata/123177 and #10994 Reported under #13123 and now fixed. Thanks :) |
Thanks for the info, just trying to make sure I'm following. It seems like a lot of your response is around the Grafana Agent Operator, and most of that configuration seems to be through the Looking at the So is the intent that it's the entire |
@zach-flaglerhealth I agree with you. If this is the case, I would end up writing my own helm chart to ship my own service monitors and dashboards. Not the best option, but for me using cloud services for monitoring isn't an option, and migrating to Grafana Mimir instead of kube-prometheus-stack and Thanos just because of a couple of dashboards and monitors is not an option either. I am already using my own helm chart that ships Loki and Promtail with the needed configuration, where both are set as dependencies. But someday I will have to move away from Promtail as well :( |
Hi Team, |
Just doing another upgrade attempt on a less-important environment. I still have issues doing the schema upgrade/schema config.
Again, the old 2.x version at least ignored the schema index prefix; I found mostly "loki_index_*" folders in the S3 bucket. But the logs from yesterday and before should be retrievable, unless something in the first block does not match reality. I see no errors in the backend or reader logs. How could I reconstruct the correct schemaConfigs for yesterday and earlier from looking at my actual S3 bucket contents? Update: I noticed that the new index folders contain *.tsdb.gz files (I would expect that with "store: tsdb"). The older index folders only contain a "compactor-XXXXXXXXXX.r.gz" file. What could that hint at? |
... After trying lots of combinations, it looks like Schema v12, boltdb-shipper and "loki_index_" prefix did the trick. |
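For reference, a sketch of a period entry matching the combination reported above (schema v12, boltdb-shipper, `loki_index_` prefix). The `from` date and `object_store` are placeholders, and nesting it under `loki.schemaConfig` assumes the Helm chart layout discussed in this thread; the entry has to match whatever the cluster was actually running when the data was written.

```yaml
loki:
  schemaConfig:
    configs:
      - from: "2022-01-01"        # placeholder; use the real start date of this period
        store: boltdb-shipper
        object_store: s3          # placeholder
        schema: v12
        index:
          prefix: loki_index_     # matches the folder names found in the bucket
          period: 24h
```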
Hello, I have also encountered this error repeatedly. May I ask if your problem has been resolved? |
Seems to have worked for me |
Gotta have to say, the upgrade to helm chart v6 was a bad experience. This whole |
I have to agree. After many pains, lost log periods and some critical glances from colleagues, my/our Loki updates are all done and seem to work, it's time for a conclusion.
|
I've been looking at migrating to this helm chart from the loki-distributed helm chart, however it is still impossible. The biggest issue seems to be that the affinity and topologySpreadConstraints sections cannot be templated. For example:
Some of the other issues that I've encountered are:
|
When updating the storageConfig in the v6 helm chart to the following, setting the date of the new tsdb store to one day into the future as stated by the documentation, I get errors in the Loki pods (read, write, backend):
- from: "2022-01-11"
  index:
    period: "24h"
    prefix: "loki_index_"
  object_store: "s3"
  schema: "v12"
  store: "boltdb-shipper"
- from: "2024-09-10"
  index:
    prefix: "index_"
    period: "24h"
  object_store: "s3"
  schema: "v13"
  store: "tsdb"
Error:
This error does not occur when I set the The error is clear by saying that I should disable allow_structured_metadata, but why isn't this just done automatically according to the storage schema I am using? Why do I have to add the storage configuration and then enable/disable this twice, once before and once after the correct date has been reached for my second storage entry? As a user I couldn't care less whether you store structured metadata or not, and frankly I have no idea what it means. All I know is that it breaks the upgrade process. Also, will the new tsdb store work without setting |
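The setting named in the error can be expressed in Helm values roughly as follows. This is a sketch that assumes `loki.limits_config` is passed through to Loki's `limits_config` block, as in the values posted earlier in this thread.

```yaml
loki:
  limits_config:
    # Structured metadata needs an active tsdb / v13 schema period,
    # so keep it off until the new schema entry has taken effect.
    allow_structured_metadata: false
```

Once the `from` date of the v13 entry has passed, the override can be removed or set to `true` if structured metadata is wanted.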
@slim-bean Hello! It's been a while, but could you provide some insight or reason for choosing 8192 as the value for |
Hi, I'm facing an issue where new Loki 3.2.0 cluster nodes are not able to join the existing cluster ring, so I am not able to do the migration.
Existing cluster version: 2.9.6. Can you please help? |
I've found a fix for the issue of being unable to get old logs after upgrading from In my case, a mistake was made in the config during the upgrade from 2.9.x to 3.x.x due to the removal of There wasn't any FROM: TO: Checking in GCS Bucket:
|
hi , i'm facing issue, where |
@slim-bean Sorry just to confirm, do I have to specifically add the empty value to |
If you encounter any troubles upgrading to Loki 3.0 or have feedback for the upgrade process, please leave a comment on this issue!
You can also ask questions at https://slack.grafana.com/ in the channel #loki-3
Known Issues:
schema_config was renamed to schemaConfig and this is not documented