We run Loki in simple scalable mode on Kubernetes with the WAL enabled, writing to PVCs. When a PVC becomes full, the corresponding loki-write instance appears to brick itself: the ingester can no longer initialize because there is insufficient space on the PVC, so it never gets to replay and flush the WAL to long-term storage, which is what would free up space on the PVC again.
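For context, this is roughly how the write path is set up. The WAL directory matches the path in the logs below; the PVC size and the StatefulSet snippet are illustrative, not our exact manifests:

# Loki config on the write pods (sketch; only the ingester.wal block is shown)
ingester:
  wal:
    enabled: true
    dir: /var/loki/wal          # lives on the per-pod PVC

# PVC backing /var/loki on each loki-write pod (abbreviated from the StatefulSet; size illustrative)
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi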
Startup logs from loki-write:
level=error ts=2024-12-03T13:48:43.80708689Z caller=log.go:216 msg="error running loki" err="open /var/loki/wal/00039919: no space left on device
error initialising module: ingester
github.com/grafana/dskit/modules.(*Manager).initModule
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:458
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695"
level=info ts=2024-12-03T13:48:43.800163536Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2024-12-03T13:48:43.800103805Z caller=manager.go:86 index-store=tsdb-2024-09-17 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=49 indices=30 failures=0
level=info ts=2024-12-03T13:48:43.796691475Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2024-12-03T13:48:43.796371461Z caller=table_manager.go:136 index-store=tsdb-2024-09-17 msg="uploading tables"
level=info ts=2024-12-03T13:48:43.79630754Z caller=shipper.go:160 index-store=tsdb-2024-09-17 msg="starting index shipper in WO mode"
We had multiple problems leading up to this particular issue, but I think the root cause was an incorrectly configured ingester.wal.replay_memory_ceiling. It was left at the default (4GB according to the docs) while the memory resource limit for loki-write was set to 2GB. During a deployment, log ingestion happened to be surging, and one of the loki-write instances got OOM-killed during WAL replay while it was initializing; ingestion to the same instance had continued in the meantime, so the WAL kept growing until the PVC eventually ran out of space. At that point we realized replay_memory_ceiling was misconfigured, adjusted it to 1.5GB, and redeployed. This is when the log record above started to appear, causing the container to crash.
We ended up deleting the PVC to get the instance up and running again.
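For reference, the adjustment we made is roughly the following sketch. The 1.5GB ceiling and 2Gi limit are the values mentioned above; the surrounding structure is abbreviated and illustrative:

# Loki config: keep the replay ceiling well below the container memory limit
# (we lowered it from the 4GB default to 1.5GB)
ingester:
  wal:
    replay_memory_ceiling: 1.5GB

# loki-write container resources (abbreviated)
resources:
  limits:
    memory: 2Gi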
Config:
Environment: