Potential Loki Compaction and Retention Issue After Resource Starvation
We suspect that Loki’s compaction process might have failed due to resource starvation in our setup. After observing this behavior, we increased CPU and memory allocations, ensuring that current resource utilization remains well within limits.
However, we now have concerns about whether retention policies are being successfully applied to files in object storage.
We are monitoring the metric `loki_compactor_apply_retention_operation_total`, which indicates that retention operations are failing. Additionally, the metric `loki_compactor_apply_retention_last_successful_run_timestamp_seconds` corroborates this observation, showing no recent successful runs.
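For context, this is roughly how the failing/stale state can be checked against the Prometheus HTTP API. This is only a sketch: the Prometheus URL is a placeholder, and the `status` label on the operation counter is an assumption that may vary by Loki version.

```python
# Sketch: query the compactor retention metrics via the Prometheus HTTP API.
# Assumptions: Prometheus is reachable at PROM_URL (placeholder), and the
# operation counter carries a `status` label; adjust both to your setup.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder

def instant_query(expr: str):
    """Run an instant PromQL query and return the result list."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Retention operations over the last 24h, broken down by outcome.
ops = instant_query(
    "sum by (status) (increase(loki_compactor_apply_retention_operation_total[24h]))"
)
for series in ops:
    print(series["metric"].get("status", "unknown"), series["value"][1])

# Seconds since the last successful retention run, per compactor instance.
stale = instant_query(
    "time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds"
)
for series in stale:
    age = float(series["value"][1])
    instance = series["metric"].get("pod", series["metric"].get("instance", "?"))
    print(f"{instance}: last successful run {age / 3600:.1f}h ago")
```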
The compaction failure log that is indicative of the problem is:

```
msg="failed to compact files" table=index_20031 err="failed to rewrite chunk '<redacted>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72' with error failed to load chunk '<redacted>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72': failed to get s3 object: NoSuchKey: The specified key does not exist.\n\tstatus code: 404,
```
Upon inspecting the object storage, we noticed that the offending keys do not exist. This raises a puzzling question: why is Loki attempting to process a non-existent chunk file?
Initially, to address the issue, we removed the offending table index for the specified tenant from S3, hoping this would resolve the problem. However, the issue persisted; compaction then failed on a different chunk file in the same way.
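For reference, this is roughly how we confirm that a chunk key from the compaction error is absent from the bucket. It is only a sketch: the bucket name and chunk key below are placeholders, since the real values are redacted above.

```python
# Sketch: check whether a chunk key referenced by the compactor exists in S3.
# BUCKET and CHUNK_KEY are placeholders; the real key is redacted in the log above.
import boto3
import botocore.exceptions

BUCKET = "our-loki-chunks-bucket"  # placeholder
CHUNK_KEY = "<tenant>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72"  # placeholder

s3 = boto3.client("s3")
try:
    head = s3.head_object(Bucket=BUCKET, Key=CHUNK_KEY)
    print("exists:", head["ContentLength"], "bytes, last modified", head["LastModified"])
except botocore.exceptions.ClientError as err:
    # HEAD requests surface a bare 404 when the key is missing.
    if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
        print("key does not exist:", CHUNK_KEY)
    else:
        raise
```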
Apart from this, we also occasionally see the following warnings from the compactor, and we are trying to understand which exact operation is timing out here.
```
level=warn ts=2024-12-04T11:03:53.765260464Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733305507717040496 err=timeout
level=warn ts=2024-12-04T11:04:20.122143977Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733143673475594307 err=timeout
level=warn ts=2024-12-04T11:04:25.080249868Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733144302072920250 err=timeout
level=warn ts=2024-12-04T11:04:30.03179495Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733146877540183809 err=timeout
level=warn ts=2024-12-04T11:04:35.029838891Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733147546669388935 err=timeout
level=warn ts=2024-12-04T11:04:40.02930571Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733149264257553103 err=timeout
```
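To gauge how far behind mark processing is, we have been looking at the marker files under the path from these warnings. Below is a rough sketch of that check; reading the marker directory as the compactor's pending-deletion backlog is our interpretation of the retention docs, not something confirmed for this exact version.

```python
# Sketch: inspect the compactor's local retention marker files to gauge backlog.
# The directory below is taken from the warnings above; run inside the compactor pod.
import datetime
from pathlib import Path

MARKER_DIR = Path("/var/loki/compactor/retention/s3_2024-09-09/markers")

markers = sorted(p for p in MARKER_DIR.iterdir() if p.name.isdigit())
print(f"{len(markers)} marker files pending")
for marker in markers:
    # The file names appear to be nanosecond Unix timestamps (they match the log dates);
    # old files that never get cleaned up suggest mark processing is not keeping up.
    created = datetime.datetime.fromtimestamp(int(marker.name) / 1e9)
    size_kib = marker.stat().st_size / 1024
    print(f"{marker.name}  created={created:%Y-%m-%d %H:%M}  size={size_kib:.1f} KiB")
```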
The impact of compaction not working is increased error rates and latency on read requests and rule evaluations. Our S3 object storage costs and cross-AZ network transfer costs are also going up because of this issue.
Could you please help investigate this issue?