Potential Loki Compaction and Retention Issue After Resource Starvation
We suspect that Loki’s compaction process might have failed due to resource starvation in our setup. After observing this behavior, we increased CPU and memory allocations, ensuring that current resource utilization remains well within limits.
However, we now have concerns about whether retention policies are being successfully applied to files in object storage.
We are monitoring the metric `loki_compactor_apply_retention_operation_total`, which indicates that retention operations are failing. Additionally, the metric `loki_compactor_apply_retention_last_successful_run_timestamp_seconds` corroborates this observation, showing no recent successful runs.
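For context, this is roughly how the failing/stale state can be checked against the Prometheus HTTP API. This is only a sketch: the Prometheus URL is a placeholder, and the `status` label on the operation counter is an assumption that may vary by Loki version.

```python
# Sketch: query the compactor retention metrics via the Prometheus HTTP API.
# Assumptions: Prometheus is reachable at PROM_URL (placeholder), and the
# operation counter carries a `status` label; adjust both to your setup.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder

def instant_query(expr: str):
    """Run an instant PromQL query and return the result list."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Retention operations over the last 24h, broken down by outcome.
ops = instant_query(
    "sum by (status) (increase(loki_compactor_apply_retention_operation_total[24h]))"
)
for series in ops:
    print(series["metric"].get("status", "unknown"), series["value"][1])

# Seconds since the last successful retention run, per compactor instance.
stale = instant_query(
    "time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds"
)
for series in stale:
    age = float(series["value"][1])
    instance = series["metric"].get("pod", series["metric"].get("instance", "?"))
    print(f"{instance}: last successful run {age / 3600:.1f}h ago")
```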
The compaction failure log that is indicative of the problem is:

```
msg="failed to compact files" table=index_20031 err="failed to rewrite chunk '<redacted>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72' with error failed to load chunk '<redacted>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72': failed to get s3 object: NoSuchKey: The specified key does not exist.\n\tstatus code: 404,
```
Upon inspecting the object storage, we noticed that the offending keys do not exist. This raises a puzzling question: why is Loki attempting to process a non-existent chunk file?
Initially, to address the issue, we removed the offending table index for the specified tenant from S3, hoping this would resolve the problem. However, the issue persisted; compaction then failed on a different chunk file in the same way.
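For reference, this is roughly how we confirm that a chunk key from the compaction error is absent from the bucket. It is only a sketch: the bucket name and chunk key below are placeholders, since the real values are redacted above.

```python
# Sketch: check whether a chunk key referenced by the compactor exists in S3.
# BUCKET and CHUNK_KEY are placeholders; the real key is redacted in the log above.
import boto3
import botocore.exceptions

BUCKET = "our-loki-chunks-bucket"  # placeholder
CHUNK_KEY = "<tenant>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72"  # placeholder

s3 = boto3.client("s3")
try:
    head = s3.head_object(Bucket=BUCKET, Key=CHUNK_KEY)
    print("exists:", head["ContentLength"], "bytes, last modified", head["LastModified"])
except botocore.exceptions.ClientError as err:
    # HEAD requests surface a bare 404 when the key is missing.
    if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
        print("key does not exist:", CHUNK_KEY)
    else:
        raise
```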
Apart from this, we also occasionally see the following warnings from the compactor, and we are trying to understand which exact operation is timing out here.
```
level=warn ts=2024-12-04T11:03:53.765260464Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733305507717040496 err=timeout
level=warn ts=2024-12-04T11:04:20.122143977Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733143673475594307 err=timeout
level=warn ts=2024-12-04T11:04:25.080249868Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733144302072920250 err=timeout
level=warn ts=2024-12-04T11:04:30.03179495Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733146877540183809 err=timeout
level=warn ts=2024-12-04T11:04:35.029838891Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733147546669388935 err=timeout
level=warn ts=2024-12-04T11:04:40.02930571Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733149264257553103 err=timeout
```
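To gauge how far behind mark processing is, we have been looking at the marker files under the path from these warnings. Below is a rough sketch of that check; reading the marker directory as the compactor's pending-deletion backlog is our interpretation of the retention docs, not something confirmed for this exact version.

```python
# Sketch: inspect the compactor's local retention marker files to gauge backlog.
# The directory below is taken from the warnings above; run inside the compactor pod.
import datetime
from pathlib import Path

MARKER_DIR = Path("/var/loki/compactor/retention/s3_2024-09-09/markers")

markers = sorted(p for p in MARKER_DIR.iterdir() if p.name.isdigit())
print(f"{len(markers)} marker files pending")
for marker in markers:
    # The file names appear to be nanosecond Unix timestamps (they match the log dates);
    # old files that never get cleaned up suggest mark processing is not keeping up.
    created = datetime.datetime.fromtimestamp(int(marker.name) / 1e9)
    size_kib = marker.stat().st_size / 1024
    print(f"{marker.name}  created={created:%Y-%m-%d %H:%M}  size={size_kib:.1f} KiB")
```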
The impact of compaction not working is increased error rates and latency on read requests and rule evaluations. Our S3 object storage costs and cross-AZ network transfer costs are also going up because of this issue.
Could you please help investigate this issue?