
Loki compaction failure 'NoSuchKey: The specified key does not exist' #15250

Open
milon619 opened this issue Dec 4, 2024 · 1 comment

milon619 commented Dec 4, 2024

Issue Summary:
Potential Loki Compaction and Retention Issue After Resource Starvation

Our Setup

  • Loki distributed multi-tenant running in k8s
  • Version: 3.2.0

Compactor specific settings

compaction_interval: 10m
max_compaction_parallelism: 2
apply_retention_interval: 10m
retention_delete_worker_count: 50
delete_batch_size: 50
delete_request_cancel_period: 1h
delete_request_store: s3
retention_delete_delay: 1h
retention_enabled: true
working_directory: /var/loki/compactor
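
For completeness, these settings live under the compactor block of our Loki configuration. A minimal sketch of just that section (all other sections of the config are omitted here):

# compactor section of the Loki config (sketch; only the settings listed above)
compactor:
  working_directory: /var/loki/compactor
  compaction_interval: 10m
  max_compaction_parallelism: 2
  retention_enabled: true
  apply_retention_interval: 10m
  retention_delete_delay: 1h
  retention_delete_worker_count: 50
  delete_batch_size: 50
  delete_request_cancel_period: 1h
  delete_request_store: s3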

We suspect that Loki’s compaction process might have failed due to resource starvation in our setup. After observing this behavior, we increased CPU and memory allocations, ensuring that current resource utilization remains well within limits.

However, we now have concerns about whether retention policies are being successfully applied to files in object storage.

We are monitoring the metric loki_compactor_apply_retention_operation_total, which indicates that retention operations are failing. Additionally, the metric loki_compactor_apply_retention_last_successful_run_timestamp_seconds corroborates this observation, showing no recent successful runs.
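
For anyone watching the same metrics, this is roughly how the failure can be alerted on. A minimal Prometheus alerting-rule sketch; the 3-hour threshold and the status="failure" label value are assumptions on our part, adjust them to your scrape labels and retention/compaction intervals:

# Prometheus alerting-rule sketch for the two compactor metrics mentioned above
groups:
  - name: loki-compactor-retention
    rules:
      - alert: LokiRetentionNotRunning
        # Fires when the compactor has not completed a successful retention run
        # for more than 3 hours (threshold is an assumption).
        expr: time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 3 * 3600
        for: 15m
        labels:
          severity: warning
      - alert: LokiRetentionFailures
        # Fires when retention operations report failures over the last hour.
        # The status label name/value is an assumption; check your metric's labels.
        expr: increase(loki_compactor_apply_retention_operation_total{status="failure"}[1h]) > 0
        for: 15m
        labels:
          severity: warning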

The compaction failure log line indicative of the problem is:

sg="failed to compact files" table=index_20031 err="failed to rewrite chunk '<redacted/>d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72 with error failed to load chunk '<redacted>/d3e4c83236a4dc26/192f6fb4e1e:192f6fcc62c:e833ad72': failed to get s3 object: NoSuchKey: The specified key does not exist.\n\tstatus code: 404,

Upon inspecting the object storage, we noticed that the offending keys do not exist. This raises a puzzling question: why is Loki attempting to process a non-existent chunk file?

Initially, to address the issue, we removed the offending table index for the specified tenant from S3, hoping this would resolve the problem. However, the issue persisted, now failing on a different chunk instead.

Apart from this, we are also occasionally seeing the following errors from the compactor. We are also trying to understand which exact operation is timing out here.

2024-12-04 11:03:53.779 | message=level=warn ts=2024-12-04T11:03:53.765260464Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733305507717040496 err=timeout
2024-12-04 11:04:20.179 | message=level=warn ts=2024-12-04T11:04:20.122143977Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733143673475594307 err=timeout
2024-12-04 11:04:25.179 | message=level=warn ts=2024-12-04T11:04:25.080249868Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733144302072920250 err=timeout
2024-12-04 11:04:30.079 | message=level=warn ts=2024-12-04T11:04:30.03179495Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733146877540183809 err=timeout
2024-12-04 11:04:35.079 | message=level=warn ts=2024-12-04T11:04:35.029838891Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733147546669388935 err=timeout
2024-12-04 11:04:40.087 | message=level=warn ts=2024-12-04T11:04:40.02930571Z caller=marker.go:213 msg="failed to process marks" path=/var/loki/compactor/retention/s3_2024-09-09/markers/1733149264257553103 err=timeout

The impact of compaction not working is that we are seeing increased error rates and latency in read requests and rule evaluations. Our S3 object storage costs and cross-AZ network transfer costs are also going up due to this issue.

Could you please help investigate this issue?

@JStickler
Contributor

Questions have a better chance of being answered if you ask them on the community forums.
