
fix(base-cluster/loki): adjust retention settings for loki logs #1745

Merged

marvinWolff merged 2 commits into main from fix/loki-retention on Oct 27, 2025

Conversation

@tasches (Collaborator) commented Oct 17, 2025

Summary by CodeRabbit

  • Chores
    • Updated log retention to 45 days.
    • Introduced compactor tuning: 15m compaction interval, 24h deletion delay, and 150 deletion workers.
    • Removed the legacy retention mechanism and consolidated retention/deletion configuration.

@tasches tasches self-assigned this Oct 17, 2025
@coderabbitai (Bot) commented Oct 17, 2025

Walkthrough

Replaces Loki HelmRelease tableManager retention configuration with limits_config.retention_period: 45d, adds compactor settings (compaction_interval: 15m, retention_delete_delay: 24h, retention_delete_worker_count: 150), and removes the prior tableManager block.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Loki Helm values (monitoring/logs)<br>`charts/base-cluster/templates/monitoring/logs/loki.yaml` | Added `limits_config.retention_period: 45d`; added compactor settings: `compaction_interval: 15m`, `retention_delete_delay: 24h`, `retention_delete_worker_count: 150`; removed the `tableManager` block that had `retention_deletes_enabled: true` and `retention_period: 6w`. |
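
For orientation, a minimal sketch of what the resulting values block likely looks like, assuming the `loki.loki` nesting referenced in the sequence diagram below (the surrounding keys and the `retention_enabled` flag are assumptions, not shown in this PR):

```yaml
loki:
  loki:
    limits_config:
      retention_period: 45d              # replaces the removed tableManager 6w (42d) retention
    compactor:
      retention_enabled: true            # assumption: required for compactor-driven retention deletion
      compaction_interval: 15m
      retention_delete_delay: 24h
      retention_delete_worker_count: 150
```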

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Helm as Helm chart
    participant Values as Loki values
    participant Loki as Loki components
    participant Compactor as Compactor process

    Helm->>Values: apply `loki.loki` values
    Values->>Loki: set `limits_config.retention_period = 45d`
    Values->>Compactor: configure compactor settings
    note over Compactor: compaction_interval: 15m\nretention_delete_delay: 24h\nretention_delete_worker_count: 150
    Compactor->>Loki: perform compaction & retention deletion runs
    note right of Loki: `tableManager` retention config removed
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Files to check closely:
    • charts/base-cluster/templates/monitoring/logs/loki.yaml — verify YAML structure, indentation, and compatibility with Loki Helm chart version and schema.
    • Ensure removal of tableManager does not leave deprecated references elsewhere.

Suggested reviewers

  • teutonet-bot

Poem

🐰 I hopped through charts with a twitch and a hum,
Set logs to forty-five days — tidy and plumb.
Compactors now chatter every fifteen minutes bright,
Old tableManager vanished into the night,
I nibble on code and watch deletions wait one day with delight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title "fix(base-cluster/loki): adjust retention settings for loki logs" directly and accurately describes the main changes in the pull request. The modifications—adding a retention_period to limits_config, configuring the compactor with retention deletion settings, and removing the legacy tableManager retention configuration—are all about adjusting how Loki handles log retention. The title is concise (63 characters, 7 words), uses clear language, and follows the conventional commits format. A teammate scanning the git history would immediately understand that this PR involves changes to Loki's retention configuration. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/loki-retention


@coderabbitai (Bot) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b61010 and 711d502.

📒 Files selected for processing (1)
  • charts/base-cluster/templates/monitoring/logs/loki.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Update release-please config file for a possibly new chart
  • GitHub Check: check licenses
  • GitHub Check: lint helm chart (base-cluster)
🔇 Additional comments (2)
charts/base-cluster/templates/monitoring/logs/loki.yaml (2)

80-81: Retention configuration migrated to limits_config.

The shift from legacy tableManager to limits_config with explicit retention_period is the correct modern approach for Loki 2.9+. The 45-day retention (45d) is a reasonable increase from the previous 6-week (42d) setting.

Please confirm:

  • What is the target Loki Helm chart version? (Ensure it supports limits_config & compactor retention settings)
  • Is there a migration plan for existing Loki instances or data that may be affected by the retention change?
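
To make the migration concrete, a hedged before/after sketch (the old block is reconstructed from the change summary above; its exact placement in the chart values is an assumption):

```yaml
# Before (legacy tableManager retention, removed in this PR):
loki:
  tableManager:
    retention_deletes_enabled: true
    retention_period: 6w

# After (compactor-based retention via limits_config):
loki:
  loki:
    limits_config:
      retention_period: 45d
```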

85-87: Verify the aggressive retention_delete_delay setting.

The default retention_delete_delay is 2 hours, but this configuration sets it to 1m. The 2-hour default deliberately introduces a buffer before logs are permanently deleted, leaving a recovery window in case of accidental deletion or misconfiguration.

The 1-minute setting is unusually aggressive. The delay serves two purposes: it gives index gateways time to pull updated index files that no longer reference marked chunks, so queries don't fail against deleted chunks, and it provides a short window to cancel chunk deletion after a configuration mistake. One minute may be too tight for either purpose.

By contrast, retention_delete_worker_count: 150 matches the default value and is appropriate.

Please confirm whether the 1-minute delay is intentional for your retention strategy, or whether it should be raised to a more conservative value such as the 2-hour default.
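
If the 1-minute delay is not intentional, a more conservative sketch would restore the documented default (values illustrative, not taken from the PR):

```yaml
compactor:
  compaction_interval: 15m
  retention_delete_delay: 2h            # Loki's documented default; keeps a rollback window
  retention_delete_worker_count: 150    # matches the default
```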

Comment thread: charts/base-cluster/templates/monitoring/logs/loki.yaml

cwrau previously approved these changes Oct 17, 2025
@coderabbitai (Bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
charts/base-cluster/templates/monitoring/logs/loki.yaml (1)

80-81: Consider adding additional limits configuration parameters.

The limits_config currently only sets retention_period. Depending on your cluster's log ingestion profile and SingleBinary resource constraints, you may want to add parameters such as:

  • ingestion_rate_mb: Controls the per-tenant ingestion rate in MB/s
  • max_streams_per_user: Limits the number of log streams per tenant
  • max_cache_freshness_per_query: Prevents caching of very recent query results that may still change

This is not a blocker, as the current minimal configuration is valid, but consider whether rate limits or cache settings are needed for your environment.
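
Purely as an illustration, such a block might look like the following (every value here is a hypothetical starting point to be tuned per environment, not a recommendation from this PR):

```yaml
limits_config:
  retention_period: 45d
  ingestion_rate_mb: 8                  # hypothetical: per-tenant ingest rate limit
  ingestion_burst_size_mb: 16           # hypothetical: short burst allowance
  max_global_streams_per_user: 10000    # hypothetical: cap on active streams per tenant
  max_cache_freshness_per_query: 10m    # hypothetical: avoid caching very recent results
```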

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f1fd37 and 74666fe.

📒 Files selected for processing (1)
  • charts/base-cluster/templates/monitoring/logs/loki.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: check licenses
  • GitHub Check: Update release-please config file for a possibly new chart
  • GitHub Check: lint helm chart (base-cluster)
🔇 Additional comments (2)
charts/base-cluster/templates/monitoring/logs/loki.yaml (2)

80-87: ✅ Retention settings appropriately corrected.

The changes address the previous critical review comment: retention_delete_delay has been increased from 1m to 24h, which is the safe threshold for production Loki deployments. This prevents premature data loss and aligns with official Loki documentation recommendations. The retention_delete_worker_count: 150 and compaction_interval: 15m are appropriate for SingleBinary deployments.


80-87: ⚠️ Summary mentions tableManager removal, but it's not present in the provided code.

The AI summary states the PR "removes the prior tableManager block," but the full file context does not show a tableManager section. Either it was removed in a prior commit, or the summary refers to a previous iteration. This is a minor inconsistency but doesn't affect the correctness of the current changes.

@marvinWolff marvinWolff added this pull request to the merge queue Oct 27, 2025
Merged via the queue into main with commit 1985d34 Oct 27, 2025
31 of 32 checks passed
@marvinWolff marvinWolff deleted the fix/loki-retention branch October 27, 2025 10:10
github-merge-queue Bot pushed a commit that referenced this pull request Oct 27, 2025
🤖 I have created a release *beep* *boop*

---

## [10.0.1](base-cluster-v10.0.0...base-cluster-v10.0.1) (2025-10-27)


### Bug Fixes

* **base-cluster/descheduler:** don't remove pods with too many restarts ([#1744](#1744)) ([9c1ed51](9c1ed51))
* **base-cluster/ingress:** add missing `prometheus` block 🙄 ([#1767](#1767)) ([a329e1a](a329e1a))
* **base-cluster/loki:** adjust retention settings for loki logs ([#1745](#1745)) ([1985d34](1985d34))
* **base-cluster/monitoring:** use the correct prometheus datasource id ([#1764](#1764)) ([511cc84](511cc84))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed descheduler to prevent removal of pods with excessive restart counts
  * Added missing Prometheus monitoring configuration to ingress
  * Adjusted log retention settings in Loki
  * Corrected Prometheus datasource ID in monitoring

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>