Skip to content

Conversation

pstibrany
Copy link
Contributor

@pstibrany pstibrany commented Nov 12, 2020

What this PR does: This PR adds ability for ingester to close idle TSDB and delete local data when TSDB is idle for a long time (no data is appended to it). Closing/deletion only happens if head is empty (possibly thanks to previous force-compaction due to being idle), and all blocks were shipped.

This PR also takes care of hiding user metrics, while keeping the registry to avoid breaking the counters over all users. Support for metrics removal has been extracted to separate PR, #3511.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job! I did a first pass review and I will do a more accurate review once initial feedback has been discussed and/or addressed. I would also mark this feature as experimental in the "v1 guarantees" doc page: WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised this check has a problem: the lastUpdate is not correct after an ingester restart. An option could be checking the timestamp against max(lastUpdate, TSDB max time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TSDB is opened, we set lastUpdate to current time to avoid premature idle flush and close. This will only be a problem if ingesters never stay up for longer than configured idle timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of current time, we can set it to last sample time from TSDB when opening, as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TSDB is opened, we set lastUpdate to current time to avoid premature idle flush and close.

Then we're protected 👍

Instead of current time, we can set it to last sample time from TSDB when opening, as you suggest.

Could make sense.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be great to have a couple of metrics (counters) on succeeded and failed closing idle TSDBs. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added metrics for number of idle-checks, and various check results. PTAL

Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Peter Štibraný <[email protected]>
@pracucci
Copy link
Contributor

Thanks @pstibrany for addressing my comments. LGTM 👍

@pstibrany pstibrany merged commit 0995481 into cortexproject:master Nov 26, 2020
@pstibrany pstibrany mentioned this pull request Dec 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants