Avoid deletion of blocks which are not shipped #3346
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
// If there is any issue with the shipper, we should be conservative and not delete anything.
level.Error(util.Logger).Log("msg", "failed to read shipper meta during deletion of blocks", "user", u.userID, "err", err)
return nil
Should we have a metric for this to alert on? Because if this issue persists, queries could skip some data and the disk can run out of space.
In our alerts we do alert on disk running out of space. I'd say that + error message is enough.
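The conservative behaviour in the reviewed snippet can be illustrated with a minimal, standalone sketch. It is not the PR's actual code: `filterShipped`, `readShipped`, and `errMetaUnreadable` are hypothetical names, and block IDs are plain strings rather than ULIDs. The point is only that a failure to read the shipper meta results in nothing being deleted.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errMetaUnreadable is a hypothetical error used to simulate a shipper meta read failure.
var errMetaUnreadable = errors.New("shipper meta unreadable")

// filterShipped keeps only the candidate block IDs that readShipped reports as uploaded.
// If readShipped fails, it logs the error and returns nil, so the caller deletes nothing.
func filterShipped(candidates []string, readShipped func() (map[string]struct{}, error)) []string {
	shipped, err := readShipped()
	if err != nil {
		log.Printf("failed to read shipper meta during deletion of blocks: %v", err)
		return nil
	}
	var out []string
	for _, id := range candidates {
		if _, ok := shipped[id]; ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	failing := func() (map[string]struct{}, error) { return nil, errMetaUnreadable }
	fmt.Println("deletable on read failure:", len(filterShipped([]string{"A", "B"}, failing)))
}
```

As the review comments note, if the read failure persists, blocks accumulate on disk, so this path relies on the error log and disk-space alerts to be noticed.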
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Can we invert this and make deleteable
Thanks @codesome for working on this! A couple of things:
- How difficult would it be to add a unit test for this?
- I was checking in one of our production clusters whether `thanos.shipper.json` actually contains all shipped blocks, and I've found an ingester with some blocks missing from the shipper meta file (but successfully shipped to the bucket). I'm investigating the root cause, but this could be a blocker for this PR.
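A check like the one described above could be sketched as follows. This is only an illustration under stated assumptions: it assumes the shipper meta file is JSON with a `version` field and an `uploaded` array of block IDs, treats IDs as plain strings, and uses hypothetical names (`shipperMeta`, `missingFromMeta`) and placeholder IDs that are not real ULIDs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// shipperMeta mirrors the assumed layout of thanos.shipper.json.
type shipperMeta struct {
	Version  int      `json:"version"`
	Uploaded []string `json:"uploaded"`
}

// missingFromMeta returns the local block IDs that are not listed as uploaded in the meta JSON.
func missingFromMeta(metaJSON []byte, localBlocks []string) ([]string, error) {
	var m shipperMeta
	if err := json.Unmarshal(metaJSON, &m); err != nil {
		return nil, err
	}
	uploaded := make(map[string]struct{}, len(m.Uploaded))
	for _, id := range m.Uploaded {
		uploaded[id] = struct{}{}
	}
	var missing []string
	for _, id := range localBlocks {
		if _, ok := uploaded[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing, nil
}

func main() {
	meta := []byte(`{"version":1,"uploaded":["block-1"]}`) // placeholder IDs
	missing, err := missingFromMeta(meta, []string{"block-1", "block-2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("not recorded as shipped:", missing)
}
```

A block reported here may still be present in the bucket, which is exactly the mismatch described in the comment; the sketch only compares local blocks against the meta file.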
This could fix it: thanos-io/thanos#3321
Not super sure about how querying is set up, but queriers depend on ingesters for some data for up to some hours, and there might be gaps in queries if we aggressively delete blocks like that.
I will check, might need some mocking of the shipper.
We should never delete a block until the (configured) retention period is reached. What we're doing in this PR is adding an extra protection so that a block is not deleted until it has been shipped (shipping can be delayed by, e.g., authentication or networking issues).
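The two conditions described here (retention reached, and block already shipped) can be sketched as a single filter. This is a simplified illustration, not the code in this PR: the `block` type and `deletableBlocks` function are hypothetical, IDs are plain strings, and timestamps are assumed to be milliseconds.

```go
package main

import "fmt"

// block is a hypothetical, simplified view of a local TSDB block.
type block struct {
	ID      string // block ID (a ULID in practice; a plain string here)
	MaxTime int64  // timestamp of the newest sample, in milliseconds
}

// deletableBlocks returns the IDs of blocks that are at least retentionMs old
// AND are recorded as shipped. Failing either condition keeps the block on disk.
func deletableBlocks(blocks []block, shipped map[string]struct{}, retentionMs, nowMs int64) []string {
	var out []string
	for _, b := range blocks {
		if nowMs-b.MaxTime < retentionMs {
			continue // retention period not reached yet
		}
		if _, ok := shipped[b.ID]; !ok {
			continue // not shipped yet: extra protection against data loss
		}
		out = append(out, b.ID)
	}
	return out
}

func main() {
	blocks := []block{
		{ID: "A", MaxTime: 0},    // old and shipped
		{ID: "B", MaxTime: 0},    // old but not shipped
		{ID: "C", MaxTime: 9000}, // shipped but still within retention
	}
	shipped := map[string]struct{}{"A": {}, "C": {}}
	fmt.Println(deletableBlocks(blocks, shipped, 5000, 10000))
}
```

In this sketch, shipping alone never makes a block deletable; it only lifts the extra protection once retention has already been reached.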
Signed-off-by: Ganesh Vernekar <[email protected]>
…pped-blocks Signed-off-by: Ganesh Vernekar <[email protected]>
@pracucci I have added a unit test now. I think we have to first get thanos-io/thanos#3321 vendored and then go with this, right?
@codesome Correct. I opened a PR #3363 for the Thanos upgrade.
LGTM, thanks! But let's wait for #3363 before merging.
LGTM, left non-blocking nits.
Signed-off-by: Ganesh Vernekar <[email protected]>
Thanos upgrade PR has been merged, so we can merge this PR too.
…rgid-ctx
* 'master' of github.com:cortexproject/cortex:
  Enforce integration tests default flags config to never be overwritten (cortexproject#3370)
  Avoid deletion of blocks which are not shipped (cortexproject#3346)
  Upgrade Thanos to latest master (cortexproject#3363)
  Migrate CircleCI workflows to GitHub Actions (2/3) (cortexproject#3341)
  Remove comments that doesn't seem right (cortexproject#3361)
  add ingester interface (cortexproject#3352)
  Fail fast an ingester if unable to load existing TSDBs (cortexproject#3354)
  Fixed Gossip memberlist members joining when addresses are configured using DNS-based service discovery (cortexproject#3360)
  Export distributor method to get ingester replication set (cortexproject#3356)
  Correct link for Block Storage reference (cortexproject#3234)
  Added section on Cleaner. (cortexproject#3327)
  Update prometheus vendor to master (cortexproject#3345)
  adding GHA CI env variable check (cortexproject#3351)
  Add ingesters shuffle sharding support on the read path (cortexproject#3252)
This is for blocks storage ingesters
Which issue(s) this PR fixes:
Fixes #2868
Checklist
`CHANGELOG.md` updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]