Skip to content

Conversation

codesome
Copy link
Contributor

@codesome codesome commented Oct 14, 2020

This is for blocks storage ingesters

Which issue(s) this PR fixes:
Fixes #2868

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Ganesh Vernekar <[email protected]>
Comment on lines +126 to +128
// If there is any issue with the shipper, we should be conservative and not delete anything.
level.Error(util.Logger).Log("msg", "failed to read shipper meta during deletion of blocks", "user", u.userID, "err", err)
return nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a metric for this to alert on? Because if this issue persists, queries could skip some data and the disk can run out of space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our alerts we do alert on disk running out of space. I'd say that + error message is enough.

Ganesh Vernekar added 2 commits October 14, 2020 18:32
Signed-off-by: Ganesh Vernekar <[email protected]>
@roidelapluie
Copy link
Contributor

roidelapluie commented Oct 14, 2020

Can we invert this and make deleteable shipperMeta.Uploaded ? Or would that be 'too aggressive'?

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @codesome for working on this! A couple of things:

  1. How difficult would be adding a unit test on this?
  2. I was checking in one of our production clusters if thanos.shipper.json actually contains all shipped blocks and I've found an ingester with some blocks missing in the shipper meta file (but successfully shipped to the bucket). I'm investigating the root cause, but this could be a blocker for this PR

@pracucci
Copy link
Contributor

pracucci commented Oct 15, 2020

  1. I was checking in one of our production clusters if thanos.shipper.json actually contains all shipped blocks and I've found an ingester with some blocks missing in the shipper meta file (but successfully shipped to the bucket). I'm investigating the root cause, but this could be a blocker for this PR

This could fix it: thanos-io/thanos#3321

@codesome
Copy link
Contributor Author

Can we invert this and make deleteable shipperMeta.Uploaded ? Or would that be 'too aggressive'?

Not super sure about how querying is setup, but queriers depend on ingesters for some data for upto some hours and there might be gaps in queries if we aggressively delete blocks like that.

@codesome
Copy link
Contributor Author

How difficult would be adding a unit test on this?

I will check, might need some mocking of shipper

@pracucci
Copy link
Contributor

Can we invert this and make deleteable shipperMeta.Uploaded ? Or would that be 'too aggressive'?

Not super sure about how querying is setup, but queriers depend on ingesters for some data for upto some hours and there might be gaps in queries if we aggressively delete blocks like that.

We should never delete a block until the (configured) retention period is reached. What we're doing in this PR is adding an extra protection to not delete a block until shipped (eg. authentication or networking issues).

@pracucci
Copy link
Contributor

How difficult would be adding a unit test on this?

I will check, might need some mocking of shipper

A couple of things:

  1. In unit tests we already have a shipperMock 
  2. If you run the ingester with the local filesystem as backend, you should be able to actually run the real shipper

Ganesh Vernekar added 2 commits October 19, 2020 19:45
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 19, 2020
@codesome
Copy link
Contributor Author

@pracucci I have added a unit test now. I think we have to first get thanos-io/thanos#3321 vendored and then go with this, right?

@pracucci
Copy link
Contributor

@pracucci I have added a unit test now. I think we have to first get thanos-io/thanos#3321 vendored and then go with this, right?

@codesome Correct. I opened a PR #3363 for the Thanos upgrade.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! But let's wait for #3363 before merging.

@pracucci pracucci requested a review from pstibrany October 19, 2020 17:02
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left non-blocking nits.

Signed-off-by: Ganesh Vernekar <[email protected]>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanos upgrade PR has been merged, so we can merge this PR too.

@pracucci pracucci merged commit 40d8240 into cortexproject:master Oct 20, 2020
gotjosh added a commit to gotjosh/cortex that referenced this pull request Oct 20, 2020
…rgid-ctx

* 'master' of github.com:cortexproject/cortex:
  Enforce integration tests default flags config to never be overwritten (cortexproject#3370)
  Avoid deletion of blocks which are not shipped (cortexproject#3346)
  Upgrade Thanos to latest master (cortexproject#3363)
  Migrate CircleCI workflows to GitHub Actions (2/3) (cortexproject#3341)
  Remove comments that doesn't seem right (cortexproject#3361)
  add ingester interface (cortexproject#3352)
  Fail fast an ingester if unable to load existing TSDBs (cortexproject#3354)
  Fixed Gossip memberlist members joining when addresses are configured using DNS-based service discovery (cortexproject#3360)
  Export distributor method to get ingester replication set (cortexproject#3356)
  Correct link for Block Storage reference (cortexproject#3234)
  Added section on Cleaner. (cortexproject#3327)
  Update prometheus vendor to master (cortexproject#3345)
  adding GHA CI env variable check (cortexproject#3351)
  Add ingesters shuffle sharding support on the read path (cortexproject#3252)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Blocks should never be deleted by ingester's TSDB retention until shipped to the storage

4 participants