Skip to content

Table GC: rely on tm state to determine operation mode#11972

Merged
shlomi-noach merged 7 commits intovitessio:mainfrom
planetscale:tablegc-operate-state
Jan 9, 2023
Merged

Table GC: rely on tm state to determine operation mode#11972
shlomi-noach merged 7 commits intovitessio:mainfrom
planetscale:tablegc-operate-state

Conversation

@shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Dec 15, 2022

Description

This is a simplification and an improvement to the tablet's table garbage collector operation.

Currently, the collector periodically checks to see whether the tablet is a PRIMARY. If yes, it enables the general operation, enables tickers, probes tables etc. If not, it turns off the tickers and stops doing any GC work. But it keeps polling. There is a leaderCheckInterval = 5 * time.Second variable, and this is where and when the collector checks for the tablet type.

As of this PR, this logic is simplified, and the variable is removed. The garbage collector only runs on Primary tablets. As such, we just use the fact that the tablet's state manager Open()s and Close()s the collector when going into and out of Primary type, respectively. There is no more polling/interval. It happens when it needs to happen.

The function Operate() now gets called upon Open() and terminates upon Close() using a context. Some variables are removed from the class and made function-local.

Existing endtoend tests are left unchanged and validate that the new behavior is good.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: TabletManager labels Dec 15, 2022
@shlomi-noach shlomi-noach requested a review from a team December 15, 2022 09:54
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Dec 15, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
deepthi
deepthi previously approved these changes Dec 16, 2022
Copy link
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

for {
if ctx.Err() != nil {
// cancelled
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should the error be returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed!

@deepthi
Copy link
Collaborator

deepthi commented Dec 16, 2022

I was debating whether this requires an issue. You have labeled it as an Enhancement, to me it looks like an Internal Cleanup without any user-facing change. As such, we can do without an issue, but we might want to change the label.

@deepthi
Copy link
Collaborator

deepthi commented Dec 16, 2022

The unit test failures are very weird. This PR doesn't do anything that can affect vtexplain at all.
If at all I would have expected these failures on #11941 but they seemed to have passed there.

EDIT: unit tests are passing locally with a clean checkout. You may need to merge main again, I'm chalking this up to a bad merge.

@deepthi deepthi dismissed their stale review December 16, 2022 01:05

Question that needs to be resolved.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

Fixes #11986

@shlomi-noach shlomi-noach added Type: Internal Cleanup and removed Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Dec 18, 2022
@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Dec 18, 2022

@deepthi I both changed the label and created an issue, #11986

@shlomi-noach
Copy link
Contributor Author

Unit test failures still taking place, even after merging latest main

@shlomi-noach
Copy link
Contributor Author

Unit tests continue to fail for me on a clean checkout on a different box.

@deepthi
Copy link
Collaborator

deepthi commented Dec 19, 2022

Unit tests continue to fail for me on a clean checkout on a different box.

can you create an issue with details and tag @vitessio/query-serving?
It's quite mysterious because they also seem to be passing on most merges into main.

@shlomi-noach
Copy link
Contributor Author

Created issue at #11995

…ests can turn it off

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

isPrimary: (atomic.LoadInt64(&collector.isPrimary) > 0),
IsOpen: (atomic.LoadInt64(&collector.isOpen) > 0),
IsOpen: (atomic.LoadInt64(&collector.isOpen) > 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why LoadInt64 here whereas we have removed atomic from other places where we are reading isOpen ? do we not need synchronization anywhere else ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because isOpen is only changed in Open() and Close(), which are both using stateMutex.Lock() to exclude each other. So those two functions don't need to LoadInt64. But Status() is not protected with that same mutex. The tradeoff is that it must read isOpen with LoadInt64 because it might run at the same time Open() or Close() are running.
An alternative is to add stateMutex.Lock() to Status(), but I prefer not to.


ctx := context.Background()
ctx, collector.cancelOperation = context.WithCancel(ctx)
go collector.operate(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving operate from init to Open. Does this means GC will start later then what it is currently now. Do you see any issue because of that delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an issue, because Operate() works on intervals anyhow. There is no guarantee when exactly the cycles begin. The default cycle interval is 1h. So the precise timing of when the cycle begins is insignificant.

case <-tableCheckTicker.C:
{
_ = collector.checkTables(ctx)
log.Info("TableGC: tableCheckTicker")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this log every 5 seconds on a primary tablet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments for the other logs on tickers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this log every 5 seconds on a primary tablet?

The default interval is 1h, and I see no reason why in production it would be any less then several minutes. In testing, where we set it to 5s - yes. But then, do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the log entry for log.Info("TableGC: purgeReentranceTicker"), which runs once per 1m. Still very spacious IMO and shouldn't be a problem logging once per minute, but now removed. The rest of logs are either once per hour, or upon something actually being done.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

I've made the necessary changes, just waiting to verify that all questions have been satisfied.

@shlomi-noach
Copy link
Contributor Author

all righty, merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants