Table GC: rely on tm state to determine operation mode #11972
shlomi-noach merged 7 commits into vitessio:main
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
    for {
        if ctx.Err() != nil {
            // cancelled
            return
Hmm, should the error be returned?
good catch, fixed!
I was debating whether this requires an issue. You have labeled it as an Enhancement; to me it looks like an Internal Cleanup without any user-facing change. As such, we can do without an issue, but we might want to change the label.
The unit test failures are very weird. This PR doesn't do anything that can affect
EDIT: unit tests are passing locally with a clean checkout. You may need to merge main again; I'm chalking this up to a bad merge.
Fixes #11986
Unit test failures still taking place, even after merging latest
Unit tests continue to fail for me on a clean checkout on a different box.
Can you create an issue with details and tag @vitessio/query-serving?
Created issue at #11995
…ests can turn it off Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
    isPrimary: (atomic.LoadInt64(&collector.isPrimary) > 0),
    IsOpen: (atomic.LoadInt64(&collector.isOpen) > 0),
Why LoadInt64 here, whereas we have removed atomic from the other places where we read isOpen? Do we not need synchronization anywhere else?
That's because isOpen is only changed in Open() and Close(), which are both using stateMutex.Lock() to exclude each other. So those two functions don't need to LoadInt64. But Status() is not protected with that same mutex. The tradeoff is that it must read isOpen with LoadInt64 because it might run at the same time Open() or Close() are running.
An alternative is to add stateMutex.Lock() to Status(), but I prefer not to.
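The tradeoff described in this reply can be sketched roughly as follows. This is a minimal illustration and not the actual Vitess code: the `collector` and `Status` types here are hypothetical stand-ins, and only the names `stateMutex`, `isOpen`, `Open()`, `Close()` and `Status()` come from the discussion. It assumes that writes to `isOpen` stay atomic, so that the lock-free read in `Status()` remains safe.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// collector is a hypothetical stand-in for the table GC collector.
type collector struct {
	stateMutex sync.Mutex
	isOpen     int64
}

// Status is a hypothetical status snapshot.
type Status struct {
	IsOpen bool
}

// Open holds stateMutex, so it cannot run concurrently with Close.
// It may therefore read isOpen directly. The write stays atomic
// because Status reads isOpen without taking the mutex.
func (c *collector) Open() {
	c.stateMutex.Lock()
	defer c.stateMutex.Unlock()
	if c.isOpen > 0 {
		return // already open
	}
	atomic.StoreInt64(&c.isOpen, 1)
}

// Close mirrors Open under the same mutex.
func (c *collector) Close() {
	c.stateMutex.Lock()
	defer c.stateMutex.Unlock()
	if c.isOpen == 0 {
		return // already closed
	}
	atomic.StoreInt64(&c.isOpen, 0)
}

// Status does not take stateMutex, so it may run while Open or Close
// is in progress and must load isOpen atomically.
func (c *collector) Status() Status {
	return Status{IsOpen: atomic.LoadInt64(&c.isOpen) > 0}
}

func main() {
	c := &collector{}
	c.Open()
	fmt.Println("open:", c.Status().IsOpen)
	c.Close()
	fmt.Println("open:", c.Status().IsOpen)
}
```

The alternative mentioned above, taking `stateMutex` in `Status()`, would remove the need for atomics entirely, at the cost of status calls blocking while `Open()` or `Close()` is running.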
    ctx := context.Background()
    ctx, collector.cancelOperation = context.WithCancel(ctx)
    go collector.operate(ctx)
Moving operate from init to Open: does this mean GC will start later than it does now? Do you see any issue because of that delay?
I don't see an issue, because Operate() works on intervals anyhow. There is no guarantee when exactly the cycles begin. The default cycle interval is 1h. So the precise timing of when the cycle begins is insignificant.
    case <-tableCheckTicker.C:
        {
            _ = collector.checkTables(ctx)
            log.Info("TableGC: tableCheckTicker")
Won't this log every 5 seconds on a primary tablet?
Same comments for the other logs on tickers
Won't this log every 5 seconds on a primary tablet?
The default interval is 1h, and I see no reason why in production it would be any less than several minutes. In testing, where we set it to 5s, yes. But then, do we care?
I removed the log entry for log.Info("TableGC: purgeReentranceTicker"), which runs once per 1m. Still very spacious IMO and shouldn't be a problem logging once per minute, but now removed. The rest of logs are either once per hour, or upon something actually being done.
I've made the necessary changes, just waiting to verify that all questions have been satisfied.
all righty, merging!
Description
This is a simplification and an improvement to the tablet's table garbage collector operation.
Currently, the collector periodically checks to see whether the tablet is a PRIMARY. If yes, it enables the general operation, enables tickers, probes tables etc. If not, it turns off the tickers and stops doing any GC work. But it keeps polling. There is a leaderCheckInterval = 5 * time.Second variable, and this is where and when the collector checks for the tablet type.
As of this PR, this logic is simplified, and the variable is removed. The garbage collector only runs on Primary tablets. As such, we just use the fact that the tablet's state manager Open()s and Close()s the collector when going into and out of Primary type, respectively. There is no more polling/interval. It happens when it needs to happen.
The function Operate() now gets called upon Open() and terminates upon Close() using a context. Some variables are removed from the class and made function-local.
Existing endtoend tests are left unchanged and validate that the new behavior is good.
Related Issue(s)
Checklist
Deployment Notes