-
-
Notifications
You must be signed in to change notification settings - Fork 516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dolt gc --full: Implement a flag for GC which will collect everything, including the old gen. #8462
Conversation
…mode GC through to doltdb.DDB.GC call.
…not a *NomsBlockStore.
…ns a finalizer. Previously MarkAndSweepChunks would finalize itself by either appending the table table files to the store or by swaping to the table files. Now it returns a GCFinalizer instance which is used by the caller to effect the desired change. This is necessary for implementing Full GC, where we actually want to do both with the same GC result, but at different phases of the GC process. This commit leaves the HasManyF return from NomsBlockStore.MarkAndSweepChunks as a TODO.
…added to the store as part of the GC can be checked for their contents.
This does not yet work because the source ChunkStore is currently newGen. Still reworking so that the generational store itself is the source.
Change ValueStore's handling of generational chunk stores, so that it sweeps chunks from the generational store itself, and into new gen and old gen. This will support the full gc access pattern where we need to copy chunks from old gen into a new table file which will appear in old gen itself.
… that dolt gc --full removes data from old gen which is now unreachable.
…ves will still break `dolt gc --full`.
…sed by just fetching the chunks and re-compressing them. This makes `dolt gc` work while sourcing the chunks from the generational store and makes `dolt gc --full` work against databases which have been previously archived. `dolt gc --full` will still completely undo the archiving in the process of running the garbage collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit comments and questions. No show stoppers.
go/store/chunks/chunk_store.go
Outdated
@@ -152,6 +152,32 @@ type LoggingChunkStore interface { | |||
|
|||
var ErrAddChunkMustBlock = errors.New("chunk keeper: add chunk must block") | |||
|
|||
type HasManyF func(ctx context.Context, hashes hash.HashSet) (absent hash.HashSet, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HasManyFunc
would better, IMO. It's just 24bits man. come on.
Also, gDoc. Dangling function defs like this end up being abused if you don't.
go/store/types/value_store.go
Outdated
if err != nil { | ||
return err | ||
} | ||
|
||
root, err := lvs.Root(ctx) | ||
if err != nil { | ||
newGen.EndGC() | ||
collector.EndGC() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you aren't simplifying this and doing a defer collector.EndGC()
in both places where you BeginGC()
?
Looks like we always want to EndGC() so it would be more readable if you did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just a lot of nesting...but in converting it I found a place where we forgot the EndGC. So...this was definitely good feedback. Thanks for it.
@@ -139,7 +139,7 @@ func testGarbageCollection(t *testing.T, test gcTest) { | |||
} | |||
} | |||
|
|||
err := dEnv.DoltDB.GC(ctx, nil) | |||
err := dEnv.DoltDB.GC(ctx, types.GCModeDefault, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are weird. Just an observation. I was thinking that I was going to ask for a test for the Full flag here, but it looks like these tests are basically just running Exec. Maybe bats is better than this?
sqlengine tests for this instead? Just sayin. These tests are weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think these tests are worth much compared to a bats test which does the same. Obviously they're supposed to target the DoltDB.GC invocation itself, and they're much easy to attach a debugger to or whatever, but AFAICT they're not providing much in the way of protection compared to existing layers of cheese.
@@ -343,7 +343,24 @@ func (ms *MemoryStoreView) EndGC() { | |||
ms.transitionToNoGC() | |||
} | |||
|
|||
func (ms *MemoryStoreView) MarkAndSweepChunks(ctx context.Context, hashes <-chan []hash.Hash, dest ChunkStore) error { | |||
type msvGcFinalizer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After GC, what does the memory store contain? This refactor looks like it preserves previous behavior, but I'm not sure why the previous behavior is the way it is. Educate me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, memory store is an in-memory chunk store implementation. After GC, memory store just contains all the visited chunks – that is, all the chunks whose address came through on the keepers channel.
go/store/nbs/store.go
Outdated
@@ -1037,6 +1037,34 @@ func (nbs *NomsBlockStore) HasMany(ctx context.Context, hashes hash.HashSet) (ha | |||
return nbs.hasMany(toHasRecords(hashes)) | |||
} | |||
|
|||
func (nbs *NomsBlockStore) hasManyInSources(ctx context.Context, srcs []hash.Hash, hashes hash.HashSet) (hash.HashSet, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ctx future proofing for when we might need it? No used currently. If you think the interface needs it, I'd drop the name and make it '_' so people know that this is intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just didn't realize it was unneeded. Removed! Thanks for that :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__/___
_____/______|
_______/_____\_______\_____
\ < < < |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dolt GC is generational. By default, it moves all chunks reachable from commits into the old gen, and it leaves uncommitted data in the new gen. By default, it never revisits the chunks in the old gen. This means that if a branch with a lot of unique data exists during a GC, and then gets deleted, the storage taken up by that branch is never reclaimed by future garbage collection.
This PR implements
dolt gc --full
, which performs a collection on the entire database, ignoring the generational aspects of previous collections and visiting the entire set of live data. At the end of a successfuldolt gc --full
, only data which was reachable at the start of the run and data which has been written during the run will be present in the database.Note:
dolt gc --full
visits every reachable chunk in the database and can be resource intensive. As a consequence of runningdolt gc --full
, any previously archived table files will be unarchived, and storage savings from the archiving will be lost.dolt gc --full
involves copying chunks into new storage files, and fully materializing the new files before the old files are deleted; as a consequence, peak disk utilisation while runningdolt gc --full
can be up to 2x the existing size of the database.