go/store: fix push latency growth for git-backed remotes #10597
coffeegoddd merged 14 commits into main
Pull request overview
This PR targets push latency growth for git-backed remotes by reducing per-push re-open work, improving git delta efficiency, and lowering git object growth in the cache repo.
Changes:
- Cache git-backed remote `DoltDB` instances across pushes to reuse already-opened table sources.
- Switch Git NBS persistence to write table files as a single blob (avoids `.records`/`.tail` intermediates).
- Adjust GitBlobstore remote-managed write semantics to use bounded parent chains for incremental deltas and add periodic `git gc` for cache repo repacking.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| go/store/nbs/store.go | Use singleBlobBSPersister for git-backed NBS stores. |
| go/store/nbs/single_blob_bs_persister.go | New persister that writes table files as a single blob while still supporting conjoin. |
| go/store/nbs/git_blobstore_empty_remote_test.go | Update test expectations for the new persister type. |
| go/store/nbs/bs_persister.go | Minor formatting-only change. |
| go/store/nbs/bs_manifest.go | Minor formatting-only change. |
| go/store/blobstore/internal/git/api.go | Extend GitAPI with RevListCount. |
| go/store/blobstore/internal/git/impl.go | Implement RevListCount using git rev-list --count. |
| go/store/blobstore/git_blobstore.go | Add separate write/pending locks, read-sync TTL dedup, bounded parent commits, and periodic git gc. |
| go/store/blobstore/git_blobstore_test.go | Update tests for new tracking behavior and add read-during-push concurrency test. |
| go/store/blobstore/git_blobstore_helpers_test.go | Update fakeGitAPI to satisfy the new interface method. |
| go/store/blobstore/git_blobstore_cache_merge_semantics_test.go | Disable read-sync dedup in a test to observe immediate remote mutations. |
| go/libraries/doltcore/sqle/database_provider.go | Cache git-backed remote DoltDB instances and close them on provider shutdown. |

    isGitRemote := strings.HasPrefix(strings.ToLower(r.Url), "git+")
    if isGitRemote {
    	p.mu.RLock()
    	if cached, ok := p.remoteDbs[r.Url]; ok {
    		p.mu.RUnlock()
    		return cached, nil
    	}
    	p.mu.RUnlock()
    }

    remoteDB, err := r.GetRemoteDB(ctx, format, dialer)
    if err != nil {
    	return nil, err
    }

    if isGitRemote {
    	p.mu.Lock()
    	p.remoteDbs[r.Url] = remoteDB
    	p.mu.Unlock()
    }
    return remoteDB, nil

The git-remote DB cache can race: two goroutines can both miss p.remoteDbs[r.Url], both open a new remote DB, and then the later store overwrites the earlier without closing it. To avoid leaking open remotes, consider re-checking the cache under p.mu.Lock() (double-checked locking) and closing the newly-opened DB if another goroutine already cached one.
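
The re-check described in this comment could look roughly like the sketch below. It is an illustration only: `provider`, `remoteDB`, `getOrOpen`, and the `open` callback are hypothetical stand-ins, not the types or functions in `database_provider.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// remoteDB is a hypothetical stand-in for the cached remote database handle.
type remoteDB struct {
	url    string
	closed bool
}

// Close marks the handle closed; a real handle would release resources here.
func (db *remoteDB) Close() error {
	db.closed = true
	return nil
}

// provider is a hypothetical, trimmed-down cache keyed by remote URL.
type provider struct {
	mu        sync.RWMutex
	remoteDbs map[string]*remoteDB
}

// getOrOpen returns the cached handle for url, opening one on a miss.
func (p *provider) getOrOpen(url string, open func(string) (*remoteDB, error)) (*remoteDB, error) {
	// Fast path: read lock only.
	p.mu.RLock()
	cached, ok := p.remoteDbs[url]
	p.mu.RUnlock()
	if ok {
		return cached, nil
	}

	// Open outside the lock so a slow open does not block other callers.
	db, err := open(url)
	if err != nil {
		return nil, err
	}

	// Re-check under the write lock: another goroutine may have won the race.
	p.mu.Lock()
	if cached, ok := p.remoteDbs[url]; ok {
		p.mu.Unlock()
		_ = db.Close() // close the duplicate instead of leaking it
		return cached, nil
	}
	p.remoteDbs[url] = db
	p.mu.Unlock()
	return db, nil
}

func main() {
	p := &provider{remoteDbs: map[string]*remoteDB{}}
	open := func(u string) (*remoteDB, error) { return &remoteDB{url: u}, nil }

	var wg sync.WaitGroup
	results := make([]*remoteDB, 8)
	for i := 0; i < len(results); i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = p.getOrOpen("git+https://example.com/repo.git", open)
		}(i)
	}
	wg.Wait()

	same := true
	for _, r := range results {
		if r != results[0] {
			same = false
		}
	}
	fmt.Println("all callers share one handle:", same)
}
```

Several callers may still open a remote concurrently on a cold cache, but only one handle is kept and the rest are closed. Whether that extra open is acceptable, or whether opens should be serialized per URL, is a design choice this sketch does not settle.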

    // instead of enumerating the full tree. After maxParentedCommits in the
    // existing chain, create a parentless commit to sever history so git gc can
    // prune old objects.
    var parentPtr *git.OID
    if hasParent && parent != "" {
    	depth, err := gbs.api.RevListCount(ctx, parent)
    	if err == nil && depth < maxParentedCommits {
    		p := parent
    		parentPtr = &p
    	}

RevListCount runs git rev-list --count <oid>, which can be expensive if the remote ref has a long history (e.g., externally mutated). Since you only need to know whether the depth exceeds maxParentedCommits, consider using a bounded traversal (e.g., rev-list --count --max-count=<maxParentedCommits+1>) or another O(1) heuristic to avoid scanning an unbounded history.
Suggested change:

    - // instead of enumerating the full tree. After maxParentedCommits in the
    - // existing chain, create a parentless commit to sever history so git gc can
    - // prune old objects.
    - var parentPtr *git.OID
    - if hasParent && parent != "" {
    - 	depth, err := gbs.api.RevListCount(ctx, parent)
    - 	if err == nil && depth < maxParentedCommits {
    - 		p := parent
    - 		parentPtr = &p
    - 	}
    + // instead of enumerating the full tree.
    + // NOTE: We deliberately avoid an unbounded history traversal (e.g. "git rev-list --count")
    + // here for performance reasons. This means we may keep longer parent chains than
    + // strictly necessary for GC, but avoids potentially expensive scans on large repos.
    + var parentPtr *git.OID
    + if hasParent && parent != "" {
    + 	p := parent
    + 	parentPtr = &p


    const readers = 25
    var wg sync.WaitGroup
    readErrs := make(chan error, readers)
    for range readers {

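This loop compiles only with Go 1.22 or later, which added range over integers. On older toolchains for range readers {} is rejected, because range there accepts only arrays, slices, maps, channels, and strings. If the module must build with an older Go version, use a counted loop (e.g., for i := 0; i < readers; i++ { ... }).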
Suggested change:

    - for range readers {
    + for i := 0; i < readers; i++ {

Cache remote DoltDB instances across pushes, use parented commits with bounded depth for incremental git deltas, write table files as single blobs instead of split .records/.tail intermediates, and run periodic git gc to repack cache repos.