blobstore: add chunked-object mode to GitBlobstore #10424
Pull request overview
This PR introduces an opt-in chunked-object representation for GitBlobstore to work around Git's single-blob size limitations. The implementation adds a MaxPartSize configuration option that, when enabled, splits large objects into multiple part blobs referenced by a descriptor blob. The descriptor uses a simple text format with a magic string ("DOLTBS1") for easy detection.
Changes:
- Added internal `gitbs` package with descriptor encoding/parsing, range operations, and part path generation
- Enhanced Get path to transparently detect and read chunked objects via descriptor parsing
- Refactored Put and CheckAndPut to support chunked writes when MaxPartSize is configured
- Improved Put to use create-only CAS (zero OID) when creating refs to prevent losing concurrent writes
- Added comprehensive test coverage for chunked operations, including multipart reads and range queries
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| go/store/blobstore/internal/gitbs/descriptor.go | Implements descriptor encoding/parsing with validation and DOLTBS1 magic detection |
| go/store/blobstore/internal/gitbs/descriptor_test.go | Tests for descriptor round-trip, validation, and error cases |
| go/store/blobstore/internal/gitbs/descriptor_helpers_test.go | Tests for internal descriptor parsing helpers |
| go/store/blobstore/internal/gitbs/ranges.go | Implements range normalization and part slicing logic for offset/length operations |
| go/store/blobstore/internal/gitbs/ranges_test.go | Tests for range operations including negative offsets and boundary spanning |
| go/store/blobstore/internal/gitbs/ranges_helpers_test.go | Tests for internal range helper functions including overflow cases |
| go/store/blobstore/internal/gitbs/oid.go | OID validation accepting both upper and lower case hex characters |
| go/store/blobstore/internal/gitbs/parts_path.go | Generates deterministic fanout paths for part blobs under reserved namespace |
| go/store/blobstore/internal/gitbs/parts_path_test.go | Tests for part path generation including case normalization |
| go/store/blobstore/git_blobstore.go | Main implementation: adds MaxPartSize config, multipart reader, chunked write logic, and improved ref creation CAS |
| go/store/blobstore/git_blobstore_parts.go | Implements idempotent part blob staging to ensure reachability |
| go/store/blobstore/git_blobstore_parts_test.go | Tests part staging idempotency and reachability verification |
| go/store/blobstore/git_blobstore_multipart_test.go | Unit tests for multipart reader including offset handling and error cases |
| go/store/blobstore/git_blobstore_helpers_test.go | Test utilities and tests for refactored helper functions |
| go/store/blobstore/git_blobstore_chunked_get_test.go | Integration tests for reading chunked objects with various range operations |
| go/store/blobstore/git_blobstore_chunked_put_test.go | Integration tests for writing and verifying chunked objects |
| go/store/blobstore/git_blobstore_chunked_checkandput_test.go | Integration tests for CheckAndPut with chunked objects |
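
The range handling summarized for ranges.go can be illustrated with a small sketch. It assumes a convention in which a negative offset counts back from the end of the object and a length of 0 means "read to the end"; that convention, and the names `normalizeRange`, `sliceParts`, and `partSlice`, are assumptions for illustration and may not match the PR's actual code.

```go
package main

import "fmt"

// partSlice identifies bytes [Start, End) within the part at index Part.
type partSlice struct {
	Part       int
	Start, End int64
}

// normalizeRange converts (offset, length) into an absolute [start, end)
// over an object of total bytes. Assumed convention: a negative offset
// counts from the end, and length 0 means "to the end of the object".
func normalizeRange(total, offset, length int64) (start, end int64, err error) {
	if total < 0 || length < 0 {
		return 0, 0, fmt.Errorf("invalid total %d or length %d", total, length)
	}
	start = offset
	if offset < 0 {
		start = total + offset
	}
	if start < 0 || start > total {
		return 0, 0, fmt.Errorf("offset %d out of range for size %d", offset, total)
	}
	end = total
	// Compare against the remaining bytes so start+length cannot overflow.
	if length > 0 && length < total-start {
		end = start + length
	}
	return start, end, nil
}

// sliceParts maps the absolute range [start, end) onto parts of the given sizes.
func sliceParts(sizes []int64, start, end int64) []partSlice {
	if start >= end {
		return nil
	}
	var out []partSlice
	var pos int64 // absolute offset of the current part's first byte
	for i, sz := range sizes {
		partStart, partEnd := pos, pos+sz
		pos = partEnd
		if partEnd <= start {
			continue // part lies entirely before the range
		}
		if partStart >= end {
			break // part lies entirely after the range
		}
		s, e := int64(0), sz
		if start > partStart {
			s = start - partStart
		}
		if end < partEnd {
			e = end - partStart
		}
		out = append(out, partSlice{Part: i, Start: s, End: e})
	}
	return out
}

func main() {
	sizes := []int64{4, 4, 2}
	start, end, err := normalizeRange(10, -5, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(start, end, sliceParts(sizes, start, end))
}
```

A reader built on top of this would open only the parts named in the result and skip `Start` bytes into the first one.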
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

    // tree -> blob OR tree overwrite: remove old child entries under <key>/...
    entries, err := gbs.api.ListTree(ctx, parent, key)
    if err != nil {
        return err
    }
    if len(entries) == 0 {
        return nil
    }
    paths := make([]string, 0, len(entries))
    for _, e := range entries {
        paths = append(paths, key+"/"+e.Name)
    }

When overwriting an existing tree at key, this only removes the immediate children (key/<name>). If any child is itself a subtree, removing key/<subdir> won’t remove leaf entries like key/<subdir>/file, so stale files can remain staged and end up in the new commit. This should remove all descendant index entries under key/ (e.g. list recursively and remove leaf paths) to guarantee a clean overwrite.

    - // tree -> blob OR tree overwrite: remove old child entries under <key>/...
    - entries, err := gbs.api.ListTree(ctx, parent, key)
    - if err != nil {
    -     return err
    - }
    - if len(entries) == 0 {
    -     return nil
    - }
    - paths := make([]string, 0, len(entries))
    - for _, e := range entries {
    -     paths = append(paths, key+"/"+e.Name)
    - }
    + // tree -> blob OR tree overwrite: remove all existing entries under <key>/...
    + var paths []string
    + var walkTree func(prefix string) error
    + walkTree = func(prefix string) error {
    +     entries, err := gbs.api.ListTree(ctx, parent, prefix)
    +     if err != nil {
    +         return err
    +     }
    +     for _, e := range entries {
    +         childPath := prefix + "/" + e.Name
    +         paths = append(paths, childPath)
    +         // If this child is itself a tree, recurse into it so we remove
    +         // all descendant entries from the index.
    +         _, childTyp, err := gbs.api.ResolvePathObject(ctx, parent, childPath)
    +         if err != nil {
    +             if git.IsPathNotFound(err) {
    +                 // Entry disappeared concurrently; just skip it.
    +                 continue
    +             }
    +             return err
    +         }
    +         if childTyp == git.ObjectTypeTree {
    +             if err := walkTree(childPath); err != nil {
    +                 return err
    +             }
    +         }
    +     }
    +     return nil
    + }
    + if err := walkTree(key); err != nil {
    +     return err
    + }
    + if len(paths) == 0 {
    +     return nil
    + }

Create db/gitblobstore-next-2a by leaving GitBlobstore.Concatenate unimplemented and removing concatenate-focused tests, while keeping chunked Get/Put/CheckAndPut work intact.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@reltuk and I had a conversation about this IRL and are moving forward to get the Git remotes feature out.
This PR introduces `GitBlobstore`, a Blobstore implementation backed by a git repository's object database (bare repo or .git dir). Keys are stored as paths in the tree of a commit pointed to by a configured ref (e.g. refs/dolt/data), enabling Dolt remotes to be hosted on standard git remotes.
High-level design
• Storage model
  • Each blobstore key maps to a git tree path under the ref's commit.
  • Small objects are stored as a single git blob at `<key>`.
  • Large objects (when chunking is enabled) are stored as a git tree at `<key>` containing part blobs:
    • `<key>/00000001`, `<key>/00000002`, … (lexicographically ordered)
  • No descriptor header and no stored total size; size is derived by summing part blob sizes.
  • Roll-forward only: this PR supports the above formats; it does not include backward compatibility for any older descriptor-based chunking formats.
• Per-key versioning
  • Get/Put/CheckAndPut return a per-key version equal to the object ID at `<key>`:
    • inline: blob OID
    • chunked: tree OID
• Idempotent `Put`
  • For non-`manifest` keys, Put fast-succeeds if `<key>` already exists (assumes content-addressed semantics common in NBS/table files), returning the existing per-key version without consuming the reader.
  • `manifest` remains mutable and is updated via CheckAndPut.
• `CheckAndPut` semantics
  • CheckAndPut performs CAS against the current per-key version at `<key>` (not against the HEAD commit hash).
  • Implementation uses a ref-level CAS retry loop:
    • re-checks version at current HEAD
    • only consumes/hashes the reader after the expected version matches
    • retries safely if the ref advances due to unrelated updates
• Blob↔tree transitions
  • Handles transitions between inline blob and chunked tree representations by proactively removing conflicting index paths before staging new entries (avoids git index file-vs-directory conflicts).
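
The CheckAndPut retry loop described above can be sketched against an in-memory stand-in for the git ref. Everything here (`refStore`, `casRef`, `checkAndPut`, and the `write` callback that stands in for consuming and hashing the reader) is hypothetical and exists only to show the control flow; the actual implementation works with git commits and object IDs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// refStore is a hypothetical in-memory stand-in for a git ref: head plays the
// role of the commit the ref points to, versions maps key -> per-key version.
type refStore struct {
	mu       sync.Mutex
	head     int
	versions map[string]string
}

// snapshot returns the current head and the version of key at that head.
func (s *refStore) snapshot(key string) (int, string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.head, s.versions[key]
}

// casRef commits key=version only if head still equals expectedHead.
func (s *refStore) casRef(expectedHead int, key, version string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.head != expectedHead {
		return false
	}
	s.versions[key] = version
	s.head++
	return true
}

// put writes unconditionally; used here to simulate an unrelated writer.
func (s *refStore) put(key, version string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.versions[key] = version
	s.head++
}

var errVersionMismatch = errors.New("per-key version mismatch")

// checkAndPut compares against the per-key version, not the head. write
// stands in for consuming the reader; it runs at most once, and only after
// the expected version has matched.
func checkAndPut(s *refStore, key, expected string, write func() string) (string, error) {
	var newVersion string
	written := false
	for {
		head, cur := s.snapshot(key)
		if cur != expected {
			return "", errVersionMismatch
		}
		if !written {
			newVersion = write()
			written = true
		}
		if s.casRef(head, key, newVersion) {
			return newVersion, nil
		}
		// The ref advanced; re-check the key's version at the new head.
	}
}

func main() {
	s := &refStore{versions: map[string]string{"manifest": "v1"}}
	v, err := checkAndPut(s, "manifest", "v1", func() string { return "v2" })
	fmt.Println(v, err)
}
```

Because the comparison is per key, an unrelated update that moves the ref between the snapshot and the CAS causes a retry rather than a failure.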
Internal git plumbing additions
Adds/uses a unified internal GitAPI abstraction to support:
• resolving path objects and types (blob vs tree)
• listing tree entries for chunked reads
• removing paths from the index in bare repos
• staging and committing new trees, with configurable author/committer identity fallback
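
As an illustration of how listing tree entries supports chunked reads, the sketch below names parts with zero-padded 8-digit indices and reassembles an object by visiting entries in lexicographic order, deriving the total size by summing part sizes. The map-based tree and the helpers `partName` and `readChunked` are assumptions for illustration, not the PR's GitAPI.

```go
package main

import (
	"fmt"
	"sort"
)

// partName returns the zero-padded 8-digit name for a 1-based part index, so
// that lexicographic order of names matches numeric order of parts.
func partName(i int) string {
	return fmt.Sprintf("%08d", i)
}

// readChunked reassembles an object from a tree of part blobs. tree is a
// hypothetical in-memory stand-in mapping entry name -> blob content.
// It returns the concatenated data and the total size (sum of part sizes).
func readChunked(tree map[string][]byte) ([]byte, int64) {
	names := make([]string, 0, len(tree))
	for name := range tree {
		names = append(names, name)
	}
	sort.Strings(names)

	var data []byte
	var total int64
	for _, name := range names {
		total += int64(len(tree[name]))
		data = append(data, tree[name]...)
	}
	return data, total
}

func main() {
	tree := map[string][]byte{
		partName(2):  []byte("b"),
		partName(10): []byte("c"),
		partName(1):  []byte("a"),
	}
	data, total := readChunked(tree)
	fmt.Printf("%s %d\n", data, total)
}
```

Without the zero padding, a name like "10" would sort before "2" and the parts would be concatenated out of order.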