-
-
Notifications
You must be signed in to change notification settings - Fork 516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial implementation of Proximity Map #8408
base: main
Are you sure you want to change the base?
Conversation
@nicktobey DOLT
|
6b627d5
to
91a5a5d
Compare
@nicktobey DOLT
|
91a5a5d
to
2a75e85
Compare
@nicktobey DOLT
|
@coffeegoddd DOLT
|
I refactored the algorithm for creating the map, outlining it into more helper functions and adding more documentation. I also bumped go to 1.23 in order to use the new There's a couple potential flaws here depending on exactly how ProximityMaps get used, and they both occur if two rows have the same vector value:
|
@nicktobey DOLT
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything I looked at looks good. No strong opinions on the flatbuffer organization or the parameterization of the chunker factory. But I skipped all of prolly/proximity_map.go
, prolly/tree/proximity_map.go
, and deterministic_node_splitter.go
for now, so I will need to follow up on those.
if serial.ProllyTreeNodeNumFields < pm.Table().NumFields() { | ||
return nil, fb.ErrTableHasUnknownFields | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled by gen/fb/serial/vectorindexnode.go, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you're right. I was copying the code from prolly_map.go
, but this seems unnecessary.
@nicktobey DOLT
|
@nicktobey DOLT
|
@nicktobey DOLT
|
@coffeegoddd DOLT
|
…ncy on github.com/esote/minmaxheap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments for now. Following up a bit on map building and the approach in general offline as well.
"github.com/dolthub/dolt/go/store/hash" | ||
"github.com/dolthub/dolt/go/store/prolly/message" | ||
"github.com/dolthub/go-mysql-server/sql" | ||
"github.com/esote/minmaxheap" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need some changes to make this compile.
+ "github.com/dolthub/go-mysql-server/sql/expression"
- "sort"
- "fmt"
- "github.com/dolthub/dolt/go/store/prolly/message"
return WalkNodes(ctx, t.Root, t.NodeStore, cb) | ||
} | ||
|
||
// GetExact searches for an exact vector in the index, calling |cb| with the matching key-value pairs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems poorly named? It seems to get the closest / lowest distance vector. And the comment is maybe confusing because it only returns the first one, even if multiple are the same distance?
} | ||
} | ||
|
||
func (t ProximityMap[K, V, O]) Has(ctx context.Context, query K) (ok bool, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this always return true unless the map is empty?
if err != nil { | ||
return err | ||
} | ||
nextLevelNodes.Insert(keyAndDistance.key, node.GetValue(0), keyAndDistance.distance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To a reader, this would be more transparently correct without this special case handling of the 0
element here. In particular, while I was reading this, I was thinking about how the parent node key in a regular prolly tree is the last key in the child node, not the first, and in order to verify that the behavior is correct here, I would have to go and verify that we take the first and not the last key in the a proximity map instead. It's more clear to me to just start the loop below at 0
. (Admittedly potentially less efficient...)
Order: keyDesc, | ||
DistanceType: distanceType, | ||
Convert: func(bytes []byte) []float64 { | ||
h, _ := keyDesc.GetJSONAddr(0, bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to say that the thing we're storing in the keys of the prollytree is an address to a chunk that has the floats stored as a json array of json numbers?
jsonWrapper, err := doc.ToIndexedJSONDocument(ctx) | ||
if err != nil { | ||
panic(err) | ||
} | ||
floats, err := sql.ConvertToVector(jsonWrapper) | ||
if err != nil { | ||
panic(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding of the above is correct, we can't panic here. It's a not a fundamental logic error to see an I/O error fetching these chunks. These errors need to be returned through the interface and visible to a caller.
This adds a new message/node type: Vector Index, and a corresponding prolly-tree-based map structure: the Proximity Map
The Vector Index message currently has a subset of the Prolly Map message. A different message type was chosen to prevent them from being accidentally confused: while a Vector Index has the same fields as a Prolly Map, and has similar computed properties, their invariants and iteration order are completely different, and algorithms written for Prolly Map nodes should not accidentally operate on Vector Index nodes instead. Old versions of Dolt that do not support vector indexes should not attempt to manipulate Vector Index nodes as if they were Prolly Map nodes.
Proximity Maps resemble other Prolly Maps, but have the following invariants:
val.Tuple
, and the vector is the first value in that tuple.Notably, while the keys of an individual node are sorted, walking all of a vector indexes keys in standard iteration order will not be sorted.
This is a useful construct because it allows for efficient proximity-based lookup, which are instrumental for quickly running "approximate nearest neighbor" algorithms.
Currently, chunk boundaries are computed completely deterministically. No rolling hash or weibull distribution is used to control chunk sizes. This was chosen because it has the simplest guarentees about chunk boundaries, which makes it easier to reason about the cost of fixup operations, and makes cascading chunk boundary changes impossible: for example, a key that marks a chunk boundary on the first two levels will always mark a chunk boundary on the first two levels even if it gets moved when fixing up the tree.
This has its downsides: the chunk sizes follow a geometric distribution with no maximum or minimum size. This may change in the future, but is acceptable as a first pass.