DRIVERS-3031, BSON Binary Vector clarifications: goals, non-goals, terms, and scope #1753
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,25 +7,77 @@ ______________________________________________________________________ | |
|
|
||
| ## Abstract | ||
|
|
||
| This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors | ||
| here refer to densely packed arrays of numbers, all of the same type. | ||
| This document describes a new *Vector* subtype for BSON Binary items, used to compactly represent ordered collections of | ||
| uniformly-typed elements. A framework is presented for future type extensibility, but adoption complexity is limited by | ||
| allowing support for only a restricted set of element types at first: | ||
|
|
||
| ## Motivation | ||
| - 1-bit unsigned integer | ||
| - 8-bit signed integer | ||
| - 32-bit floating point | ||
|
|
||
| These representations correspond to the numeric types supported by popular numerical libraries for vector processing, | ||
| such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed | ||
| format used by these libraries can result in significant memory savings and processing efficiency. | ||
|
|
||
| ### META | ||
| ## Meta | ||
|
|
||
| The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and | ||
| "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). | ||
|
|
||
| Hexadecimal values are shown here with a `0x` prefix. | ||
|
|
||
| ## Terms | ||
|
|
||
| *BSON Array* - Arrays are a fundamental container type in BSON for ordered sequences, implemented as item type `4`. Each | ||
| element can have an arbitrary data type. The encoding is relatively high-overhead, due to both the non-uniform types and | ||
| the required element name strings. | ||
|
|
||
| *BSON Binary* - BSON Binary items (type `5`) are a container for a variable-length byte sequence with extensible | ||
| interpretation, according to an 8-bit *subtype*. | ||
|
|
||
| *BSON Binary Vector* - A BSON Binary item of subtype `9`. Also referred to here as a Vector. | ||
|
|
||
| ## Motivation for Change | ||
|
|
||
| BSON does not on its own provide a densely packed encoding for numeric data of uniform element type. Numbers stored in a | ||
| BSON Array have high space overhead, owing to the item name and type included with each value. This specification offers | ||
| an alternative data format with improved performance and limited complexity. | ||
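To make the overhead claim concrete, the following sketch hand-encodes the same 1000 small integers twice using only the Python standard library: once as the value of a BSON Array of int32 items, and once as the value of a subtype `9` Binary. The helper names are hypothetical. The INT8 dtype code `0x03` and the zero padding byte come from the wider specification rather than from this section, so treat them as assumptions.

```python
import struct


def bson_array_of_int32(values):
    """Encode values as the value bytes of a BSON Array (type 4) of int32 items."""
    body = b""
    for index, value in enumerate(values):
        # Each element carries a type byte (0x10 = int32), its decimal index
        # as a NUL-terminated key string, and a 4-byte little-endian value.
        body += b"\x10" + str(index).encode("ascii") + b"\x00" + struct.pack("<i", value)
    # Document framing: int32 total length (counting itself) and a trailing NUL.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


def bson_binary_int8_vector(values):
    """Encode values as the value bytes of a BSON Binary of subtype 9."""
    # Assumed header: dtype 0x03 (INT8), padding 0, followed by one byte per element.
    data = bytes([0x03, 0x00]) + struct.pack(f"<{len(values)}b", *values)
    # Binary framing: int32 data length, then the subtype byte, then the data.
    return struct.pack("<i", len(data)) + b"\x09" + data


values = [(i % 256) - 128 for i in range(1000)]  # all fit in a signed 8-bit integer
array_bytes = bson_array_of_int32(values)
vector_bytes = bson_binary_int8_vector(values)
print(len(array_bytes), len(vector_bytes))
```

For these 1000 elements the Array value occupies 8895 bytes and the Vector value 1007 bytes. Most of the difference comes from the per-element type byte and key string, with the rest due to the narrower element type.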
|
|
||
| ### Goals | ||
|
|
||
| - Vectors provide improved resource efficiency compared to BSON Arrays. | ||
| - Every Vector is guaranteed to represent a sequence of elements with uniform type and size. | ||
| - Vectors may be reliably compared for equality by comparing their encoded BSON Binary representation. | ||
| - Implementation complexity should be minimal. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - No changes to Extended JSON representation are defined. Vectors will serialize to Binary items with base64 encoding: | ||
| `{"$binary": {"base64": ... , "subType": "9" }}`. | ||
| - The Vector is a 1-dimensional container. Applications may implement multi-dimensional arrays efficiently by bundling a | ||
| Vector with additional metadata, but this usage is not standardized here. | ||
| - Comprehensive support for all possible data types and bit/byte ordering is not a goal. This specification prefers to | ||
| reduce complexity by limiting the set of allowed types and providing no unnecessary data formatting options. | ||
| - Vectors within a BSON document are NOT designed for "zero copy" access by direct architecture-specific load or store. | ||
| Typically multi-byte values will not be aligned as required, and they may need byte order conversion. Internal | ||
| padding for alignment is not supported, as this would impact comparison stability. | ||
| - Vectors do not include any data compression features. Applications may see benefit from careful choice of an external | ||
| compression algorithm. | ||
| - Vectors do not provide any new comparison methods beyond byte-equality. Vectors are never equal to Arrays, even when | ||
| they represent the same numeric elements. Vectors of different element types are not comparable. | ||
| - Vectors do not guarantee that element types defined in the future will always be scalar numbers, only that elements of | ||
| a Vector always have identical type and size. | ||
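As a small illustration of the first non-goal, the sketch below renders Vector data bytes in the existing Extended JSON form for Binary items, using only the Python standard library. The function name and the sample payload are hypothetical; the payload is arbitrary apart from the assumption that its first two bytes form the header.

```python
import base64
import json


def vector_to_extended_json(data: bytes) -> str:
    """Render the data bytes of a subtype 9 Binary as Extended JSON text."""
    return json.dumps(
        {"$binary": {"base64": base64.b64encode(data).decode("ascii"), "subType": "9"}}
    )


# Hypothetical payload: 2-byte header followed by four element bytes.
sample = b"\x03\x00\x01\x02\x03\x04"
print(vector_to_extended_json(sample))
```

Because no Vector-specific JSON form is defined, decoding the base64 field recovers exactly the original bytes, which is also what makes byte-equality comparison straightforward.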
|
|
||
| ## Specification | ||
|
|
||
| This specification introduces a new BSON binary subtype, the vector, with value `9`. | ||
| ### Scope | ||
|
Contributor
What is your intention in adding this scope section of bullet points? Are all of them covered in this pared-down pull-request?
...
Author
My intent was to agree on a scope for the specification. The specification should surely make an attempt to describe everything that's in scope, but this is my attempt to also describe the bounds of the scope. My earlier PR did make an attempt to align the content of the document with the claims here, but with the introduction alone it's an aspirational declaration that I'd hope would just let us get on the same page for the rest of this process.
Contributor
I like the idea of getting onto the same page. I also want each commit on main to be self-standing. The full rewrite, which is basically what #1752 is, if needed, is going to take some time.
Author
My understanding of the code review process is that we're aiming for incremental improvement, and each PR doesn't need to fully resolve all known outstanding problems. I estimated that declaring a scope improves the document even if the rest of the document is known to need further work. |
||
|
|
||
| Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. | ||
| - This specification defines the meaning of the data bytes in BSON Binary items of subtype `9`. | ||
| - The first two data bytes form a header, with meaning defined here. | ||
| - This specification defines validity criteria for accepting or rejecting byte strings. | ||
| - This specification includes JSON tests with valid documents, invalid documents, and expected conversion results. | ||
| - Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language | ||
| types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across | ||
| languages. | ||
| - Drivers SHOULD provide facilities for converting between BSON Binary Vector and BSON Array representations. When they | ||
|
Contributor
Python doesn't provide a facility to convert between BSON Arrays and Binary representations. We do nothing with BSON Arrays. (That said, everything mentioned in non-goals and motivation above is great.)
Author
My thinking here is that the Python reference implementation provides substantially similar functionality as far as users are concerned, since Python lists can be used as a close analog to BSON arrays. My preference in the spec and especially the spec tests would be to standardize conversion exactly around BSON arrays (avoiding Pythonisms) but to give individual drivers plenty of freedom to use the interfaces that make the most sense for their language. As an example, the C implementation has no applicable native container types, so the array-to-vector and vector-to-array conversions make natural additions to the API to fill the same functionality niche that from_vector and as_vector handle in Python. In C, it's important to describe access and conversion distinctly because both are possible and useful. If discussion reveals that this isn't actually as useful as I thought, I'm happy to downgrade it to MAY. I'd like to standardize these conversions if possible, since at the very least they seem like a natural way to implement language-independent tests.
Contributor
With the goal that it will provide more robust unified testing, then I am open to the idea. It would be good to get the implementers of each driver together to discuss unified testing, maybe early March?
Author
I'm ready for this as soon as possible, my current project is the C and C++ driver implementation and those PRs have a list of testing improvements that are being deferred pending spec work.
Contributor
What is the value for our users for the BSON Array <-> BSON Binary Vector conversion? In C# we did not see the need for the direct conversion yet. BSON Array is mostly used as internal representation by the driver, the end user just uses native types. If needed, the conversion can be easily done via Binary Vector <-> Native types <-> BsonArray transformation. I suspect that would be the case for most drivers. And this is covered by the previous bullet point ("closest compatible language types").
Author
It's for sure needed as an internal feature for test support, right? The current tests use an ad-hoc format that is scheduled for replacement by DRIVERS-3095. If there's truly no use to users we could define these conversions as for testing only, but in C I'd expect these conversions to be useful for ingesting and serializing data, and in all languages I would have thought they would provide useful migration tools. In designing the C API, I wanted to avoid requiring additional intermediate steps in operations that could be defined as direct conversions.
Contributor
Maybe it's because the conversions are pretty trivial in C# that I struggle to see the value of this additional clause from the C# point of view, given the previous one. I am not opposed to this clause, just trying to understand whether it brings any additional value to other drivers, as it seems not applicable in C#.
Contributor
@mdbmes What is the list of testing improvements?
Author
Understood. Even if it's simple, my thought is that it's the domain we want to be specifying conversions in. (As this spec already depends on BSON semantics, but should be language independent)
Author
See DRIVERS-3095, DRIVERS-3097, and the TODOs in mongodb/mongo-c-driver#1868 |
||
| choose to do so, they MUST ensure compliance using the provided tests. Drivers MUST NOT automatically convert | ||
| between representations. | ||
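The scope items above (a two-byte header, validity criteria, and explicit conversion to an array-like form) can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the dtype codes, little-endian FLOAT32 layout, and padding rules are assumptions drawn from the wider specification rather than from this section.

```python
import struct

# Assumed dtype codes; they are not stated in this section.
INT8, PACKED_BIT, FLOAT32 = 0x03, 0x10, 0x27


def encode_int8(values):
    """Build Vector data bytes (header plus elements) from signed 8-bit values."""
    return bytes([INT8, 0]) + struct.pack(f"<{len(values)}b", *values)


def decode_vector(data: bytes):
    """Validate Vector data bytes and return (dtype, padding, elements)."""
    if len(data) < 2:
        raise ValueError("vector data must start with a 2-byte header")
    dtype, padding, body = data[0], data[1], data[2:]
    if dtype == INT8:
        if padding != 0:
            raise ValueError("padding must be 0 for INT8")
        return dtype, padding, list(struct.unpack(f"<{len(body)}b", body))
    if dtype == FLOAT32:
        if padding != 0:
            raise ValueError("padding must be 0 for FLOAT32")
        if len(body) % 4 != 0:
            raise ValueError("FLOAT32 data length must be a multiple of 4")
        return dtype, padding, list(struct.unpack(f"<{len(body) // 4}f", body))
    if dtype == PACKED_BIT:
        # Padding counts the unused bits in the final byte (assumed range 0..7).
        if padding > 7 or (padding != 0 and not body):
            raise ValueError("invalid padding for PACKED_BIT")
        return dtype, padding, list(body)  # bits are left packed, one int per byte
    raise ValueError(f"unsupported dtype 0x{dtype:02x}")


print(decode_vector(encode_int8([1, -2, 127])))
```

The conversion here is an explicit call, in keeping with the requirement that drivers not convert between representations automatically. A Python list stands in for a BSON Array only as a convenience for the sketch.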
|
|
||
| ### Data Types (dtypes) | ||
|
|
||
|
|
@@ -247,6 +299,8 @@ See the [README](tests/README.md) for tests. | |
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-02-07: Documented goals, non-goals, terms, and scope. | ||
|
|
||
| - 2025-02-04: Update validation for decoding into a FLOAT32 vector. | ||
|
|
||
| - 2024-11-01: BSON Binary Subtype 9 accepted DRIVERS-2926 (#1708) | ||