-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3031, BSON Binary Vector clarifications: goals, non-goals, terms, and scope #1753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
kevinAlbs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we intentionally provide named reduced levels of support, like PACKED_BITS with padding=0 only?
I expect only the three defined data types are supported. I am unsure if there is much value in noting cases like PACKED_BITS with padding=0 only.
How do we intend to roll out updates that add additional formats?
IIUC the data format was designed in WRITING-20326. I expect future updates might similarly be driven by other teams. I see the purpose of this specification as defining expected driver behavior. Deciding how to update to the data format seems out-of-scope of this spec.
Co-authored-by: Kevin Albertson <kevin.albertson@10gen.com>
I'm fine with both of these points of view, it's a strict but low-complexity approach. This does imply that we might have some additional driver work to do in response to tests that fully exercise the formats described in the spec. (Making non-divisible-by-8 length packed_bits a fully supported type in all environments, and the additional strictness around unused bits/bytes.) |
|
@mdbmes Thank you for splitting your original PR into pieces. This will make it easier, I think. My first request is actually to add a sentence in a paragraph that you didn't edit. :) After "but it is represented in memory as a list of integers in [0, 255]." would you please add something like, "Thus, each number represents 8 values of zeros and ones"? |
caseyclements
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only small changes requested.
|
|
||
| This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors | ||
| here refer to densely packed arrays of numbers, all of the same type. | ||
| This document describes a new *Vector* subtype (9) for BSON Binary items, used to compactly represent ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please drop the 9 in this first sentence, or perhaps change to this?
"This document describes a new Vector subtype (9) of the BSON Binary type (5) ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
| - Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language | ||
| types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across | ||
| languages. | ||
| - Drivers SHOULD provide facilities for converting between BSON Binary Vector and BSON Array representations. When they |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python doesn't provide a facility to convert between BSON Arrays and Binary representations. We do nothing with BSON Arrays. (That said, everything mentioned in non-goals and motivation above is great.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking here is that the Python reference implementation provides substantially similar functionality as far as users are concerned, since Python lists can be used as a close analog to BSON arrays. My preference in the spec and especially the spec tests would be to standardize conversion exactly around BSON arrays (avoiding Pythonisms) but to give individual drivers plenty of freedom to use the interfaces that make the most sense for their language.
As an example, the C implementation has no applicable native container types, so the array-to-vector and vector-to-array conversions make natural additions to the API to fill the same functionality niche that from_vector and as_vector handle in Python. In C, it's important to describe access and conversion distinctly because both are possible and useful.
If discussion reveals that this isn't actually as useful as I thought, I'm happy to downgrade it to MAY. I'd like to standardize these conversions if possible, since at the very least they seem like a natural way to implement language-independent tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the goal that it will provide more robust unified testing, then I am open to the idea. It would be good to get the implementers of each driver together to discuss unified testing, maybe early March?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ready for this as soon as possible, my current project is the C and C++ driver implementation and those PRs have a list of testing improvements that are being deferred pending spec work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value for our users for the BSON Array <-> BSON Binary Vector conversion?
Do we anticipate some common use cases for such conversions?
In C# we did not see the need for the direct conversion yet. BSON Array is mostly used as internal representation by the driver, the end user just uses native types. If needed, the conversion can be easily done via Binary Vector <-> Native types <-> BsonArray transformation. I suspect that would be the case for most drivers. And this is covered by the previous bullet point. ( "closest compatible language types").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for sure needed as an internal feature for test support, right? The current tests use an ad-hoc format that is scheduled for replacement by DRIVERS-3095. If there's truly no use to users we could define these conversions as for testing only, but in C I'd expect these conversions to be useful for ingesting and serializing data, and in all languages I would have thought they would provide useful migration tools. In designing the C API, I wanted to avoid requiring additional intermediate steps in operations that could be defined as direct conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe because that in C# the conversions are pretty trivial, and that's why I struggle to see the value of this additional clause from C# pov, given the previous one.
BSON Array from/to BSON Vector conversion is just one liner in the test runner itself.
I am not opposed to this clause, just trying to understand whether it brings any additional value to other drivers, as it seems not applicable in C#.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdbmes What is the list of testing improvements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed to this clause, just trying to understand whether it brings any additional value to other drivers, as it seems not applicable in C#.
Understood. Even if it's simple, my thought is that it's the domain we want to be specifying conversions in. (As this spec already depends on BSON semantics, but should be language independent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the list of testing improvements?
See DRIVERS-3095, DRIVERS-3097, and the TODOs in mongodb/mongo-c-driver#1868
| ## Specification | ||
|
|
||
| This specification introduces a new BSON binary subtype, the vector, with value `9`. | ||
| ### Scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your intention in adding this scope section of bullet points? Are all of them covered in this pared-down pull-request?
- Define the interpretation of the data bytes in BSON Binary items of subtype
9. - Detail the first two data bytes that form a header.
...
- ...consuming Vector data in the closest type compatible with the driver's language...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent was to agree on a scope for the specification. The specification should surely make an attempt to describe everything that's in scope, but this is my attempt to also describe the bounds of the scope. My earlier PR did make an attempt to align the content of the document with the claims here, but with the introduction alone it's an aspirational declaration that I'd hope would just let us get on the same page for the rest of this process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of getting onto the same page. I also want each commit on main to be self-standing. The full rewrite, which is basically what #1752 is, if needed, is going to take some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the code review process is that we're aiming for incremental improvement, and each PR doesn't need to fully resolve all known outstanding problems. I estimated that declaring a scope improves the document even if the rest of the document is known to need further work.
Can we keep this PR on topic? Type descriptions have multiple issues, but I'd prefer to keep that discussion separate. I proposed changes to this section in #1752 which are being deferred so we can discuss piece by piece. |
|
Extracting this PR from #1752 hasn't improved the review process, and there's no support for this change. Closing. |
This is a nearly a subset of #1752, covering only the front matter: goals, non-goals, terms, and scope.
Minor changes from that PR:
Other minor notes:
DRIVERS-3031 is a set of clarifications instigated by an incorrect description of bit order. See the ticket, it's a deeper issue than a single description localized to one part of the spec. The fix for the order description necessitates clarifying which portions of the design are intended to be per-type vs. generalized. The original generalized description was incorrect, and I'd suggest we are better served by specifying characteristics of individual formats and avoiding unnecessary generalizations that affect future compatibility.
To analyze possible resolutions to DRIVERS-3031, it would help if the specification could document its own goals and scope. This PR offers new front matter, with very little text removed.
Specific details on removed text:
This PR intends to write down all the relevant goals and background that I am aware of, and invite discussion to refine the list as needed.
To discuss:
There are still unresolved questions about our compatibility goals. Should we intentionally provide named reduced levels of support, like PACKED_BITS with padding=0 only? How do we intend to roll out updates that add additional formats?
Please complete the following before merging:
[ ] Test changes in at least one language driver. No changes.[ ] Test these changes against all server versions and topologies (including standalone, replica set, shardedNo changes.clusters, and serverless).