-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3031 Clarify BSON Binary Vector design and encoding #1752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| | 5 bits | Second byte, most significant part | (reserved) | | ||
| | 3 bits | Second byte, least significant part | Padding | | ||
|
|
||
| Reserved bits MUST be zero. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a discussion about the reserved bytes with @caseyclements , but I am not sure I remember its outcome.
Do we know why should drivers be limiting the reserved bytes to 0?
If the server starts sending additional data, do we necessarily want to break the users on older driver versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requirement follows from the design goal for comparison stability. If reserved bit contents were unspecified, we would fail to meet the goal that BSON Binary Vectors can be compared for equality by comparing the BSON Binary for equality.
The combination of MUST and SHOULD here is an implementation of the "robustness principle" for compatibility with existing implementations. If we were doing this from scratch, I'd encourage MUST both for production and validation.
| paired with a Padding value: | ||
|
|
||
| - The driver MUST ensure Padding is zero if the byte array is empty | ||
| - The driver MUST ensure the unused bits in the final byte are zero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required by the server?
From user's perspective this might be seen as inconvenience.
For example if the vector comes from another system (as it probably will), user would be required to zero the bits to avoid runtime exceptions. As opposed to just limiting the buffer length by "padding" parameter.
Also is this MUST consistent with SHOULD in:
"Drivers SHOULD reject Vectors with any unused bits in the final byte set to 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If packed bit data is coming from an external system that may leave unused bits set, it's a beneficial task to explicitly zero them. It's the comparison stability goal again.
| PACKED_BITS values MAY be optionally losslessly unpacked to a wider data type of the driver's choosing, for more | ||
| convenient access. Drivers MUST provide a way to access PACKED_BITS without unpacking. In languages with compile-time | ||
| abstraction, drivers SHOULD provide an abstract data type for manipulating elements in PACKED_BITS without unpacking. If | ||
| abstraction is not practical, drivers can instead provide direct access to the byte array and 'Padding' value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In .NET we implemented deserializing the vector data to both
- Native types, like
float[] - New dedicated vector types like
BinaryVectorFloat32
In the case of PACKED_BITS and the native byte[] representation (as opposed to BinaryVectorPackedBit) we are loosing the Padding value when reading. Which causes data corruption in a round trip. So we disallowed Padding != 0 with native types.
Do you think it's worth mentioning in spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a problem with the .NET implementation? I noticed this in a few places, drivers failing to preserve the 'Padding' value when representing PACKED_BITS as a byte array. This doesn't look correct to me.
This leads into a question about intended compatibility. Right now the spec assumes all drivers will provide a full implementation of the three data types. If drivers wish to provide fewer types, or an incomplete implementation of PACKED_BITS, that may motivate creating named levels of support.
caseyclements
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If such a big change to a spec is proposed, please allow time for everyone to review.
| The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and | ||
| "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). | ||
|
|
||
| Hexadecimal values are shown here with a `0x` prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a complete rewrite of the description of this binary subtype. It had been vetted by quite a few senior engineers. What is the intention here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Casey! This PR follows directly from multiple conversations on Slack, which then turned into DRIVERS-3031 which I'm now scheduled to work on. That ticket documents the motivations for the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you would like to clarify the sentence, "All values use the little-endian format", I would hope that you could do it by a change of no more than a couple lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at the clarifications here and let me know if there are specific problems with my proposed changes.
|
I just noticed #1750, which overlaps with this. @qingyang-hu and I both added the requirement not to have any extra data beyond complete multi-byte elements. (No partial FLOAT32) In #1750 this was added as a hard requirement, but I wrote it as SHOULD here to avoid breaking existing drivers. I do think it would be beneficial to describe this as a hard requirement. I'd like to add a similar hard requirement that unused bits in the final byte are zero'ed. (Upgrading the SHOULD here to a MUST.) |
|
I've been asked to replace this PR with a suite of smaller changes. |
This is a change to the text of the BSON Binary Vector spec which intends to clarify the design without altering any of the intended technical content.
This PR addresses DRIVERS-3031. The spec claimed "All values use the little-endian format", but this was unclear and incorrect.
This PR makes no changes to tests. Test changes will be needed both for improved cross-language generality and for DRIVERS-3095. That will be a separate PR, possibly concurrent with CDRIVER-5641.
The new text avoids using "endian" terms except in one place where the inconsistency is specifically pointed out. The new version avoids making generalizations about all supported formats, and it instead describes the ordering for individual data formats.
Two main changes in overall writing style were included in the update:
The "Padding" field is given a general interpretation whereas previously this was unclear. This general interpretation (items not bits, always from what would have been the vector's end) matches the one in the earlier BSON Binary Vector design document.
Additional reserved size codes are documented according to the table in the earlier design document.
Additional "SHOULD" validity criteria were added beyond what drivers currently implement or test:
Before and after renderings:
https://specifications.readthedocs.io/en/latest/bson-binary-vector/bson-binary-vector/
https://specifications--1752.org.readthedocs.build/en/1752/bson-binary-vector/bson-binary-vector/
Please complete the following before merging:
[ ] Test changes in at least one language driver.No implementation changes.[ ] Test these changes against all server versions and topologies (including standalone, replica set, shardedNo implementation changes.clusters, and serverless).