DRIVERS-3031 Clarify BSON Binary Vector design and encoding #1752

ghost · 2025-02-05T04:18:12Z

This is a change to the text of the BSON Binary Vector spec which intends to clarify the design without altering any of the intended technical content.

This PR addresses DRIVERS-3031. The spec claimed "All values use the little-endian format", but this was unclear and incorrect.

This PR makes no changes to tests. Test changes will be needed both for improved cross-language generality and for DRIVERS-3095. That will be a separate PR, possibly concurrent with CDRIVER-5641.

The new text avoids using "endian" terms except in one place where the inconsistency is specifically pointed out. The new version avoids making generalizations about all supported formats, and it instead describes the ordering for individual data formats.

Two main changes in overall writing style were included in the update:

The spec now has a neutral point of view with regard to low-level and high-level languages. Data type conversion is a specified but optional feature. Conversions and unpacking are positioned secondary to direct element access.
The spec now describes both the extensible interpretation of the header (field by field) and the tightly-specified interpretation (a list of allowed bit patterns) in order to offer organizing principles while making them explicitly optional. The "dtype" concept is no longer given special mention.

The "Padding" field is given a general interpretation whereas previously this was unclear. This general interpretation (items not bits, always from what would have been the vector's end) matches the one in the earlier BSON Binary Vector design document.

Additional reserved size codes are documented according to the table in the earlier design document.

Additional "SHOULD" validity criteria were added beyond what drivers currently implement or test:

Unused bits (indicated by Padding) SHOULD be zero
Vectors with multi-byte elements SHOULD NOT have any extra trailing bytes.

Before and after renderings:

https://specifications.readthedocs.io/en/latest/bson-binary-vector/bson-binary-vector/
https://specifications--1752.org.readthedocs.build/en/1752/bson-binary-vector/bson-binary-vector/

Please complete the following before merging:

Update changelog.
~~[ ] Test changes in at least one language driver.~~ No implementation changes.
[ ] Test these changes against all server versions and topologies (including standalone, replica set, sharded
clusters, and serverless). No implementation changes.

BorisDog · 2025-02-05T21:50:13Z

source/bson-binary-vector/bson-binary-vector.md

+| 5 bits | Second byte, most significant part  | (reserved)  |
+| 3 bits | Second byte, least significant part | Padding     |
+
+Reserved bits MUST be zero.


We had a discussion about the reserved bytes with @caseyclements , but I am not sure I remember its outcome.
Do we know why should drivers be limiting the reserved bytes to 0?
If the server starts sending additional data, do we necessarily want to break the users on older driver versions?

This requirement follows from the design goal for comparison stability. If reserved bit contents were unspecified, we would fail to meet the goal that BSON Binary Vectors can be compared for equality by comparing the BSON Binary for equality.

The combination of MUST and SHOULD here is an implementation of the "robustness principle" for compatibility with existing implementations. If we were doing this from scratch, I'd encourage MUST both for production and validation.

BorisDog · 2025-02-06T00:18:11Z

source/bson-binary-vector/bson-binary-vector.md

+paired with a Padding value:
+
+- The driver MUST ensure Padding is zero if the byte array is empty
+- The driver MUST ensure the unused bits in the final byte are zero


Is this required by the server?
From user's perspective this might be seen as inconvenience.
For example if the vector comes from another system (as it probably will), user would be required to zero the bits to avoid runtime exceptions. As opposed to just limiting the buffer length by "padding" parameter.

Also is this MUST consistent with SHOULD in:
"Drivers SHOULD reject Vectors with any unused bits in the final byte set to 1"

If packed bit data is coming from an external system that may leave unused bits set, it's a beneficial task to explicitly zero them. It's the comparison stability goal again.

BorisDog · 2025-02-06T00:58:55Z

source/bson-binary-vector/bson-binary-vector.md

+PACKED_BITS values MAY be optionally losslessly unpacked to a wider data type of the driver's choosing, for more
+convenient access. Drivers MUST provide a way to access PACKED_BITS without unpacking. In languages with compile-time
+abstraction, drivers SHOULD provide an abstract data type for manipulating elements in PACKED_BITS without unpacking. If
+abstraction is not practical, drivers can instead provide direct access to the byte array and 'Padding' value.


In .NET we implemented deserializing the vector data to both

Native types, like float[]

New dedicated vector types like BinaryVectorFloat32

In the case of PACKED_BITS and the native byte[] representation (as opposed to BinaryVectorPackedBit) we are loosing the Padding value when reading. Which causes data corruption in a round trip. So we disallowed Padding != 0 with native types.
Do you think it's worth mentioning in spec?

This seems like a problem with the .NET implementation? I noticed this in a few places, drivers failing to preserve the 'Padding' value when representing PACKED_BITS as a byte array. This doesn't look correct to me.

This leads into a question about intended compatibility. Right now the spec assumes all drivers will provide a full implementation of the three data types. If drivers wish to provide fewer types, or an incomplete implementation of PACKED_BITS, that may motivate creating named levels of support.

caseyclements

If such a big change to a spec is proposed, please allow time for everyone to review.

caseyclements · 2025-02-06T13:30:07Z

source/bson-binary-vector/bson-binary-vector.md

 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
 "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

+Hexadecimal values are shown here with a `0x` prefix.


This is a complete rewrite of the description of this binary subtype. It had been vetted by quite a few senior engineers. What is the intention here?

Hi Casey! This PR follows directly from multiple conversations on Slack, which then turned into DRIVERS-3031 which I'm now scheduled to work on. That ticket documents the motivations for the change.

If you would like to clarify the sentence, "All values use the little-endian format", I would hope that you could do it by a change of no more than a couple lines.

Please take a look at the clarifications here and let me know if there are specific problems with my proposed changes.

ghost · 2025-02-07T16:35:01Z

I just noticed #1750, which overlaps with this. @qingyang-hu and I both added the requirement not to have any extra data beyond complete multi-byte elements. (No partial FLOAT32)

In #1750 this was added as a hard requirement, but I wrote it as SHOULD here to avoid breaking existing drivers.

I do think it would be beneficial to describe this as a hard requirement. I'd like to add a similar hard requirement that unused bits in the final byte are zero'ed. (Upgrading the SHOULD here to a MUST.)

ghost · 2025-02-07T20:03:56Z

I've been asked to replace this PR with a suite of smaller changes.

Micah Scott added 6 commits February 5, 2025 10:05

DRIVERS-3031 clarification for BSON Binary Vector spec

b265430

Changelog

f245486

Additional editing. Cleaned up repeats, refine validity criteria

a8788dc

Strict output even if input is looser for compatibility

2025703

Use "unpacked" carefully, and add extjson NaN

db01d98

Clarify 'trailing'

1e44424

ghost marked this pull request as ready for review February 5, 2025 19:32

ghost self-requested a review as a code owner February 5, 2025 19:32

ghost requested review from dariakp and removed request for a team February 5, 2025 19:32

ghost changed the title ~~Draft: DRIVERS-3031 Bit order is not accurately described by BSON binary vector spec~~ DRIVERS-3031 Clarify BSON Binary Vector design and encoding Feb 5, 2025

BorisDog reviewed Feb 6, 2025

View reviewed changes

caseyclements self-requested a review February 6, 2025 13:24

caseyclements requested changes Feb 6, 2025

View reviewed changes

kevinAlbs self-requested a review February 6, 2025 20:10

Micah Scott added 2 commits February 7, 2025 07:09

Consistency, header format in examples

6e87112

Merge branch 'master' into DRIVERS-3031

0841321

ghost requested a review from qingyang-hu February 7, 2025 16:30

ghost closed this Feb 7, 2025

ghost mentioned this pull request Feb 7, 2025

DRIVERS-3031, BSON Binary Vector clarifications: goals, non-goals, terms, and scope #1753

Closed

1 task

This pull request was closed.

DRIVERS-3031 Clarify BSON Binary Vector design and encoding #1752

DRIVERS-3031 Clarify BSON Binary Vector design and encoding #1752

Uh oh!

Conversation

ghost commented Feb 5, 2025 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

caseyclements left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Feb 7, 2025

Uh oh!

ghost commented Feb 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ghost commented Feb 5, 2025 •

edited by ghost

Loading