Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented Feb 5, 2025

This is a change to the text of the BSON Binary Vector spec which intends to clarify the design without altering any of the intended technical content.

This PR addresses DRIVERS-3031. The spec claimed "All values use the little-endian format", but this was unclear and incorrect.

This PR makes no changes to tests. Test changes will be needed both for improved cross-language generality and for DRIVERS-3095. That will be a separate PR, possibly concurrent with CDRIVER-5641.

The new text avoids using "endian" terms except in one place where the inconsistency is specifically pointed out. The new version avoids making generalizations about all supported formats, and it instead describes the ordering for individual data formats.

Two main changes in overall writing style were included in the update:

  • The spec now has a neutral point of view with regard to low-level and high-level languages. Data type conversion is a specified but optional feature. Conversions and unpacking are positioned secondary to direct element access.
  • The spec now describes both the extensible interpretation of the header (field by field) and the tightly-specified interpretation (a list of allowed bit patterns) in order to offer organizing principles while making them explicitly optional. The "dtype" concept is no longer given special mention.

The "Padding" field is given a general interpretation whereas previously this was unclear. This general interpretation (items not bits, always from what would have been the vector's end) matches the one in the earlier BSON Binary Vector design document.

Additional reserved size codes are documented according to the table in the earlier design document.

Additional "SHOULD" validity criteria were added beyond what drivers currently implement or test:

  • Unused bits (indicated by Padding) SHOULD be zero
  • Vectors with multi-byte elements SHOULD NOT have any extra trailing bytes.

Before and after renderings:

https://specifications.readthedocs.io/en/latest/bson-binary-vector/bson-binary-vector/
https://specifications--1752.org.readthedocs.build/en/1752/bson-binary-vector/bson-binary-vector/

Please complete the following before merging:

  • Update changelog.
  • [ ] Test changes in at least one language driver. No implementation changes.
  • [ ] Test these changes against all server versions and topologies (including standalone, replica set, sharded
    clusters, and serverless).
    No implementation changes.

@ghost ghost marked this pull request as ready for review February 5, 2025 19:32
@ghost ghost self-requested a review as a code owner February 5, 2025 19:32
@ghost ghost requested review from dariakp and removed request for a team February 5, 2025 19:32
@ghost ghost changed the title Draft: DRIVERS-3031 Bit order is not accurately described by BSON binary vector spec DRIVERS-3031 Clarify BSON Binary Vector design and encoding Feb 5, 2025
| 5 bits | Second byte, most significant part | (reserved) |
| 3 bits | Second byte, least significant part | Padding |

Reserved bits MUST be zero.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a discussion about the reserved bytes with @caseyclements , but I am not sure I remember its outcome.
Do we know why should drivers be limiting the reserved bytes to 0?
If the server starts sending additional data, do we necessarily want to break the users on older driver versions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requirement follows from the design goal for comparison stability. If reserved bit contents were unspecified, we would fail to meet the goal that BSON Binary Vectors can be compared for equality by comparing the BSON Binary for equality.

The combination of MUST and SHOULD here is an implementation of the "robustness principle" for compatibility with existing implementations. If we were doing this from scratch, I'd encourage MUST both for production and validation.

paired with a Padding value:

- The driver MUST ensure Padding is zero if the byte array is empty
- The driver MUST ensure the unused bits in the final byte are zero
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required by the server?
From user's perspective this might be seen as inconvenience.
For example if the vector comes from another system (as it probably will), user would be required to zero the bits to avoid runtime exceptions. As opposed to just limiting the buffer length by "padding" parameter.

Also is this MUST consistent with SHOULD in:
"Drivers SHOULD reject Vectors with any unused bits in the final byte set to 1"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If packed bit data is coming from an external system that may leave unused bits set, it's a beneficial task to explicitly zero them. It's the comparison stability goal again.

PACKED_BITS values MAY be optionally losslessly unpacked to a wider data type of the driver's choosing, for more
convenient access. Drivers MUST provide a way to access PACKED_BITS without unpacking. In languages with compile-time
abstraction, drivers SHOULD provide an abstract data type for manipulating elements in PACKED_BITS without unpacking. If
abstraction is not practical, drivers can instead provide direct access to the byte array and 'Padding' value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In .NET we implemented deserializing the vector data to both

  1. Native types, like float[]
  2. New dedicated vector types like BinaryVectorFloat32

In the case of PACKED_BITS and the native byte[] representation (as opposed to BinaryVectorPackedBit) we are loosing the Padding value when reading. Which causes data corruption in a round trip. So we disallowed Padding != 0 with native types.
Do you think it's worth mentioning in spec?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a problem with the .NET implementation? I noticed this in a few places, drivers failing to preserve the 'Padding' value when representing PACKED_BITS as a byte array. This doesn't look correct to me.

This leads into a question about intended compatibility. Right now the spec assumes all drivers will provide a full implementation of the three data types. If drivers wish to provide fewer types, or an incomplete implementation of PACKED_BITS, that may motivate creating named levels of support.

@caseyclements caseyclements self-requested a review February 6, 2025 13:24
Copy link
Contributor

@caseyclements caseyclements left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If such a big change to a spec is proposed, please allow time for everyone to review.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

Hexadecimal values are shown here with a `0x` prefix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complete rewrite of the description of this binary subtype. It had been vetted by quite a few senior engineers. What is the intention here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Casey! This PR follows directly from multiple conversations on Slack, which then turned into DRIVERS-3031 which I'm now scheduled to work on. That ticket documents the motivations for the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you would like to clarify the sentence, "All values use the little-endian format", I would hope that you could do it by a change of no more than a couple lines.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at the clarifications here and let me know if there are specific problems with my proposed changes.

@kevinAlbs kevinAlbs self-requested a review February 6, 2025 20:10
@ghost ghost requested a review from qingyang-hu February 7, 2025 16:30
@ghost
Copy link
Author

ghost commented Feb 7, 2025

I just noticed #1750, which overlaps with this. @qingyang-hu and I both added the requirement not to have any extra data beyond complete multi-byte elements. (No partial FLOAT32)

In #1750 this was added as a hard requirement, but I wrote it as SHOULD here to avoid breaking existing drivers.

I do think it would be beneficial to describe this as a hard requirement. I'd like to add a similar hard requirement that unused bits in the final byte are zero'ed. (Upgrading the SHOULD here to a MUST.)

@ghost
Copy link
Author

ghost commented Feb 7, 2025

I've been asked to replace this PR with a suite of smaller changes.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants