From b265430e65098f35052dab04373ee339ca30d0dd Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 10:05:42 -0800 Subject: [PATCH 1/7] DRIVERS-3031 clarification for BSON Binary Vector spec --- .../bson-binary-vector/bson-binary-vector.md | 429 +++++++++--------- 1 file changed, 220 insertions(+), 209 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index edbeb5944b..6e5c65e636 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -7,238 +7,249 @@ ______________________________________________________________________ ## Abstract -This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors -here refer to densely packed arrays of numbers, all of the same type. +This document describes a new *Vector* subtype (9) for BSON Binary items, used to compactly represent ordered +collections of uniformly-typed elements. A framework is presented for future type extensibility, but adoption complexity +is limited by allowing support for only a restricted set of element types at first: -## Motivation +- 1-bit unsigned integers +- 8-bit signed integers +- 32-bit floating point -These representations correspond to the numeric types supported by popular numerical libraries for vector processing, -such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed -format used by these libraries can result in significant memory savings and processing efficiency. - -### META +## Meta The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). +Hexadecimal values are shown here with a `0x` prefix. 
+ +Bit strings are grouped with insignificant whitespace for readability. + +## Terms + +*BSON Array* - Arrays are a fundamental container type in BSON for ordered sequences, implemented as item type `4`. Each +element can have an arbitrary data type. The encoding is relatively high-overhead, due to both the non-uniform types and +the required element name strings. + +*BSON Binary* - BSON Binary items (type `5`) are a container for a variable-length byte sequence with extensible +interpretation, according to an 8-bit *subtype*. + +*BSON Binary Vector* - A BSON Binary item of subtype `9`. Also referred to here as a Vector. + +## Motivation for Change + +BSON does not on its own provide a densely packed encoding for numeric data of uniform data type. Numbers stored in a +BSON Array have high space overhead, owing to the item name and type included with each value. This specification offers +an alternative collection type with improved performance and limited complexity. + +### Goals + +- Vectors provide improved resource efficiency compared to BSON Arrays. +- Every Vector is guaranteed to represent a sequence of elements with uniform type and size. +- Vectors may be reliably compared for equality by comparing their encoded BSON Binary representation. +- Implementation complexity should be minimal. + +### Non-Goals + +- No changes to Extended JSON representation are defined. Vectors will serialize to generic Binary items with base64 + encoding: `{"$binary": {"base64": ... , "subType": "9" }}`. +- The Vector is a 1-dimensional container. Applications may implement multi-dimensional arrays efficiently by bundling a + Vector with additional metadata, but this usage is not standardized here. +- Comprehensive support for all possible data types and bit/byte ordering is not a goal. This specification prefers to + reduce complexity by limiting the set of allowed types and providing no unnecessary data formatting options. 
+- Vectors within a BSON document are NOT designed for "zero copy" access by direct architecture-specific load or store. + Typically multi-byte values will not be aligned as required, and they may need byte order conversion. Internal + padding for alignment is not supported, as this would impact comparison stability. +- Vectors do not include any data compression features. Applications may see benefit from careful choice of an external + compression algorithm. +- Vectors do not provide any new comparison methods. Identical Vector values must compare as identical encoded BSON + Binary byte strings. Vectors are never equal to Arrays, even when they represent the same numeric elements. +- Vectors do not guarantee that element types defined in the future will always be scalar numbers, only that Vector + elements always have identical type and size. + ## Specification -This specification introduces a new BSON binary subtype, the vector, with value `9`. +### Scope + +- This specification defines the meaning of the data bytes in BSON Binary items of subtype `9`. +- The first two data bytes form a header, with meaning defined here. +- This specification defines validity criteria for accepting or rejecting byte strings. +- Drivers may optionally implement conversions between BSON Array and Vector types. This specification defines rules + that must be followed when conversions are implemented. +- This specification includes JSON tests with valid documents, invalid documents, and expected conversion results. +- Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language + types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across + languages. +- Drivers MAY provide facilities for converting between BSON Binary Vector and BSON Array representations. When they + choose to do so, they MUST ensure compliance using the provided tests. 
Drivers MUST NOT automatically convert + between representations. + +### Header Format -Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. - -### Data Types (dtypes) +Every valid Vector begins with one of the following 2-byte header patterns: -Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. - -| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | -| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | -| `0x03` | INT8 | 8 | INT8 | -| `0x27` | FLOAT32 | 32 | FLOAT | -| `0x10` | PACKED_BIT | 1 `*` | BOOL | - -`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of -integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector -`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course, -some languages, Python for one, do not have an uint8 type, so must be represented as an int in memory, but not on disk. - -### Byte padding - -As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of -bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the -final byte that are to be ignored. The least-significant bits are ignored. - -### Binary structure - -Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers. - -- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may - increase. dtype is an unsigned integer. 
- -- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative - integer. It must be present, even in cases where it is not applicable, and set to zero. - -- The remainder contains the actual vector elements packed according to dtype. - -All values use the little-endian format. - -#### Example - -Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`. - -In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8. - -We can visualize the binary representation like so: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
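| 1st byte: dtype (from list in previous table) | 2nd byte: padding (values in [0,7]) | 1st uint8: 238 | 2nd uint8: 224 |
| --------------------------------------------- | ----------------------------------- | -------------- | -------------- |
| 00010000                                      | 00000100                            | 11101110       | 11100000       |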
+| Header bytes | Alias | Description | +| ------------ | ----------- | ------------------------------------------------------------------------------- | +| `0x03 0x00` | INT8 | signed bytes | +| `0x27 0x00` | FLOAT32 | single precision (32-bit) floating point, least significant byte first | +| `0x10 0x00` | PACKED_BITS | single-bit integers, most significant bit first, exact multiple of 8 bits total | +| `0x10 0x01` | PACKED_BITS | as above, final 1 bit ignored | +| `0x10` ... | PACKED_BITS | ... | +| `0x10 0x07` | PACKED_BITS | as above, final 7 bits ignored | -Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this! +Drivers MAY choose to interpret the header bytes as a structure with internal fields: -| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| Size | Location | Description | +| ------ | ----------------------------------- | ----------- | +| 4 bits | First byte, most significant half | Type code | +| 4 bits | First byte, least significant half | Size code | +| 5 bits | Second byte, most significant part | (reserved) | +| 3 bits | Second byte, least significant part | Padding | -## API Guidance +The generic interpretation of Padding refers to the number of items that should be ignored from what would have been the +end of the Vector, regardless of item size and bit order. -Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while -following idioms of the language of the driver. - -### Encoding - -``` -Function from_vector(vector: Iterable, dtype: DtypeEnum, padding: Integer = 0) -> Binary - # Converts a numeric vector into a binary representation based on the specified dtype and padding. 
- - # :param vector: A sequence or iterable of numbers (either float or int) - # :param dtype: Data type for binary conversion (from DtypeEnum) - # :param padding: Optional integer specifying how many bits to ignore in the final byte - # :return: A binary representation of the vector - - Declare binary_data as Binary - - # Process each number in vector and convert according to dtype - For each number in vector - binary_element = convert_to_binary(number, dtype) - binary_data.append(binary_element) - End For - - # Apply padding to the binary data if needed - If padding > 0 - apply_padding(binary_data, padding) - End If - - Return binary_data -End Function -``` - -Note: If a driver chooses to implement a `Vector` type (or numerous) like that suggested in the Data Structure -subsection below, they MAY decide that `from_vector` that has a single argument, a Vector. - -### Decoding - -``` -Function as_vector() -> Vector - # Unpacks binary data (BSON or similar) into a Vector structure. - # This process involves extracting numeric values, the data type, and padding information. - - # :return: A BinaryVector containing the unpacked numeric values, dtype, and padding. - - Declare binary_vector as BinaryVector # Struct to hold the unpacked data - - # Extract dtype (data type) from the binary data - binary_vector.dtype = extract_dtype_from_binary() - - # Extract padding from the binary data - binary_vector.padding = extract_padding_from_binary() - - # Unpack the actual numeric values from the binary data according to the dtype - binary_vector.data = unpack_numeric_values(binary_vector.dtype) - - Return binary_vector -End Function -``` - -#### Validation - -Drivers MUST validate vector metadata and raise an error if any invariant is violated: - -- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within \[0, 7\] for PACKED_BIT. -- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\].
- -Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking -binary data (BSON or similar) into a Vector structure. - -#### Data Structures - -Drivers MAY find the following structures to represent the dtype and vector structure useful. - -``` -Enum Dtype - # Enum for data types (dtype) +| Type code | Description | +| --------- | ----------------------------------------------- | +| 0 | Signed integer, two's complement representation | +| 1 | Unsigned integer | +| 2 | Floating point, IEEE 754 representation | +| 3 .. 15 | (reserved) | - # FLOAT32: Represents packing of list of floats as float32 - # Value: 0x27 (hexadecimal byte value) +| Size code | Bits per element | +| --------- | ------------------ | +| 0 | 1 | +| 1 | (reserved for 2) | +| 2 | (reserved for 4) | +| 3 | 8 | +| 4 | (reserved for 12) | +| 5 | (reserved for 16) | +| 6 | (reserved for 24) | +| 7 | 32 | +| 8 | (reserved for 48) | +| 9 | (reserved for 64) | +| 10 | (reserved for 96) | +| 11 | (reserved for 128) | +| 12 | (reserved for 192) | +| 13 | (reserved for 256) | +| 14 | (reserved for 384) | +| 15 | (reserved for 512) | - # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8 - # Value: 0x03 (hexadecimal byte value) +### Validity Criteria - # PACKED_BIT: Special case where vector values are 0 or 1, packed as unsigned uint8 in range [0, 255] - # Packed into groups of 8 (a byte) - # Value: 0x10 (hexadecimal byte value) - - # Documentation: - # Each value is a byte (length of one), a convenient choice for decoding. -End Enum +To be valid, a Vector MUST be 2 bytes long or longer. Its header MUST be one of the valid bit patterns above. In +particular, the second byte MUST be nonzero only as necessary to represent Padding values between 0 and 7 for +PACKED_BITS vectors. Vectors with no elements MUST have a Padding value of 0. 
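As a rough illustration of the header layout and the first validity rules above, the following Python sketch splits the two header bytes into the Type code, Size code, reserved bits, and Padding fields. The function name `parse_header`, the returned dictionary, and the use of `ValueError` are assumptions made for this example only; they are not part of the specification or of any driver API.

```python
# Hypothetical helper for illustration; names are not defined by the spec.
# Only the three first-byte values listed in the header table are accepted.
VALID_FIRST_BYTES = {0x03: "INT8", 0x27: "FLOAT32", 0x10: "PACKED_BITS"}
SIZE_CODE_BITS = {0: 1, 3: 8, 7: 32}  # non-reserved size codes only


def parse_header(data: bytes) -> dict:
    """Split the 2-byte Vector header into fields and validate it."""
    if len(data) < 2:
        raise ValueError("a Vector must be at least 2 bytes long")
    first, second = data[0], data[1]
    alias = VALID_FIRST_BYTES.get(first)
    if alias is None:
        raise ValueError(f"unsupported type/size byte 0x{first:02x}")
    type_code = first >> 4    # most significant half of the first byte
    size_code = first & 0x0F  # least significant half of the first byte
    reserved = second >> 3    # upper 5 bits of the second byte
    padding = second & 0x07   # lower 3 bits of the second byte
    if reserved != 0:
        raise ValueError("second header byte is outside 0x00..0x07")
    if padding != 0 and alias != "PACKED_BITS":
        raise ValueError("Padding must be 0 for this element type")
    if padding != 0 and len(data) == 2:
        raise ValueError("a Vector with no elements must have Padding 0")
    return {
        "alias": alias,
        "type_code": type_code,
        "size_code": size_code,
        "bits_per_element": SIZE_CODE_BITS[size_code],
        "padding": padding,
    }
```

For the header `0x10 0x04`, this sketch reports Type code 1 (unsigned integer), Size code 0 (1 bit per element), and a Padding of 4.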
-Struct Vector - # Numeric vector with metadata for binary interoperability +When Padding is nonzero, drivers SHOULD ensure the unused bits in the final byte are zero. - # Fields: - # data: Sequence of numeric values (either float or int) - # dtype: Data type of vector (from enum BinaryVectorDtype) - # padding: Number of bits to ignore in the final byte for alignment +The contents of individual elements MUST NOT be considered when checking the validity of a Vector. Unused bits in the +final byte are not considered part of any element. - data # Sequence of float or int - dtype # Type: DtypeEnum - padding # Integer: Number of padding bits - End Struct -``` +Vectors MUST NOT include any unnecessary trailing bytes. For example, FLOAT32 Vectors must include an exact multiple of +4 bytes after the 2-byte header. -## Reference Implementation +Drivers MUST validate Vector metadata when provided through the API. For example, if a PACKED_BIT Vector is constructed +from a byte array paired with a Padding value: -- PYTHON (PYTHON-4577) +- The driver MUST ensure Padding is zero if the byte array is empty +- The driver SHOULD ensure the unused bits in the final byte are zero +- If the API allows Padding values outside the valid range of 0..7 inclusve, these must be rejected at runtime. -## Test Plan +Drivers MUST validate Vector metadata when creating an API representation from a stored BSON Binary item. A PACKED_BIT +value would have its Padding and length validated as above. A FLOAT32 Vector would be rejected for a nonzero second +header byte, or a length that isn't 2 plus a multiple of 4. -See the [README](tests/README.md) for tests. +### Type Conversions + +Type conversion is an optional feature. + +Drivers may provide conversions between BSON Array and BSON Binary Vector representations. Drivers MUST only perform +this conversion as requested, not automatically. 
+ +#### Packing + +PACKED_BITS values MAY be optionally losslessly unpacked to a wider data type of the driver's choosing, for more +convenient access. Drivers MUST provide a way to access PACKED_BITS without unpacking. In languages with compile-time +abstraction, drivers SHOULD provide an abstract data type for manipulating elements in PACKED_BITS without unpacking. If +abstraction is not practical, drivers can instead provide direct access to the byte array and 'Padding' value. + +#### Integer Values + +INT8 and PACKED_BITS values may be losslessly represented as BSON int32 elements. + +When converting BSON int32 or int64 elements to INT8 or PACKED_BITS, out-of-range values MUST cause conversion to fail. + +There is no defined conversion from floating point to integer. Conversion from BSON double to an integer Vector MUST +fail. + +#### Floating Point Values + +There is no defined conversion from integer to floating point. Conversion from BSON int32 or int64 to a FLOAT32 Vector +MUST fail. -## FAQ +When converting BSON double elements to FLOAT32, the driver MUST round to the nearest representable values. -- What MongoDB Server version does this apply to? - - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. -- In PACKED_BIT, why would one choose to use integers in \[0, 256)? - - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits), This technique is - widely used across different fields, such as data compression, communication protocols, and file formats, where - you want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an - example in Python, see - [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits). +### Data Formats + +#### INT8 (`0x03 0x00`) + +Signed 1-byte integers in two's complement encoding, representing values from -128 to 127 inclusive. 
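The integer conversion rules and the INT8 format above could be combined roughly as follows. This is a minimal sketch under stated assumptions: Python `int` stands in for BSON int32/int64 and `float` for BSON double, and the function name `int8_vector_from_array` and the exception types are hypothetical, chosen only for this example.

```python
def int8_vector_from_array(values) -> bytes:
    """Encode integers as an INT8 Vector (header 0x03 0x00 plus one byte each)."""
    out = bytearray([0x03, 0x00])
    for value in values:
        # Floats are rejected because no float-to-integer conversion is
        # defined. Rejecting bool as well is this sketch's own choice,
        # since bool is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"cannot convert {type(value).__name__} to INT8")
        if not -128 <= value <= 127:
            raise ValueError(f"{value} is out of range for INT8")
        out.append(value & 0xFF)  # two's complement, one byte per element
    return bytes(out)
```

Under these assumptions, `int8_vector_from_array([-1, 0, 1])` yields the bytes `0x03 0x00 0xff 0x00 0x01`, while `[128]` and `[1.0]` both cause the conversion to fail.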
+ +#### FLOAT32 (`0x27 0x00`) + +Single-precision floating point values in the IEEE 754 `binary32` format. 4 bytes, least significant byte first. + +#### PACKED_BITS (`0x10 0x00` .. `0x10 0x07`) + +Integers 0 and 1 represented by individual bits packed into bytes, most significant bit first. + +Padding indicates how many of the least significant bits from the last byte do not encode any element. Drivers MUST +always set these non-encoding bits in the last byte to zero. Drivers SHOULD ensure these bits are zero when checking a +Vector for validity. Vectors with no data bytes MUST have a Padding of zero. + +Note that the bit order and byte order in this specification are opposite. Byte order is "little-endian" to match common +CPU architectures, whereas bit order is "big-endian" for left-to-right readability. + +Implementations may choose to implement accessors for packed bits using machine words larger than 8 bits for performance +reasons. If so, they MUST not impose any additional constraints on data length or alignment. 
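The PACKED_BITS layout described above (most significant bit first, Padding counted from the least significant end of the last byte) can be sketched in Python as shown here. The names `unpack_bits` and `pack_bits` are hypothetical, and rejecting nonzero unused bits on input is only a SHOULD in the text, so treat that check as one possible policy rather than required behavior.

```python
def unpack_bits(vector: bytes) -> list[int]:
    """Expand a PACKED_BITS Vector into a list of 0/1 integers."""
    if len(vector) < 2 or vector[0] != 0x10:
        raise ValueError("not a PACKED_BITS Vector")
    padding, data = vector[1], vector[2:]
    if padding > 7:
        raise ValueError("Padding must be in the range 0..7")
    if padding and not data:
        raise ValueError("a Vector with no data bytes must have Padding 0")
    if padding and data[-1] & ((1 << padding) - 1):
        raise ValueError("unused bits in the final byte are not zero")
    # Most significant bit of each byte comes first.
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return bits[: len(bits) - padding]


def pack_bits(bits) -> bytes:
    """Build a PACKED_BITS Vector, zero-filling the unused trailing bits."""
    bits = list(bits)
    if not all(isinstance(b, int) and b in (0, 1) for b in bits):
        raise ValueError("PACKED_BITS elements must be 0 or 1")
    padding = -len(bits) % 8  # bits needed to reach a whole byte
    padded = bits + [0] * padding
    data = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for b in padded[i : i + 8]:
            byte = (byte << 1) | b
        data.append(byte)
    return bytes([0x10, padding]) + bytes(data)
```

Because `pack_bits` always writes zeros into the unused bits, two Vectors holding the same elements encode to identical bytes, which is what byte-wise equality comparison relies on.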
+ +### Examples + +- `0x10 0x04 0xee 0xe0` + + - Header: PACKED_BITS, Padding=4 + - Data bytes: `0xee 0xe0` + - The same bytes in binary, most-significant bit first: `1110 1110 1110 0000` + - Discarding Padding (4) bits from the end, which SHOULD be zero: `1110 1110 1110` + - Unpacked representation, 12 elements: `[1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]` + +- `0x10 0x07 0x80` + + - Header: PACKED_BITS, Padding=7 + - Data byte: `0x80` + - Unpacked representation, 1 element: `[1]` + +- `0x10 0x00 0xf0 0x42` + + - Header: PACKED_BITS, Padding=0 + - Data bytes: `0xf0 0x42` + - The same bytes in binary, most-significant bit first: `1111 0000 0100 0010` + - Unpacked representation, 16 elements: `[1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]` + +- `0x03 0x00 0xff 0x00 0x01` + + - Header bytes: INT8 + - Data bytes: `0xff 0x00 0x01` + - Unpacked representation, 3 elements: `[-1, 0, 1]` + +- `0x27 0x00 0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` + + - Header: FLOAT32 + - Data bytes: `0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` + - The same bytes as two 32-bit words, least significant byte first: `0x3f800000 0x7f801234` + - The same 32-bit words interpreted as IEEE 754 `binary32`: `1.0 NaN(0x001234)` + - Unpacked representation, 2 elements: `[1.0, NaN]` + +## Test Plan + +See the [README](tests/README.md) for tests. From f24548656c0f269c4fb318919caac602ba95d64b Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 10:11:18 -0800 Subject: [PATCH 2/7] Changelog --- source/bson-binary-vector/bson-binary-vector.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 6e5c65e636..71bcf87e34 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -253,3 +253,7 @@ reasons. If so, they MUST not impose any additional constraints on data length o ## Test Plan See the [README](tests/README.md) for tests. 
+ +## Changelog + +- 2025-02-05: Text clarifications, no technical change. From a8788dc5df41ebaf80fe1ac0782ce93fbea6f183 Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 10:31:34 -0800 Subject: [PATCH 3/7] Additional editing. Cleaned up repeats, refine validity criteria --- .../bson-binary-vector/bson-binary-vector.md | 42 +++++++++++-------- 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 71bcf87e34..3bfa670cf4 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -61,10 +61,10 @@ an alternative collection type with improved performance and limited complexity. padding for alignment is not supported, as this would impact comparison stability. - Vectors do not include any data compression features. Applications may see benefit from careful choice of an external compression algorithm. -- Vectors do not provide any new comparison methods. Identical Vector values must compare as identical encoded BSON - Binary byte strings. Vectors are never equal to Arrays, even when they represent the same numeric elements. -- Vectors do not guarantee that element types defined in the future will always be scalar numbers, only that Vector - elements always have identical type and size. +- Vectors do not provide any new comparison methods beyond byte-equality. Vectors are never equal to Arrays, even when + they represent the same numeric elements. Vectors of different element types are not comparable. +- Vectors do not guarantee that element types defined in the future will always be scalar numbers, only that elements of + a Vector always have identical type and size. ## Specification @@ -73,8 +73,6 @@ an alternative collection type with improved performance and limited complexity. - This specification defines the meaning of the data bytes in BSON Binary items of subtype `9`. 
- The first two data bytes form a header, with meaning defined here. - This specification defines validity criteria for accepting or rejecting byte strings. -- Drivers may optionally implement conversions between BSON Array and Vector types. This specification defines rules - that must be followed when conversions are implemented. - This specification includes JSON tests with valid documents, invalid documents, and expected conversion results. - Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across @@ -105,6 +103,8 @@ Drivers MAY choose to interpret the header bytes as a structure with internal fi | 5 bits | Second byte, most significant part | (reserved) | | 3 bits | Second byte, least significant part | Padding | +Reserved bits MUST be zero. + The generic interpretation of Padding refers to the number of items that should be ignored from what would have been the end of the Vector, regardless of item size and bit order. @@ -134,30 +134,36 @@ end of the Vector, regardless of item size and bit order. | 14 | (reserved for 384) | | 15 | (reserved for 512) | +Reserved type and size codes MUST NOT be used. + ### Validity Criteria To be valid, a Vector MUST be 2 bytes long or longer. Its header MUST be one of the valid bit patterns above. In -particular, the second byte MUST be nonzero only as necessary to represent Padding values between 0 and 7 for +particular, the second byte MUST be nonzero only as necessary to represent Padding values between 0 and 7 in non-empty PACKED_BITS vectors. Vectors with no elements MUST have a Padding value of 0. -When Padding is nonzero, drivers SHOULD ensure the unused bits in the final byte are zero. +Drivers MUST reject Vectors with invalid header bytes. + +Drivers SHOULD reject Vectors with any unused bits in the final byte set to `1`. 
+ +Drivers SHOULD reject Vectors with unnecessary trailing bytes. + +Drivers MUST NOT generate Vectors with unnecessary trailing bytes or with unused bits in the final byte set to `1`. The contents of individual elements MUST NOT be considered when checking the validity of a Vector. Unused bits in the final byte are not considered part of any element. -Vectors MUST NOT include any unnecessary trailing bytes. For example, FLOAT32 Vectors must include an exact multiple of -4 bytes after the 2-byte header. - -Drivers MUST validate Vector metadata when provided through the API. For example, if a PACKED_BIT Vector is constructed -from a byte array paired with a Padding value: +Drivers MUST validate Vector metadata when provided through the API, to avoid generating invalid Vector byte strings. +For example, if a PACKED_BIT Vector is constructed from a byte array paired with a Padding value: - The driver MUST ensure Padding is zero if the byte array is empty -- The driver SHOULD ensure the unused bits in the final byte are zero -- If the API allows Padding values outside the valid range of 0..7 inclusve, these must be rejected at runtime. +- The driver MUST ensure the unused bits in the final byte are zero +- If the API allows Padding values outside the valid range of 0..7 inclusive, these MUST be rejected at runtime. -Drivers MUST validate Vector metadata when creating an API representation from a stored BSON Binary item. A PACKED_BIT -value would have its Padding and length validated as above. A FLOAT32 Vector would be rejected for a nonzero second -header byte, or a length that isn't 2 plus a multiple of 4. +Drivers MUST validate Vector byte strings when creating an API representation from a stored BSON Binary item. A +PACKED_BIT value would have its Padding and length validated as above, and SHOULD have its unused bits checked for zero. 
+A FLOAT32 Vector MUST be rejected for a nonzero second header byte, and it SHOULD be rejected for a length that isn't 2 +plus a multiple of 4. ### Type Conversions From 20257035f59d99ed6a0707ea1f5ab22c2d17714c Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 10:38:00 -0800 Subject: [PATCH 4/7] Strict output even if input is looser for compatibility --- source/bson-binary-vector/bson-binary-vector.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 3bfa670cf4..48b3232c99 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -153,8 +153,9 @@ Drivers MUST NOT generate Vectors with unnecessary trailing bytes or with unused The contents of individual elements MUST NOT be considered when checking the validity of a Vector. Unused bits in the final byte are not considered part of any element. -Drivers MUST validate Vector metadata when provided through the API, to avoid generating invalid Vector byte strings. -For example, if a PACKED_BIT Vector is constructed from a byte array paired with a Padding value: +Drivers MUST validate Vector metadata when provided through the API, to avoid generating byte strings that any +conforming implementation would consider invalid. 
For example, if a PACKED_BIT Vector is constructed from a byte array +paired with a Padding value: - The driver MUST ensure Padding is zero if the byte array is empty - The driver MUST ensure the unused bits in the final byte are zero From db01d98b6436c17c51c401bb08246cb1bdc72636 Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 11:17:00 -0800 Subject: [PATCH 5/7] Use "unpacked" carefully, and add extjson NaN --- source/bson-binary-vector/bson-binary-vector.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 48b3232c99..7d31501e5b 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -247,7 +247,7 @@ reasons. If so, they MUST not impose any additional constraints on data length o - Header bytes: INT8 - Data bytes: `0xff 0x00 0x01` - - Unpacked representation, 3 elements: `[-1, 0, 1]` + - Integer elements: `[-1, 0, 1]` - `0x27 0x00 0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` @@ -255,7 +255,8 @@ reasons. 
If so, they MUST not impose any additional constraints on data length o - Data bytes: `0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` - The same bytes as two 32-bit words, least significant byte first: `0x3f800000 0x7f801234` - The same 32-bit words interpreted as IEEE 754 `binary32`: `1.0 NaN(0x001234)` - - Unpacked representation, 2 elements: `[1.0, NaN]` + - Floating point elements: `[1.0, NaN]` + - Converted to Array, represented as Relaxed Extended JSON: `[1.0, {"$numberDouble": "NaN"}]` ## Test Plan From 1e444248a4c89ccf64024b7e018a2c852f61dd29 Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Wed, 5 Feb 2025 11:26:07 -0800 Subject: [PATCH 6/7] Clarify 'trailing' --- source/bson-binary-vector/bson-binary-vector.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 7d31501e5b..d979226e48 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -146,9 +146,10 @@ Drivers MUST reject Vectors with invalid header bytes. Drivers SHOULD reject Vectors with any unused bits in the final byte set to `1`. -Drivers SHOULD reject Vectors with unnecessary trailing bytes. +Drivers SHOULD reject Vectors with extra bytes after the last complete multi-byte element. -Drivers MUST NOT generate Vectors with unnecessary trailing bytes or with unused bits in the final byte set to `1`. +Drivers MUST NOT generate Vectors with extra bytes after the last complete element, or with unused bits in the final +byte set to `1`. The contents of individual elements MUST NOT be considered when checking the validity of a Vector. Unused bits in the final byte are not considered part of any element. 
From 6e87112f3da4ceb90580ebc029e34776ccace457 Mon Sep 17 00:00:00 2001 From: Micah Scott Date: Fri, 7 Feb 2025 07:09:26 -0800 Subject: [PATCH 7/7] Consistency, header format in examples --- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index d979226e48..d2c30abb41 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -246,7 +246,7 @@ reasons. If so, they MUST not impose any additional constraints on data length o - `0x03 0x00 0xff 0x00 0x01` - - Header bytes: INT8 + - Header: INT8 - Data bytes: `0xff 0x00 0x01` - Integer elements: `[-1, 0, 1]`
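The INT8 and FLOAT32 examples touched by this series can be checked with a short Python sketch built on the standard `struct` module. The function names `decode_int8` and `decode_float32` are hypothetical, and rejecting a partial trailing element is only a SHOULD in the later revisions of the text, so this shows one reasonable reading rather than a reference implementation.

```python
import math
import struct


def decode_int8(vector: bytes) -> list[int]:
    """Decode an INT8 Vector (header 0x03 0x00) into signed integers."""
    if vector[:2] != b"\x03\x00":
        raise ValueError("not a valid INT8 Vector header")
    payload = vector[2:]
    # 'b' is a signed 8-bit integer, so 0xff decodes to -1.
    return list(struct.unpack(f"<{len(payload)}b", payload))


def decode_float32(vector: bytes) -> list[float]:
    """Decode a FLOAT32 Vector (header 0x27 0x00) into Python floats."""
    if vector[:2] != b"\x27\x00":
        raise ValueError("not a valid FLOAT32 Vector header")
    payload = vector[2:]
    if len(payload) % 4:
        raise ValueError("extra bytes after the last complete element")
    # '<f' is IEEE 754 binary32, least significant byte first.
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


ints = decode_int8(bytes.fromhex("0300ff0001"))
floats = decode_float32(bytes.fromhex("27000000803f3412807f"))
print(ints)                                # [-1, 0, 1]
print(floats[0], math.isnan(floats[1]))    # 1.0 True
```

Python widens each element to a double when unpacking, so the NaN payload `0x001234` from the example is not guaranteed to survive; only the fact that the value is NaN is checked here.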