diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index d46f6681a9..445fd12ed8 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -7,246 +7,266 @@ ______________________________________________________________________ ## Abstract -This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors -here refer to densely packed arrays of numbers, all of the same type. +This document describes a new *Vector* subtype (9) for BSON Binary items, used to compactly represent ordered +collections of uniformly-typed elements. A framework is presented for future type extensibility, but adoption complexity +is limited by allowing support for only a restricted set of element types at first: -## Motivation +- 1-bit unsigned integers +- 8-bit signed integers +- 32-bit floating point -These representations correspond to the numeric types supported by popular numerical libraries for vector processing, -such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed -format used by these libraries can result in significant memory savings and processing efficiency. - -### META +## Meta The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). +Hexadecimal values are shown here with a `0x` prefix. + +Bit strings are grouped with insignificant whitespace for readability. + +## Terms + +*BSON Array* - Arrays are a fundamental container type in BSON for ordered sequences, implemented as item type `4`. Each +element can have an arbitrary data type. The encoding is relatively high-overhead, due to both the non-uniform types and +the required element name strings. 
+ +*BSON Binary* - BSON Binary items (type `5`) are a container for a variable-length byte sequence with extensible +interpretation, according to an 8-bit *subtype*. + +*BSON Binary Vector* - A BSON Binary item of subtype `9`. Also referred to here as a Vector. + +## Motivation for Change + +BSON does not on its own provide a densely packed encoding for numeric data of uniform data type. Numbers stored in a +BSON Array have high space overhead, owing to the item name and type included with each value. This specification offers +an alternative collection type with improved performance and limited complexity. + +### Goals + +- Vectors provide improved resource efficiency compared to BSON Arrays. +- Every Vector is guaranteed to represent a sequence of elements with uniform type and size. +- Vectors may be reliably compared for equality by comparing their encoded BSON Binary representation. +- Implementation complexity should be minimal. + +### Non-Goals + +- No changes to Extended JSON representation are defined. Vectors will serialize to generic Binary items with base64 + encoding: `{"$binary": {"base64": ... , "subType": "9" }}`. +- The Vector is a 1-dimensional container. Applications may implement multi-dimensional arrays efficiently by bundling a + Vector with additional metadata, but this usage is not standardized here. +- Comprehensive support for all possible data types and bit/byte ordering is not a goal. This specification prefers to + reduce complexity by limiting the set of allowed types and providing no unnecessary data formatting options. +- Vectors within a BSON document are NOT designed for "zero copy" access by direct architecture-specific load or store. + Typically multi-byte values will not be aligned as required, and they may need byte order conversion. Internal + padding for alignment is not supported, as this would impact comparison stability. +- Vectors do not include any data compression features. 
Applications may see benefit from careful choice of an external + compression algorithm. +- Vectors do not provide any new comparison methods beyond byte-equality. Vectors are never equal to Arrays, even when + they represent the same numeric elements. Vectors of different element types are not comparable. +- Vectors do not guarantee that element types defined in the future will always be scalar numbers, only that elements of + a Vector always have identical type and size. + ## Specification -This specification introduces a new BSON binary subtype, the vector, with value `9`. +### Scope + +- This specification defines the meaning of the data bytes in BSON Binary items of subtype `9`. +- The first two data bytes form a header, with meaning defined here. +- This specification defines validity criteria for accepting or rejecting byte strings. +- This specification includes JSON tests with valid documents, invalid documents, and expected conversion results. +- Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language + types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across + languages. +- Drivers MAY provide facilities for converting between BSON Binary Vector and BSON Array representations. When they + choose to do so, they MUST ensure compliance using the provided tests. Drivers MUST NOT automatically convert + between representations. -Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. - -### Data Types (dtypes) +### Header Format -Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. 
- -| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | -| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | -| `0x03` | INT8 | 8 | INT8 | -| `0x27` | FLOAT32 | 32 | FLOAT | -| `0x10` | PACKED_BIT | 1 `*` | BOOL | - -`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of -integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector -`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course, -some languages, Python for one, do not have an uint8 type, so must be represented as an int in memory, but not on disk. - -### Byte padding - -As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of -bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the -final byte that are to be ignored. The least-significant bits are ignored. - -### Binary structure - -Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers. - -- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may - increase. dtype is an unsigned integer. - -- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative - integer. It must be present, even in cases where it is not applicable, and set to zero. - -- The remainder contains the actual vector elements packed according to dtype. - -All values use the little-endian format. - -#### Example - -Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`. 
- -In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8. - -We can visualize the binary representation like so: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 1st byte: dtype (from list in previous table) | 2nd byte: padding (values in [0,7]) | 1st uint8: 238 | 2nd uint8: 224 |
| --- | --- | --- | --- |
| 00010000 | 00000100 | 11101110 | 11100000 |
+Every valid Vector begins with one of the following 2-byte header patterns: -Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this! +| Header bytes | Alias | Description | +| ------------ | ----------- | ------------------------------------------------------------------------------- | +| `0x03 0x00` | INT8 | signed bytes | +| `0x27 0x00` | FLOAT32 | single precision (32-bit) floating point, least significant byte first | +| `0x10 0x00` | PACKED_BITS | single-bit integers, most significant bit first, exact multiple of 8 bits total | +| `0x10 0x01` | PACKED_BITS | as above, final 1 bit ignored | +| `0x10` ... | PACKED_BITS | ... | +| `0x10 0x07` | PACKED_BITS | as above, final 7 bits ignored | -| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | - -## API Guidance +Drivers MAY choose to interpret the header bytes as a structure with internal fields: -Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while -following idioms of the language of the driver. - -### Encoding - -``` -Function from_vector(vector: Iterable, dtype: DtypeEnum, padding: Integer = 0) -> Binary - # Converts a numeric vector into a binary representation based on the specified dtype and padding. 
- - # :param vector: A sequence or iterable of numbers (either float or int) - # :param dtype: Data type for binary conversion (from DtypeEnum) - # :param padding: Optional integer specifying how many bits to ignore in the final byte - # :return: A binary representation of the vector - - Declare binary_data as Binary - - # Process each number in vector and convert according to dtype - For each number in vector - binary_element = convert_to_binary(number, dtype) - binary_data.append(binary_element) - End For - - # Apply padding to the binary data if needed - If padding > 0 - apply_padding(binary_data, padding) - End If - - Return binary_data -End Function -``` - -Note: If a driver chooses to implement a `Vector` type (or numerous) like that suggested in the Data Structure -subsection below, they MAY decide that `from_vector` that has a single argument, a Vector. - -### Decoding - -``` -Function as_vector() -> Vector - # Unpacks binary data (BSON or similar) into a Vector structure. - # This process involves extracting numeric values, the data type, and padding information. - - # :return: A BinaryVector containing the unpacked numeric values, dtype, and padding. - - Declare binary_vector as BinaryVector # Struct to hold the unpacked data - - # Extract dtype (data type) from the binary data - binary_vector.dtype = extract_dtype_from_binary() - - # Extract padding from the binary data - binary_vector.padding = extract_padding_from_binary() - - # Unpack the actual numeric values from the binary data according to the dtype - binary_vector.data = unpack_numeric_values(binary_vector.dtype) - - Return binary_vector -End Function -``` - -#### Validation - -Drivers MUST validate vector metadata and raise an error if any invariant is violated: - -- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within \[0, 7\] for PACKED_BIT. -- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\]. 
-- When unpacking binary data into a FLOAT32 Vector structure, the length of the binary data following the dtype and - padding MUST be a multiple of 4 bytes. - -Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking -binary data (BSON or similar) into a Vector structure. - -#### Data Structures - -Drivers MAY find the following structures to represent the dtype and vector structure useful. - -``` -Enum Dtype - # Enum for data types (dtype) +| Size | Location | Description | +| ------ | ----------------------------------- | ----------- | +| 4 bits | First byte, most significant half | Type code | +| 4 bits | First byte, least significant half | Size code | +| 5 bits | Second byte, most significant part | (reserved) | +| 3 bits | Second byte, least significant part | Padding | - # FLOAT32: Represents packing of list of floats as float32 - # Value: 0x27 (hexadecimal byte value) +Reserved bits MUST be zero. - # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8 - # Value: 0x03 (hexadecimal byte value) +The generic interpretation of Padding refers to the number of items that should be ignored from what would have been the +end of the Vector, regardless of item size and bit order. - # PACKED_BIT: Special case where vector values are 0 or 1, packed as unsigned uint8 in range [0, 255] - # Packed into groups of 8 (a byte) - # Value: 0x10 (hexadecimal byte value) - - # Documentation: - # Each value is a byte (length of one), a convenient choice for decoding. -End Enum +| Type code | Description | +| --------- | ----------------------------------------------- | +| 0 | Signed integer, two's complement representation | +| 1 | Unsigned integer | +| 2 | Floating point, IEEE 754 representation | +| 3 .. 
15 | (reserved) | -Struct Vector - # Numeric vector with metadata for binary interoperability +| Size code | Bits per element | +| --------- | ------------------ | +| 0 | 1 | +| 1 | (reserved for 2) | +| 2 | (reserved for 4) | +| 3 | 8 | +| 4 | (reserved for 12) | +| 5 | (reserved for 16) | +| 6 | (reserved for 24) | +| 7 | 32 | +| 8 | (reserved for 48) | +| 9 | (reserved for 64) | +| 10 | (reserved for 96) | +| 11 | (reserved for 128) | +| 12 | (reserved for 192) | +| 13 | (reserved for 256) | +| 14 | (reserved for 384) | +| 15 | (reserved for 512) | - # Fields: - # data: Sequence of numeric values (either float or int) - # dtype: Data type of vector (from enum BinaryVectorDtype) - # padding: Number of bits to ignore in the final byte for alignment +Reserved type and size codes MUST NOT be used. - data # Sequence of float or int - dtype # Type: DtypeEnum - padding # Integer: Number of padding bits - End Struct -``` +### Validity Criteria -## Reference Implementation +To be valid, a Vector MUST be 2 bytes long or longer. Its header MUST be one of the valid bit patterns above. In +particular, the second byte MUST be nonzero only as necessary to represent Padding values between 0 and 7 in non-empty +PACKED_BITS vectors. Vectors with no elements MUST have a Padding value of 0. -- PYTHON (PYTHON-4577) +Drivers MUST reject Vectors with invalid header bytes. -## Test Plan +Drivers SHOULD reject Vectors with any unused bits in the final byte set to `1`. -See the [README](tests/README.md) for tests. +Drivers SHOULD reject Vectors with extra bytes after the last complete multi-byte element. + +Drivers MUST NOT generate Vectors with extra bytes after the last complete element, or with unused bits in the final +byte set to `1`. + +The contents of individual elements MUST NOT be considered when checking the validity of a Vector. Unused bits in the +final byte are not considered part of any element. 
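
The validity criteria above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `validate_vector` and the `strict` flag are hypothetical and not part of this specification or of any driver API. The MUST-level checks always run; `strict=True` additionally applies the SHOULD-level checks (unused bits set, trailing partial FLOAT32 element).

```python
def validate_vector(data: bytes, strict: bool = True) -> None:
    """Raise ValueError if `data` is not an acceptable subtype 9 payload.

    Hypothetical helper: `data` is the Binary payload, header included.
    """
    if len(data) < 2:
        raise ValueError("Vector must be at least 2 bytes long")

    dtype, padding = data[0], data[1]
    body_len = len(data) - 2

    if dtype in (0x03, 0x27):  # INT8 or FLOAT32
        if padding != 0:
            raise ValueError("second header byte must be zero for this type")
        # SHOULD-level check: no bytes after the last complete element.
        if dtype == 0x27 and strict and body_len % 4 != 0:
            raise ValueError("FLOAT32 data length is not a multiple of 4")
    elif dtype == 0x10:  # PACKED_BITS
        if padding > 7:
            raise ValueError("Padding must be in the range 0..7")
        if body_len == 0 and padding != 0:
            raise ValueError("empty Vector must have Padding of zero")
        # SHOULD-level check: unused low bits of the final byte are zero.
        if strict and padding and data[-1] & ((1 << padding) - 1):
            raise ValueError("unused bits in the final byte are not zero")
    else:
        raise ValueError("unsupported or reserved header byte")
```

Element contents are never inspected, in line with the rule that individual elements do not affect validity.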
+ +Drivers MUST validate Vector metadata when provided through the API, to avoid generating byte strings that any +conforming implementation would consider invalid. For example, if a PACKED_BITS Vector is constructed from a byte array +paired with a Padding value: + +- The driver MUST ensure Padding is zero if the byte array is empty. +- The driver MUST ensure the unused bits in the final byte are zero. +- If the API allows Padding values outside the valid range of 0..7 inclusive, these MUST be rejected at runtime. + +Drivers MUST validate Vector byte strings when creating an API representation from a stored BSON Binary item. A +PACKED_BITS value would have its Padding and length validated as above, and SHOULD have its unused bits checked for zero. +A FLOAT32 Vector MUST be rejected for a nonzero second header byte, and it SHOULD be rejected for a length that isn't 2 +plus a multiple of 4. + +### Type Conversions + +Type conversion is an optional feature. + +Drivers may provide conversions between BSON Array and BSON Binary Vector representations. Drivers MUST only perform +this conversion as requested, not automatically. + +#### Packing + +PACKED_BITS values MAY be losslessly unpacked to a wider data type of the driver's choosing, for more +convenient access. Drivers MUST provide a way to access PACKED_BITS without unpacking. In languages with compile-time +abstraction, drivers SHOULD provide an abstract data type for manipulating elements in PACKED_BITS without unpacking. If +abstraction is not practical, drivers can instead provide direct access to the byte array and 'Padding' value. + +#### Integer Values + +INT8 and PACKED_BITS values may be losslessly represented as BSON int32 elements. + +When converting BSON int32 or int64 elements to INT8 or PACKED_BITS, out-of-range values MUST cause conversion to fail. -## FAQ +There is no defined conversion from floating point to integer. Conversion from BSON double to an integer Vector MUST +fail. 
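
As a rough sketch of the integer conversion rules, the following Python function builds an INT8 Vector payload from a list. It rests on assumptions: the name `ints_to_int8_vector` is hypothetical, and Python `int` stands in for BSON int32/int64 while `float` stands in for BSON double.

```python
def ints_to_int8_vector(values) -> bytes:
    """Encode integers as an INT8 Vector payload (header 0x03 0x00).

    Hypothetical helper, not a driver API.
    """
    out = bytearray([0x03, 0x00])
    for v in values:
        # No conversion from floating point to integer is defined, so
        # anything that is not a plain integer fails. bool is excluded
        # explicitly because it is a subclass of int in Python.
        if isinstance(v, bool) or not isinstance(v, int):
            raise TypeError(f"cannot convert {type(v).__name__} to INT8")
        # Out-of-range values cause the conversion to fail.
        if not -128 <= v <= 127:
            raise ValueError(f"{v} is out of range for INT8")
        out += v.to_bytes(1, "little", signed=True)
    return bytes(out)
```

For instance, `ints_to_int8_vector([-1, 0, 1])` produces the bytes `0x03 0x00 0xff 0x00 0x01`, whereas `[1.0]` or `[128]` raise an error rather than being rounded or truncated.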
-- What MongoDB Server version does this apply to? - - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. -- In PACKED_BIT, why would one choose to use integers in \[0, 256)? - - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits), This technique is - widely used across different fields, such as data compression, communication protocols, and file formats, where - you want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an - example in Python, see - [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits). +#### Floating Point Values + +There is no defined conversion from integer to floating point. Conversion from BSON int32 or int64 to a FLOAT32 Vector +MUST fail. + +When converting BSON double elements to FLOAT32, the driver MUST round to the nearest representable values. + +### Data Formats + +#### INT8 (`0x03 0x00`) + +Signed 1-byte integers in two's complement encoding, representing values from -128 to 127 inclusive. + +#### FLOAT32 (`0x27 0x00`) + +Single-precision floating point values in the IEEE 754 `binary32` format. 4 bytes, least significant byte first. + +#### PACKED_BITS (`0x10 0x00` .. `0x10 0x07`) + +Integers 0 and 1 represented by individual bits packed into bytes, most significant bit first. + +Padding indicates how many of the least significant bits from the last byte do not encode any element. Drivers MUST +always set these non-encoding bits in the last byte to zero. Drivers SHOULD ensure these bits are zero when checking a +Vector for validity. Vectors with no data bytes MUST have a Padding of zero. + +Note that the bit order and byte order in this specification are opposite. Byte order is "little-endian" to match common +CPU architectures, whereas bit order is "big-endian" for left-to-right readability. 
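
The most-significant-bit-first packing described here can be sketched in Python as follows. The function names `unpack_bits` and `pack_bits` are hypothetical and chosen only for illustration. Both operate on the data bytes and the Padding value separately from the 2-byte header.

```python
def unpack_bits(data_bytes: bytes, padding: int) -> list[int]:
    """Expand packed data bytes into a list of 0/1 integers."""
    # Walk each byte from its most significant bit (shift 7) down to bit 0.
    bits = [(byte >> shift) & 1
            for byte in data_bytes
            for shift in range(7, -1, -1)]
    # Drop the `padding` least significant bits of the final byte.
    return bits[: len(bits) - padding]


def pack_bits(bits: list[int]) -> tuple[bytes, int]:
    """Pack 0/1 integers into bytes; returns (data_bytes, padding)."""
    padding = -len(bits) % 8  # unused bits in the final byte
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit not in (0, 1):
            raise ValueError("PACKED_BITS elements must be 0 or 1")
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    # Unused bits stay zero because the buffer starts zero-filled.
    return bytes(out), padding
```

With the data bytes `0xee 0xe0` and a Padding of 4, `unpack_bits` yields the 12 elements `[1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]`, and `pack_bits([1])` yields the single byte `0x80` with a Padding of 7.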
+ +Implementations may choose to implement accessors for packed bits using machine words larger than 8 bits for performance +reasons. If so, they MUST NOT impose any additional constraints on data length or alignment. + +### Examples + +- `0x10 0x04 0xee 0xe0` + + - Header: PACKED_BITS, Padding=4 + - Data bytes: `0xee 0xe0` + - The same bytes in binary, most-significant bit first: `1110 1110 1110 0000` + - Discarding Padding (4) bits from the end, which SHOULD be zero: `1110 1110 1110` + - Unpacked representation, 12 elements: `[1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]` + +- `0x10 0x07 0x80` + + - Header: PACKED_BITS, Padding=7 + - Data byte: `0x80` + - Unpacked representation, 1 element: `[1]` + +- `0x10 0x00 0xf0 0x42` + + - Header: PACKED_BITS, Padding=0 + - Data bytes: `0xf0 0x42` + - The same bytes in binary, most-significant bit first: `1111 0000 0100 0010` + - Unpacked representation, 16 elements: `[1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]` + +- `0x03 0x00 0xff 0x00 0x01` + + - Header: INT8 + - Data bytes: `0xff 0x00 0x01` + - Integer elements: `[-1, 0, 1]` + +- `0x27 0x00 0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` + + - Header: FLOAT32 + - Data bytes: `0x00 0x00 0x80 0x3f 0x34 0x12 0x80 0x7f` + - The same bytes as two 32-bit words, least significant byte first: `0x3f800000 0x7f801234` + - The same 32-bit words interpreted as IEEE 754 `binary32`: `1.0 NaN(0x001234)` + - Floating point elements: `[1.0, NaN]` + - Converted to Array, represented as Relaxed Extended JSON: `[1.0, {"$numberDouble": "NaN"}]` + +## Test Plan + +See the [README](tests/README.md) for tests. ## Changelog +- 2025-02-05: Text clarifications, no technical change. + - 2025-02-04: Update validation for decoding into a FLOAT32 vector. - 2024-11-01: BSON Binary Subtype 9 accepted DRIVERS-2926 (#1708)