diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md new file mode 100644 index 0000000000..0d08ff6093 --- /dev/null +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -0,0 +1,244 @@ +# BSON Binary Subtype 9 - Vector + +- Status: Pending +- Minimum Server Version: N/A + +______________________________________________________________________ + +## Abstract + +This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors +here refer to densely packed arrays of numbers, all of the same type. + +## Motivation + +These representations correspond to the numeric types supported by popular numerical libraries for vector processing, +such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed +format used by these libraries can result in significant memory savings and processing efficiency. + +### META + +The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and +"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). + +## Specification + +This specification introduces a new BSON binary subtype, the vector, with value `9`. + +Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. + +### Data Types (dtypes) + +Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. 
+
+| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) |
+| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- |
+| `0x03` | INT8 | 8 | INT8 |
+| `0x27` | FLOAT32 | 32 | FLOAT |
+| `0x10` | PACKED_BIT | 1 `*` | BOOL |
+
+`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of
+integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector
+`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Some
+languages, Python for one, do not have a uint8 type, so each value must be represented as an int in memory, though it
+still occupies a single byte on disk.
+
+### Byte padding
+
+As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of
+bytes, a second piece of metadata, the "padding", is included. It tells the driver how many bits in the final byte are
+to be ignored. The least-significant bits are ignored.
+
+### Binary structure
+
+Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers.
+
+- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may
+  grow. dtype is an unsigned integer.
+
+- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative
+  integer. It must always be present; where padding is not applicable, it must be set to zero.
+
+- The remainder contains the actual vector elements packed according to dtype.
+
+All values use the little-endian format.
+
+#### Example
+
+Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`.
+
+In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8.
+
+We can visualize the binary representation like so:
+
+| 1st byte: dtype (from list in previous table) | 2nd byte: padding (values in [0,7]) | 1st uint8: 238 | 2nd uint8: 224 |
+| --- | --- | --- | --- |
+| `00010000` | `00000100` | `11101110` | `11100000` |
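
The layout above can be checked with a short Python sketch. It is illustrative only and is not the reference implementation: the names `Dtype`, `validate`, `from_vector`, `as_vector`, `unpack_bits` and the `_FORMATS` table are assumptions that loosely follow the pseudocode under API Guidance. The functions operate on the subtype 9 payload alone (dtype byte, padding byte, packed elements), not on a full BSON document.

```python
import struct
from enum import IntEnum


class Dtype(IntEnum):
    """dtype byte values taken from the table in this specification."""
    INT8 = 0x03
    FLOAT32 = 0x27
    PACKED_BIT = 0x10


# Hypothetical helper table: struct format character for one element of each dtype.
_FORMATS = {Dtype.INT8: "b", Dtype.FLOAT32: "f", Dtype.PACKED_BIT: "B"}


def validate(dtype, padding, n_elements):
    """Check the padding invariants listed in the Validation section."""
    if not 0 <= padding <= 7:
        raise ValueError("padding must be within [0, 7]")
    if dtype != Dtype.PACKED_BIT and padding != 0:
        raise ValueError("padding must be 0 unless dtype is PACKED_BIT")
    if padding and n_elements == 0:
        raise ValueError("an empty vector cannot have padding")


def from_vector(vector, dtype, padding=0):
    """Return the payload bytes: dtype byte, padding byte, little-endian elements."""
    vector = list(vector)
    validate(dtype, padding, len(vector))
    # struct.pack raises struct.error for out-of-range or non-integer int8/uint8 values.
    body = struct.pack(f"<{len(vector)}{_FORMATS[dtype]}", *vector)
    return bytes([dtype, padding]) + body


def as_vector(payload):
    """Split a payload back into (data, dtype, padding)."""
    if len(payload) < 2:
        raise ValueError("payload must contain the dtype and padding bytes")
    dtype, padding, body = Dtype(payload[0]), payload[1], payload[2:]
    size = struct.calcsize("<" + _FORMATS[dtype])
    if len(body) % size:
        raise ValueError("body length is not a multiple of the element size")
    data = list(struct.unpack(f"<{len(body) // size}{_FORMATS[dtype]}", body))
    validate(dtype, padding, len(data))
    return data, dtype, padding


def unpack_bits(data, padding):
    """Expand PACKED_BIT bytes into bits, dropping the ignored low bits of the last byte."""
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return bits[: len(bits) - padding]


payload = from_vector([238, 224], Dtype.PACKED_BIT, padding=4)
print(payload.hex())                # 1004eee0
data, dtype, padding = as_vector(payload)
print(unpack_bits(data, padding))   # [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
```

Validation runs on both paths, so an INT8 or FLOAT32 vector with non-zero padding, or an empty PACKED_BIT vector with padding, is rejected before any bytes are produced or returned.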
+
+Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this!
+
+| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+
+## API Guidance
+
+Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
+following idioms of the language of the driver.
+
+### Encoding
+
+```
+Function from_vector(vector: Iterable<Number>, dtype: Dtype, padding: Integer = 0) -> Binary
+    # Converts a numeric vector into a binary representation based on the specified dtype and padding.
+
+    # :param vector: A sequence or iterable of numbers (either float or int)
+    # :param dtype: Data type for binary conversion (from enum Dtype)
+    # :param padding: Optional integer specifying how many bits to ignore in the final byte
+    # :return: A binary representation of the vector
+
+    Declare binary_data as Binary
+
+    # Process each number in vector and convert according to dtype
+    For each number in vector
+        binary_element = convert_to_binary(number, dtype)
+        binary_data.append(binary_element)
+    End For
+
+    # Apply padding to the binary data if needed
+    If padding > 0
+        apply_padding(binary_data, padding)
+    End If
+
+    Return binary_data
+End Function
+```
+
+Note: If a driver chooses to implement a `Vector` type (or several) like the one suggested in the Data Structures
+subsection below, it MAY decide that `from_vector` takes a single argument, a Vector.
+
+### Decoding
+
+```
+Function as_vector() -> Vector
+    # Unpacks binary data (BSON or similar) into a Vector structure.
+    # This process involves extracting numeric values, the data type, and padding information.
+
+    # :return: A BinaryVector containing the unpacked numeric values, dtype, and padding.
+ + Declare binary_vector as BinaryVector # Struct to hold the unpacked data + + # Extract dtype (data type) from the binary data + binary_vector.dtype = extract_dtype_from_binary() + + # Extract padding from the binary data + binary_vector.padding = extract_padding_from_binary() + + # Unpack the actual numeric values from the binary data according to the dtype + binary_vector.data = unpack_numeric_values(binary_vector.dtype) + + Return binary_vector +End Function +``` + +#### Validation + +Drivers MUST validate vector metadata and raise an error if any invariant is violated: + +- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within \[0, 7\] for PACKED_BIT. +- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\]. + +Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking +binary data (BSON or similar) into a Vector structure. + +#### Data Structures + +Drivers MAY find the following structures to represent the dtype and vector structure useful. + +``` +Enum Dtype + # Enum for data types (dtype) + + # FLOAT32: Represents packing of list of floats as float32 + # Value: 0x27 (hexadecimal byte value) + + # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8 + # Value: 0x03 (hexadecimal byte value) + + # PACKED_BIT: Special case where vector values are 0 or 1, packed as unsigned uint8 in range [0, 255] + # Packed into groups of 8 (a byte) + # Value: 0x10 (hexadecimal byte value) + + # Documentation: + # Each value is a byte (length of one), a convenient choice for decoding. 
+End Enum
+
+Struct Vector
+    # Numeric vector with metadata for binary interoperability
+
+    # Fields:
+    #   data: Sequence of numeric values (either float or int)
+    #   dtype: Data type of vector (from enum Dtype)
+    #   padding: Number of bits to ignore in the final byte for alignment
+
+    data    # Sequence of float or int
+    dtype   # Type: Dtype
+    padding # Integer: Number of padding bits
+End Struct
+```
+
+## Reference Implementation
+
+- PYTHON (PYTHON-4577)
+
+## Test Plan
+
+See the [README](tests/README.md) for tests.
+
+## FAQ
+
+- What MongoDB Server version does this apply to?
+  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
+- In PACKED_BIT, why would one choose to use integers in \[0, 256)?
+  - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is
+    widely used across different fields, such as data compression, communication protocols, and file formats, where you
+    want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example
+    in Python, see
+    [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits).
diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md
new file mode 100644
index 0000000000..aa774cc022
--- /dev/null
+++ b/source/bson-binary-vector/tests/README.md
@@ -0,0 +1,58 @@
+# Testing Binary subtype 9: Vector
+
+The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to
+the specification.
+
+These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding.
+
+Additional tests exist in `bson-corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector
+to BSON.
+For this reason, drivers must create a bespoke test runner for the vector subtype.
+
+## Format
+
+The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases,
+under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the
+vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common,
+and specified at the top level.
+
+#### Top-level keys
+
+Each JSON file contains three top-level keys.
+
+- `description`: human-readable description of what is in the file
+- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test
+  case. Applies to *every* case.
+- `tests`: array of test case objects, each of which has the following keys. Valid cases will also contain additional
+  binary and json encoding values.
+
+#### Keys of individual test cases
+
+- `description`: string describing the test.
+- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
+- `vector`: list of numbers
+- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27")
+- `dtype_alias`: (optional) string defining the data type, perhaps as Enum.
+- `padding`: (optional) integer for byte padding. Defaults to 0.
+- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string.
+
+## Required tests
+
+#### To prove correct in a valid case (`valid: true`), one MUST
+
+- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
+  canonical_bson string.
+- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match
+  those provided in the JSON.
+
+Note: For floating point number types, exact numerical matches may not be possible.
Drivers that natively support the +floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), +MUST assert that the input float array is the same after encoding and decoding. + +#### To prove correct in an invalid case (`valid:false`), one MUST + +- raise an exception when attempting to encode a document from the numeric values, dtype, and padding. + +## FAQ + +- What MongoDB Server version does this apply to? + - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. diff --git a/source/bson-binary-vector/tests/float32.json b/source/bson-binary-vector/tests/float32.json new file mode 100644 index 0000000000..872c435323 --- /dev/null +++ b/source/bson-binary-vector/tests/float32.json @@ -0,0 +1,51 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector FLOAT32", + "valid": true, + "vector": [127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" + }, + { + "description": "Vector with decimals and negative value FLOAT32", + "valid": true, + "vector": [127.7, -7.7], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000" + }, + { + "description": "Empty Vector FLOAT32", + "valid": true, + "vector": [], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009270000" + }, + { + "description": "Infinity Vector FLOAT32", + "valid": true, + "vector": ["-inf", 0.0, "inf"], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00" + }, + { + "description": "FLOAT32 with padding", + "valid": false, + "vector": 
[127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 3 + } + ] +} + diff --git a/source/bson-binary-vector/tests/int8.json b/source/bson-binary-vector/tests/int8.json new file mode 100644 index 0000000000..7529721e5e --- /dev/null +++ b/source/bson-binary-vector/tests/int8.json @@ -0,0 +1,57 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype INT8", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector INT8", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000903007F0700" + }, + { + "description": "Empty Vector INT8", + "valid": true, + "vector": [], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009030000" + }, + { + "description": "Overflow Vector INT8", + "valid": false, + "vector": [128], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "Underflow Vector INT8", + "valid": false, + "vector": [-129], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "INT8 with padding", + "valid": false, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 3 + }, + { + "description": "INT8 with float inputs", + "valid": false, + "vector": [127.77, 7.77], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + } + ] +} + diff --git a/source/bson-binary-vector/tests/packed_bit.json b/source/bson-binary-vector/tests/packed_bit.json new file mode 100644 index 0000000000..035776e87f --- /dev/null +++ b/source/bson-binary-vector/tests/packed_bit.json @@ -0,0 +1,98 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT", + "test_key": "vector", + "tests": [ + { + "description": "Padding specified with no vector data PACKED_BIT", + "valid": false, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": 
"PACKED_BIT", + "padding": 1 + }, + { + "description": "Simple Vector PACKED_BIT", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000910007F0700" + }, + { + "description": "Empty Vector PACKED_BIT", + "valid": true, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009100000" + }, + { + "description": "PACKED_BIT with padding", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 3, + "canonical_bson": "1600000005766563746F7200040000000910037F0700" + }, + { + "description": "Overflow Vector PACKED_BIT", + "valid": false, + "vector": [256], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Underflow Vector PACKED_BIT", + "valid": false, + "vector": [-1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Vector with float values PACKED_BIT", + "valid": false, + "vector": [127.5], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Padding specified with no vector data PACKED_BIT", + "valid": false, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 1 + }, + { + "description": "Exceeding maximum padding PACKED_BIT", + "valid": false, + "vector": [1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 8 + }, + { + "description": "Negative padding PACKED_BIT", + "valid": false, + "vector": [1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": -1 + }, + { + "description": "Vector with float values PACKED_BIT", + "valid": false, + "vector": [127.5], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + } + ] +} + diff --git a/source/bson-corpus/tests/binary.json b/source/bson-corpus/tests/binary.json index 
20aaef743b..0e0056f3a2 100644 --- a/source/bson-corpus/tests/binary.json +++ b/source/bson-corpus/tests/binary.json @@ -74,6 +74,36 @@ "description": "$type query operator (conflicts with legacy $binary form with $type field)", "canonical_bson": "180000000378001000000010247479706500020000000000", "canonical_extjson": "{\"x\" : { \"$type\" : {\"$numberInt\": \"2\"}}}" + }, + { + "description": "subtype 0x09 Vector FLOAT32", + "canonical_bson": "170000000578000A0000000927000000FE420000E04000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector INT8", + "canonical_bson": "11000000057800040000000903007F0700", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector PACKED_BIT", + "canonical_bson": "11000000057800040000000910007F0700", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) FLOAT32", + "canonical_bson": "0F0000000578000200000009270000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) INT8", + "canonical_bson": "0F0000000578000200000009030000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) PACKED_BIT", + "canonical_bson": "0F0000000578000200000009100000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" } ], "decodeErrors": [ diff --git a/source/index.md b/source/index.md index 22ba76c141..fcf33d1933 100644 --- a/source/index.md +++ b/source/index.md @@ -3,6 +3,7 @@ - [Atlas Serverless Tests](serverless-testing/README.md) - [Authentication](auth/auth.md) - [BSON Binary 
Encrypted](bson-binary-encrypted/binary-encrypted.md) +- [BSON Binary Subtype 9 - Vector](bson-binary-vector/bson-binary-vector.md) - [BSON Binary UUID](bson-binary-uuid/uuid.md) - [BSON Corpus](bson-corpus/bson-corpus.md) - [BSON Decimal128](bson-decimal128/decimal128.md)