From ec64aa9b868368c4f8e544bb239a57a31fbd9ddf Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 13 Sep 2024 17:17:27 -0400 Subject: [PATCH 01/30] Added bson_corpus test new binary subtype 9: vectors --- source/bson-corpus/tests/binary.json | 30 ++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/source/bson-corpus/tests/binary.json b/source/bson-corpus/tests/binary.json index 20aaef743b..0e0056f3a2 100644 --- a/source/bson-corpus/tests/binary.json +++ b/source/bson-corpus/tests/binary.json @@ -74,6 +74,36 @@ "description": "$type query operator (conflicts with legacy $binary form with $type field)", "canonical_bson": "180000000378001000000010247479706500020000000000", "canonical_extjson": "{\"x\" : { \"$type\" : {\"$numberInt\": \"2\"}}}" + }, + { + "description": "subtype 0x09 Vector FLOAT32", + "canonical_bson": "170000000578000A0000000927000000FE420000E04000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector INT8", + "canonical_bson": "11000000057800040000000903007F0700", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector PACKED_BIT", + "canonical_bson": "11000000057800040000000910007F0700", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) FLOAT32", + "canonical_bson": "0F0000000578000200000009270000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) INT8", + "canonical_bson": "0F0000000578000200000009030000", + "canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "subtype 0x09 Vector (Zero-length) PACKED_BIT", + "canonical_bson": "0F0000000578000200000009100000", + 
"canonical_extjson": "{\"x\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" } ], "decodeErrors": [ From d5ab5f1ac3dd20e6e287841b8c1dd8a1a610ba28 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 16 Sep 2024 16:49:21 -0400 Subject: [PATCH 02/30] Added first draft of Binary Vector subtype spec markdown --- source/bson-corpus/binary-vector-subtype.md | 94 +++++++++++++++++++++ source/index.md | 1 + 2 files changed, 95 insertions(+) create mode 100644 source/bson-corpus/binary-vector-subtype.md diff --git a/source/bson-corpus/binary-vector-subtype.md b/source/bson-corpus/binary-vector-subtype.md new file mode 100644 index 0000000000..c7bb3c6c2c --- /dev/null +++ b/source/bson-corpus/binary-vector-subtype.md @@ -0,0 +1,94 @@ +# BSON Binary Subtype 9 - Vector + +- Status: Pending +- Minimum Server Version: N/A + +______________________________________________________________________ + +## Abstract + +This document describes the addition of a new subtype to the Binary BSON type. This subtype is used for efficient +storage and retrieval of vectors. Vectors here refer to densely packed arrays of numbers, all of the same type. + +## Motivation + +These representations correspond to the numeric types supported by popular numerical libraries for vector processing, +such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed +format used by these libraries can result in up to 8x memory savings and orders of magnitude improvement in processing +efficiency. Without this support, MongoDB will be at a competitive disadvantage to databases that do and our users will +bear the additional cost of storing and processing vector data. + +`*` The early addition of the "Packed Bit" representation was to facilitate partnerships in the expanding market of +Generative AI, specifically in Vector Quantization. 
Succinctly put, a Binary Quantized Vector is just a vector of 0s and +1s (bits), but it is often represented as a list of uint8 (int in Python). So, for example, the vector `[255, 0]` would +be shorthand for the 16 bit vector `[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0]`.\ +The authors are well-aware of the inherent +ambiguity here. This is a market-standard, unfortunately. Change is inevitable. + +## META + +The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and +"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). + +## Specification + +This spec introduces a new BSON binary subtype, the vector, with value `"\x09"`. Each vector can take one of multiple +data types (dtypes). The following table lists the first dtypes implemented. + +| Vector data type | Alias | Bits per vector element | [PyArrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | +| ---------------- | ---------- | ----------------------- | ------------------------------------------------------------------------------------------- | +| `0x03` | INT8 | 8 | INT8 | +| `0x27` | FLOAT32 | 32 | FLOAT | +| `0x10` | PACKED_BIT | 1 `*` | BOOL | + +As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of +bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the +final byte that are to be ignored. + +The binary structure the vector subtype's value is this. Following the binary subtype `0x09` is a two-element byte +array. + +- The first byte (dtype) describes its data type, such as float32 or int8. The table above shows the implemented + initially implemented in Python. 
The complete list of data types runs from `0x02` to `0x4b` + +- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value.Ω + +- The remainder contains the actual vector elements packed according to dtype. + +All values use the little-endian format. + +## Reference Implementation + +Please consult the Python driver's `pymongo.binary` module. Prose tests described below can be found in +`test.test_bson.TestBSON.test_vector`. + +## Prose Tests + +The following tests have not yet been automated, but MUST still be tested. + +### 1. Standard encoding / decoding from a list of numbers + +For each data type, the API must provide an idiomatic way to consume a list of that type that encodes to BSON and +decodes back to its original form. + +### 2. JSON functionality + +For each data type, the API must provide an idiomatic way to consume a list of that type that dumps to JSON and loads +back to its original form. + +### 3. PACKED_BIT (Binary Quantized) Vector Tests + +PACKED_BIT vectors must provide a method to consume a list of integers in \[0, 255\] that actually is a representation +of a vector of 0s and 1s (plus additionally padding if appropriate) and reproduce these inputs. + +PACKED_BIT vectors should also be able to be output in an idiomatic format (e.g. `List[int]`, `List`) the true +mathematical representation of the vector. This being a vector of 0s and 1s with any additional elements from padding +discarded. + +### 4. Invalid cases + +Because we the vector represents data types that are often not native to a driver's language, it is important that +invalid numbers are trapped. + +- For `INT8`, only numbers within `[-128, 127]` are permitted. +- For `PACKED_BIT`, only numbers within `[0, 255]` are permitted. 
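Review aid, not part of the patch: the layout described in the spec text above (dtype byte, padding byte, then little-endian elements) and the range checks from the "Invalid cases" section can be sketched with Python's standard `struct` module. The names `encode_vector` and `decode_vector` are hypothetical and are not taken from `pymongo.binary`. Two checks are assumptions rather than spec text: padding is limited to `[0, 7]`, and nonzero padding is rejected for dtypes other than PACKED_BIT.

```python
import struct

# Dtype bytes as listed in the spec's data type table.
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10

# struct format code and permitted integer range per dtype (None = no range check).
_LAYOUT = {
    INT8: ("b", (-128, 127)),
    FLOAT32: ("f", None),
    PACKED_BIT: ("B", (0, 255)),
}


def encode_vector(values, dtype, padding=0):
    """Return the subtype 9 payload: dtype byte, padding byte, packed elements."""
    if dtype not in _LAYOUT:
        raise ValueError(f"unsupported dtype: {dtype:#04x}")
    # Assumption: padding counts bits of a single byte, so it must be 0..7.
    if not 0 <= padding <= 7:
        raise ValueError("padding must be within [0, 7]")
    # Assumption: only PACKED_BIT can leave unused bits in the final byte.
    if padding and dtype != PACKED_BIT:
        raise ValueError("padding is only accepted for PACKED_BIT")
    fmt, bounds = _LAYOUT[dtype]
    if bounds is not None:
        low, high = bounds
        for value in values:
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                raise ValueError(f"{value!r} is outside [{low}, {high}]")
    # "<" selects little-endian with no alignment padding between elements.
    body = struct.pack(f"<{len(values)}{fmt}", *values)
    return bytes([dtype, padding]) + body


def decode_vector(payload):
    """Split a subtype 9 payload back into (values, dtype, padding)."""
    dtype, padding = payload[0], payload[1]
    fmt, _ = _LAYOUT[dtype]
    body = payload[2:]
    count = len(body) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", body)), dtype, padding
```

With these definitions, `encode_vector([127, 7], INT8)` yields the bytes `03 00 7F 07`, the same four-byte payload that appears at the end of the INT8 corpus entry in PATCH 01. Padding is carried through as metadata only; trimming the ignored bits of a PACKED_BIT vector is left out of the sketch.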
diff --git a/source/index.md b/source/index.md index f2e2c8719b..9797508f94 100644 --- a/source/index.md +++ b/source/index.md @@ -1,6 +1,7 @@ # MongoDB Specifications - [BSON Binary Subtype 6](client-side-encryption/subtype6.md) +- [BSON Binary Subtype 9 - Vector](bson-corpus/binary-vector-subtype.md) - [BSON Corpus](bson-corpus/bson-corpus.md) - [BSON Decimal128 Type Handling in Drivers](bson-decimal128/decimal128.md) - [Causal Consistency Specification](causal-consistency/causal-consistency.md) From 91212cac17b110f289545130b8f0fa78500173f9 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 17 Sep 2024 15:55:10 -0400 Subject: [PATCH 03/30] Move bson-binary-vector.md from bson-corpus to its own dir --- .../bson-binary-vector.md} | 16 +++++++--------- source/index.md | 2 +- 2 files changed, 8 insertions(+), 10 deletions(-) rename source/{bson-corpus/binary-vector-subtype.md => bson-binary-vector/bson-binary-vector.md} (85%) diff --git a/source/bson-corpus/binary-vector-subtype.md b/source/bson-binary-vector/bson-binary-vector.md similarity index 85% rename from source/bson-corpus/binary-vector-subtype.md rename to source/bson-binary-vector/bson-binary-vector.md index c7bb3c6c2c..24195b0fe8 100644 --- a/source/bson-corpus/binary-vector-subtype.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -15,15 +15,13 @@ storage and retrieval of vectors. Vectors here refer to densely packed arrays of These representations correspond to the numeric types supported by popular numerical libraries for vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed format used by these libraries can result in up to 8x memory savings and orders of magnitude improvement in processing -efficiency. Without this support, MongoDB will be at a competitive disadvantage to databases that do and our users will -bear the additional cost of storing and processing vector data. 
- -`*` The early addition of the "Packed Bit" representation was to facilitate partnerships in the expanding market of -Generative AI, specifically in Vector Quantization. Succinctly put, a Binary Quantized Vector is just a vector of 0s and -1s (bits), but it is often represented as a list of uint8 (int in Python). So, for example, the vector `[255, 0]` would -be shorthand for the 16 bit vector `[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0]`.\ -The authors are well-aware of the inherent -ambiguity here. This is a market-standard, unfortunately. Change is inevitable. +efficiency. + +`*` Succinctly put, a Binary Quantized Vector is just a vector of 0s and 1s (bits), but it is often represented as a +list of uint8 (int in Python). So, for example, the vector `[255, 0]` would be shorthand for the 16 bit vector +`[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0]`.\ +The authors are well-aware of the inherent ambiguity here. This is a +market-standard, unfortunately. Change is inevitable. ## META diff --git a/source/index.md b/source/index.md index 9797508f94..8a52f2775b 100644 --- a/source/index.md +++ b/source/index.md @@ -1,7 +1,7 @@ # MongoDB Specifications - [BSON Binary Subtype 6](client-side-encryption/subtype6.md) -- [BSON Binary Subtype 9 - Vector](bson-corpus/binary-vector-subtype.md) +- [BSON Binary Subtype 9 - Vector](bson-binary-vector/bson-binary-vector.md) - [BSON Corpus](bson-corpus/bson-corpus.md) - [BSON Decimal128 Type Handling in Drivers](bson-decimal128/decimal128.md) - [Causal Consistency Specification](causal-consistency/causal-consistency.md) From 8757836b79ec17b58b2ecbbd0e9b0e1278db3077 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 17 Sep 2024 16:54:51 -0400 Subject: [PATCH 04/30] Updates based on feedback. 
--- .../bson-binary-vector/bson-binary-vector.md | 77 +++++++------------ source/bson-binary-vector/tests/README.md | 0 2 files changed, 29 insertions(+), 48 deletions(-) create mode 100644 source/bson-binary-vector/tests/README.md diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 24195b0fe8..8c02addb3a 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -14,24 +14,22 @@ storage and retrieval of vectors. Vectors here refer to densely packed arrays of These representations correspond to the numeric types supported by popular numerical libraries for vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed -format used by these libraries can result in up to 8x memory savings and orders of magnitude improvement in processing -efficiency. +format used by these libraries can result in up to significant memory savings and processing efficiency. -`*` Succinctly put, a Binary Quantized Vector is just a vector of 0s and 1s (bits), but it is often represented as a -list of uint8 (int in Python). So, for example, the vector `[255, 0]` would be shorthand for the 16 bit vector -`[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0]`.\ -The authors are well-aware of the inherent ambiguity here. This is a -market-standard, unfortunately. Change is inevitable. - -## META +### META The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). ## Specification -This spec introduces a new BSON binary subtype, the vector, with value `"\x09"`. Each vector can take one of multiple -data types (dtypes). The following table lists the first dtypes implemented. 
+This specification introduces a new BSON binary subtype, the vector, with value `"\x09"`. + +Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. + +#### Data Types + +Each vector can take one of multiple data types (dtypes). The following table lists the first dtypes implemented. | Vector data type | Alias | Bits per vector element | [PyArrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | | ---------------- | ---------- | ----------------------- | ------------------------------------------------------------------------------------------- | @@ -39,54 +37,37 @@ data types (dtypes). The following table lists the first dtypes implemented. | `0x27` | FLOAT32 | 32 | FLOAT | | `0x10` | PACKED_BIT | 1 `*` | BOOL | +`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of +integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16 bit vector +`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course, +some languages, Python for one, do not have an uint8 type, so must be represented as an int in memory, but not on disk. + +The authors are well-aware of the inherent ambiguity here, and alternatives. This is a market-standard, unfortunately. +Change is inevitable. + +#### Byte padding + As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the final byte that are to be ignored. -The binary structure the vector subtype's value is this. Following the binary subtype `0x09` is a two-element byte -array. +#### Binary structure -- The first byte (dtype) describes its data type, such as float32 or int8. 
The table above shows the implemented - initially implemented in Python. The complete list of data types runs from `0x02` to `0x4b` +Following the binary subtype `0x09` is a two-element byte array. -- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value.Ω +- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may + increase. + +- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. - The remainder contains the actual vector elements packed according to dtype. All values use the little-endian format. -## Reference Implementation - -Please consult the Python driver's `pymongo.binary` module. Prose tests described below can be found in -`test.test_bson.TestBSON.test_vector`. - -## Prose Tests - -The following tests have not yet been automated, but MUST still be tested. - -### 1. Standard encoding / decoding from a list of numbers - -For each data type, the API must provide an idiomatic way to consume a list of that type that encodes to BSON and -decodes back to its original form. - -### 2. JSON functionality - -For each data type, the API must provide an idiomatic way to consume a list of that type that dumps to JSON and loads -back to its original form. - -### 3. PACKED_BIT (Binary Quantized) Vector Tests - -PACKED_BIT vectors must provide a method to consume a list of integers in \[0, 255\] that actually is a representation -of a vector of 0s and 1s (plus additionally padding if appropriate) and reproduce these inputs. - -PACKED_BIT vectors should also be able to be output in an idiomatic format (e.g. `List[int]`, `List`) the true -mathematical representation of the vector. This being a vector of 0s and 1s with any additional elements from padding -discarded. +### Reference Implementation -### 4. Invalid cases +Please consult the Python driver's `pymongo.binary` module. 
-Because we the vector represents data types that are often not native to a driver's language, it is important that -invalid numbers are trapped. +### Test Plan -- For `INT8`, only numbers within `[-128, 127]` are permitted. -- For `PACKED_BIT`, only numbers within `[0, 255]` are permitted. +See the [README](tests/README.md) for tests. diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md new file mode 100644 index 0000000000..e69de29bb2 From 830632a54af3aed39d063422549eb4b7c21ef356 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 10:27:14 -0400 Subject: [PATCH 05/30] Added README.md for Binary Vector tests --- source/bson-binary-vector/tests/README.md | 40 +++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md index e69de29bb2..b217705ee8 100644 --- a/source/bson-binary-vector/tests/README.md +++ b/source/bson-binary-vector/tests/README.md @@ -0,0 +1,40 @@ +# Testing Binary subtype 9: Vector + +The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to +the specification. + +These tests focus on the roundtrip of the list numbers as input/output, along with their data type and byte padding. + +Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector +to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype. + +Each test case here pertains to a single vector. The inputs required to create the Binary BSON object are defined, and +when valid, the Canonical BSON and Extended JSON representations are included for comparison. + +## Version + +Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. + +## Format + +#### Top level keys + +Each JSON file contains three top-level keys. 
+ +- `description`: human-readable description of what is in the file +- `test_key`: Field name used when decoding/encoding a BSON document containing the single BSON Binary for the test + case. Applies to *every* case. +- `tests`: array of test case objects, each of which have the following keys. Valid cases will also contain additional + binary and json encoding values. + +#### Keys of tests objects + +- `description`: string describing the test. +- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input. +- `vector`: list of numbers +- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27") +- `dtype_alias`: (optional) string defining the data dtype, perhaps as Enum. +- `padding`: (optional) integer for byte padding. Defaults to 0. +- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string. +- `canonical_extjson`: (required if valid is true) string containing a Canonical Extended JSON document. Because this is + itself embedded as a *string* inside a JSON document, characters like quote and backslash are escaped. 
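Review aid, not part of the patch: the keys described in the README above suggest what a bespoke runner has to compare for a valid case. The sketch below wraps an already-encoded vector payload in a one-field BSON document and checks it against `canonical_bson` and `canonical_extjson`. The function names are hypothetical, only the standard library is used, and the sample values are copied from the "Simple Vector INT8" case in the test files added later in this series.

```python
import base64
import json
import struct


def binary_document(key, payload, subtype=0x09):
    """Build a BSON document containing a single Binary element."""
    element = b"\x05"                              # BSON element type: binary
    element += key.encode("utf-8") + b"\x00"       # field name as a cstring
    element += struct.pack("<i", len(payload))     # int32 length of the payload
    element += bytes([subtype]) + payload          # subtype byte, then the data
    total = 4 + len(element) + 1                   # length prefix + element + terminator
    return struct.pack("<i", total) + element + b"\x00"


def matches_case(test_key, case, payload):
    """Return True if payload reproduces both canonical encodings of a valid case."""
    if not case["valid"]:
        raise ValueError("invalid cases carry no canonical encodings to compare")
    bson_hex = binary_document(test_key, payload).hex().upper()
    binary = json.loads(case["canonical_extjson"])[test_key]["$binary"]
    expected_b64 = base64.b64encode(payload).decode("ascii")
    return (
        bson_hex == case["canonical_bson"]
        and binary["subType"] == "09"
        and binary["base64"] == expected_b64
    )


# Values copied from the "Simple Vector INT8" test case in this patch series.
SAMPLE_CASE = {
    "valid": True,
    "canonical_bson": "1600000005766563746F7200040000000903007F0700",
    "canonical_extjson": '{"vector": {"$binary": {"base64": "AwB/Bw==", "subType": "09"}}}',
}
SAMPLE_PAYLOAD = bytes.fromhex("03007F07")  # dtype 0x03, padding 0, elements 127 and 7
```

Calling `matches_case("vector", SAMPLE_CASE, SAMPLE_PAYLOAD)` compares the document built from `test_key` and the payload with both canonical forms. A real runner would produce the payload through the driver's own vector API rather than from a hex literal.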
From 67b410d226af43f3b74cff93f5f77c45776e2bb3 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 10:34:09 -0400 Subject: [PATCH 06/30] Added tests for binary vector subtype --- .../tests/vector-test-cases.json | 145 ++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 source/bson-binary-vector/tests/vector-test-cases.json diff --git a/source/bson-binary-vector/tests/vector-test-cases.json b/source/bson-binary-vector/tests/vector-test-cases.json new file mode 100644 index 0000000000..ffd322a9ab --- /dev/null +++ b/source/bson-binary-vector/tests/vector-test-cases.json @@ -0,0 +1,145 @@ +{ + "description": "Basic Tests of Binary Vectors, subtype 9", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector INT8", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000903007F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "Simple Vector FLOAT32", + "valid": true, + "vector": [127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" + }, + { + "description": "Simple Vector PACKED_BIT", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000910007F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector INT8", + "valid": true, + "vector": [], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009030000", + "canonical_extjson": "{\"vector\": {\"$binary\": 
{\"base64\": \"AwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector FLOAT32", + "valid": true, + "vector": [], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009270000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector PACKED_BIT", + "valid": true, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009100000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" + }, + { + "description": "Infinity Vector FLOAT32", + "valid": true, + "vector": ["-inf", 0.0, "inf"], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}" + }, + { + "description": "PACKED_BIT with padding", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 3, + "canonical_bson": "1600000005766563746F7200040000000910037F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}" + } + ], + "invalid": [ + { + "description": "Overflow Vector INT8", + "valid": false, + "vector": [256], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "Overflow Vector PACKED_BIT", + "valid": false, + "vector": [256], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Underflow Vector INT8", + "valid": false, + "vector": [-1], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "Underflow Vector PACKED_BIT", + "valid": false, + "vector": [-1], + "dtype_hex": "0x10", + "dtype_alias": 
"PACKED_BIT", + "padding": 0 + }, + { + "description": "INT8 with padding", + "valid": false, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 3 + }, + { + "description": "FLOAT32 with padding", + "valid": false, + "vector": [127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 3 + }, + { + "description": "INT8 with float inputs", + "valid": false, + "vector": [127.77, 7.77], + "dtype_hex": "0x27", + "dtype_alias": "INT8", + "padding": 0 + } + ] +} + From 07667a1f6d00b92f4c03c9949c1f556fe5eed998 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 18:19:46 -0400 Subject: [PATCH 07/30] Broke tests into 3 files by dtype --- source/bson-binary-vector/tests/float32.json | 45 ++++++ source/bson-binary-vector/tests/int8.json | 59 +++++++ .../bson-binary-vector/tests/packed_bit.json | 53 +++++++ .../tests/vector-test-cases.json | 145 ------------------ 4 files changed, 157 insertions(+), 145 deletions(-) create mode 100644 source/bson-binary-vector/tests/float32.json create mode 100644 source/bson-binary-vector/tests/int8.json create mode 100644 source/bson-binary-vector/tests/packed_bit.json delete mode 100644 source/bson-binary-vector/tests/vector-test-cases.json diff --git a/source/bson-binary-vector/tests/float32.json b/source/bson-binary-vector/tests/float32.json new file mode 100644 index 0000000000..9ec72861d4 --- /dev/null +++ b/source/bson-binary-vector/tests/float32.json @@ -0,0 +1,45 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector FLOAT32", + "valid": true, + "vector": [127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector FLOAT32", 
+ "valid": true, + "vector": [], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009270000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "Infinity Vector FLOAT32", + "valid": true, + "vector": ["-inf", 0.0, "inf"], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}" + }, + { + "description": "FLOAT32 with padding", + "valid": false, + "vector": [127.0, 7.0], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 3 + } + ] +} + diff --git a/source/bson-binary-vector/tests/int8.json b/source/bson-binary-vector/tests/int8.json new file mode 100644 index 0000000000..92eab609e8 --- /dev/null +++ b/source/bson-binary-vector/tests/int8.json @@ -0,0 +1,59 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype INT8", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector INT8", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000903007F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector INT8", + "valid": true, + "vector": [], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009030000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}" + }, + { + "description": "Overflow Vector INT8", + "valid": false, + "vector": [128], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "Underflow Vector INT8", + "valid": false, + 
"vector": [-129], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + }, + { + "description": "INT8 with padding", + "valid": false, + "vector": [127, 7], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 3 + }, + { + "description": "INT8 with float inputs", + "valid": false, + "vector": [127.77, 7.77], + "dtype_hex": "0x03", + "dtype_alias": "INT8", + "padding": 0 + } + ] +} + diff --git a/source/bson-binary-vector/tests/packed_bit.json b/source/bson-binary-vector/tests/packed_bit.json new file mode 100644 index 0000000000..de108876a9 --- /dev/null +++ b/source/bson-binary-vector/tests/packed_bit.json @@ -0,0 +1,53 @@ +{ + "description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT", + "test_key": "vector", + "tests": [ + { + "description": "Simple Vector PACKED_BIT", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1600000005766563746F7200040000000910007F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "Empty Vector PACKED_BIT", + "valid": true, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0, + "canonical_bson": "1400000005766563746F72000200000009100000", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" + }, + { + "description": "PACKED_BIT with padding", + "valid": true, + "vector": [127, 7], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 3, + "canonical_bson": "1600000005766563746F7200040000000910037F0700", + "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}" + }, + { + "description": "Overflow Vector PACKED_BIT", + "valid": false, + "vector": [256], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Underflow Vector PACKED_BIT", + "valid": false, + "vector": 
[-1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + } + ] +} + diff --git a/source/bson-binary-vector/tests/vector-test-cases.json b/source/bson-binary-vector/tests/vector-test-cases.json deleted file mode 100644 index ffd322a9ab..0000000000 --- a/source/bson-binary-vector/tests/vector-test-cases.json +++ /dev/null @@ -1,145 +0,0 @@ -{ - "description": "Basic Tests of Binary Vectors, subtype 9", - "test_key": "vector", - "tests": [ - { - "description": "Simple Vector INT8", - "valid": true, - "vector": [127, 7], - "dtype_hex": "0x03", - "dtype_alias": "INT8", - "padding": 0, - "canonical_bson": "1600000005766563746F7200040000000903007F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" - }, - { - "description": "Simple Vector FLOAT32", - "valid": true, - "vector": [127.0, 7.0], - "dtype_hex": "0x27", - "dtype_alias": "FLOAT32", - "padding": 0, - "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" - }, - { - "description": "Simple Vector PACKED_BIT", - "valid": true, - "vector": [127, 7], - "dtype_hex": "0x10", - "dtype_alias": "PACKED_BIT", - "padding": 0, - "canonical_bson": "1600000005766563746F7200040000000910007F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" - }, - { - "description": "Empty Vector INT8", - "valid": true, - "vector": [], - "dtype_hex": "0x03", - "dtype_alias": "INT8", - "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009030000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}" - }, - { - "description": "Empty Vector FLOAT32", - "valid": true, - "vector": [], - "dtype_hex": "0x27", - "dtype_alias": "FLOAT32", - "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009270000", - 
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" - }, - { - "description": "Empty Vector PACKED_BIT", - "valid": true, - "vector": [], - "dtype_hex": "0x10", - "dtype_alias": "PACKED_BIT", - "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009100000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" - }, - { - "description": "Infinity Vector FLOAT32", - "valid": true, - "vector": ["-inf", 0.0, "inf"], - "dtype_hex": "0x27", - "dtype_alias": "FLOAT32", - "padding": 0, - "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}" - }, - { - "description": "PACKED_BIT with padding", - "valid": true, - "vector": [127, 7], - "dtype_hex": "0x10", - "dtype_alias": "PACKED_BIT", - "padding": 3, - "canonical_bson": "1600000005766563746F7200040000000910037F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}" - } - ], - "invalid": [ - { - "description": "Overflow Vector INT8", - "valid": false, - "vector": [256], - "dtype_hex": "0x03", - "dtype_alias": "INT8", - "padding": 0 - }, - { - "description": "Overflow Vector PACKED_BIT", - "valid": false, - "vector": [256], - "dtype_hex": "0x10", - "dtype_alias": "PACKED_BIT", - "padding": 0 - }, - { - "description": "Underflow Vector INT8", - "valid": false, - "vector": [-1], - "dtype_hex": "0x03", - "dtype_alias": "INT8", - "padding": 0 - }, - { - "description": "Underflow Vector PACKED_BIT", - "valid": false, - "vector": [-1], - "dtype_hex": "0x10", - "dtype_alias": "PACKED_BIT", - "padding": 0 - }, - { - "description": "INT8 with padding", - "valid": false, - "vector": [127, 7], - "dtype_hex": "0x03", - "dtype_alias": "INT8", - "padding": 3 - }, - { - "description": "FLOAT32 with padding", - "valid": false, - "vector": [127.0, 
7.0], - "dtype_hex": "0x27", - "dtype_alias": "FLOAT32", - "padding": 3 - }, - { - "description": "INT8 with float inputs", - "valid": false, - "vector": [127.77, 7.77], - "dtype_hex": "0x27", - "dtype_alias": "INT8", - "padding": 0 - } - ] -} - From 80f19fa600078d08184c9cf4facee1e12877384f Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 18:36:42 -0400 Subject: [PATCH 08/30] Added github link in Reference Implementation --- source/bson-binary-vector/bson-binary-vector.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 8c02addb3a..16031f6988 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -66,7 +66,8 @@ All values use the little-endian format. ### Reference Implementation -Please consult the Python driver's `pymongo.binary` module. +Please consult the Python driver's +[pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) module. ### Test Plan From 7255b6c47fac0b8ea12295fb55e5ecac16385e39 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 18:43:14 -0400 Subject: [PATCH 09/30] PyArrow -> Arrow --- source/bson-binary-vector/bson-binary-vector.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 16031f6988..47d64cd3c3 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -31,11 +31,11 @@ Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and Each vector can take one of multiple data types (dtypes). The following table lists the first dtypes implemented. 
-| Vector data type | Alias | Bits per vector element | [PyArrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | -| ---------------- | ---------- | ----------------------- | ------------------------------------------------------------------------------------------- | -| `0x03` | INT8 | 8 | INT8 | -| `0x27` | FLOAT32 | 32 | FLOAT | -| `0x10` | PACKED_BIT | 1 `*` | BOOL | +| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | +| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | +| `0x03` | INT8 | 8 | INT8 | +| `0x27` | FLOAT32 | 32 | FLOAT | +| `0x10` | PACKED_BIT | 1 `*` | BOOL | `*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16 bit vector From 0ff289bd80fffda25cb8e29966fd56a0865286ec Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 20 Sep 2024 22:02:59 -0400 Subject: [PATCH 10/30] Added reference to jira ticket --- source/bson-binary-vector/bson-binary-vector.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 47d64cd3c3..6b26cf36f6 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -66,8 +66,7 @@ All values use the little-endian format. ### Reference Implementation -Please consult the Python driver's -[pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) module. 
+- PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) ### Test Plan From b3d6ea002a204e55220a434b2b2404508239ce12 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 23 Sep 2024 14:51:22 -0400 Subject: [PATCH 11/30] Added example for Binary structure --- source/bson-binary-vector/bson-binary-vector.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 6b26cf36f6..074fa24a85 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -62,6 +62,9 @@ Following the binary subtype `0x09` is a two-element byte array. - The remainder contains the actual vector elements packed according to dtype. +For example, a vector \[6, 7\] of dtype PACKED_BIT (\\x10) with a padding of 3 would look like this: +`b"\x10\x03\x06\x07'`: 1 byte for dtype, 1 for padding, and 1 for each uint8. + All values use the little-endian format. ### Reference Implementation From 8cfc15a9cf153349112c5e88f019e0e265b8bb05 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 23 Sep 2024 16:12:05 -0400 Subject: [PATCH 12/30] Added table visualization of binary structure --- .../bson-binary-vector/bson-binary-vector.md | 33 +++++++++++++++++-- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 074fa24a85..55363a6bca 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -29,7 +29,7 @@ Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and #### Data Types -Each vector can take one of multiple data types (dtypes). The following table lists the first dtypes implemented. +Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. 
| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | | ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | @@ -53,7 +53,7 @@ final byte that are to be ignored. #### Binary structure -Following the binary subtype `0x09` is a two-element byte array. +Following the binary subtype `0x09` a two-element byte array of metadata precedes the packed numbers. - The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may increase. @@ -62,9 +62,36 @@ Following the binary subtype `0x09` is a two-element byte array. - The remainder contains the actual vector elements packed according to dtype. -For example, a vector \[6, 7\] of dtype PACKED_BIT (\\x10) with a padding of 3 would look like this: +For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3` would look like this: `b"\x10\x03\x06\x07'`: 1 byte for dtype, 1 for padding, and 1 for each uint8. + + + + + + + + + + + + + + + + + + + + + + + + + +
1st byte: dtype (from list in previous table) 2nd byte: padding (values in [0,7])binary numbers packed according to dtype
0000101090000011...
+ All values use the little-endian format. ### Reference Implementation From a6ee71bf62bf98e49469a93735ecba372a3b8d64 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 23 Sep 2024 16:17:18 -0400 Subject: [PATCH 13/30] typo --- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 55363a6bca..c71be7d1d9 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -80,7 +80,7 @@ For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3 0 1 0 - 9 + 0 0 0 0 From 0d10725f7b16476e501a35f4efa170fe7a4e29e8 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Thu, 26 Sep 2024 16:39:23 -0400 Subject: [PATCH 14/30] Updates from Anna's comments --- source/bson-binary-vector/bson-binary-vector.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index c71be7d1d9..ed00d20e50 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -7,8 +7,8 @@ ______________________________________________________________________ ## Abstract -This document describes the addition of a new subtype to the Binary BSON type. This subtype is used for efficient -storage and retrieval of vectors. Vectors here refer to densely packed arrays of numbers, all of the same type. +This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors +here refer to densely packed arrays of numbers, all of the same type. ## Motivation @@ -38,7 +38,7 @@ Each vector can take one of multiple data types (dtypes). 
The following table li | `0x10` | PACKED_BIT | 1 `*` | BOOL | `*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of -integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16 bit vector +integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector `[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course, some languages, Python for one, do not have an uint8 type, so must be represented as an int in memory, but not on disk. From 5935ce0262b01d159e0b3d0d9f6146215d2acc04 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Thu, 26 Sep 2024 18:15:42 -0400 Subject: [PATCH 15/30] Correction from Shane's comment --- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index ed00d20e50..e36935df5d 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -14,7 +14,7 @@ here refer to densely packed arrays of numbers, all of the same type. These representations correspond to the numeric types supported by popular numerical libraries for vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed -format used by these libraries can result in up to significant memory savings and processing efficiency. +format used by these libraries can result in significant memory savings and processing efficiency. 
### META From a8b464e67e1b92553fbbeea73f6ec55a39176877 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Thu, 26 Sep 2024 19:31:13 -0400 Subject: [PATCH 16/30] Fix typo in binary structure html table --- source/bson-binary-vector/bson-binary-vector.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index e36935df5d..fa68e5ee37 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -75,10 +75,10 @@ For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3 0 0 0 - 0 1 0 - 1 + 0 + 0 0 0 0 From f50677b672a9c0f0060029dd1f790f2621574a4f Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 27 Sep 2024 15:32:08 -0400 Subject: [PATCH 17/30] Moved editorial comments about PACKED_BIT ambiguity to an FAQ --- .../bson-binary-vector/bson-binary-vector.md | 24 ++++++++++++------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index fa68e5ee37..809eda2fc0 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -27,7 +27,7 @@ This specification introduces a new BSON binary subtype, the vector, with value Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. -#### Data Types +### Data Types Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. @@ -42,16 +42,13 @@ integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthan `[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course, some languages, Python for one, do not have an uint8 type, so must be represented as an int in memory, but not on disk. 
-The authors are well-aware of the inherent ambiguity here, and alternatives. This is a market-standard, unfortunately. -Change is inevitable. - -#### Byte padding +### Byte padding As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the final byte that are to be ignored. -#### Binary structure +### Binary structure Following the binary subtype `0x09` a two-element byte array of metadata precedes the packed numbers. @@ -94,10 +91,21 @@ For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3 All values use the little-endian format. -### Reference Implementation +## Reference Implementation - PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) -### Test Plan +## Test Plan See the [README](tests/README.md) for tests. + +## FAQ + +- What MongoDB Server version does this apply to? + - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. +- In PACKED_BIT, why would one choose to use integers in \[0, 256)? + - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits), This technique is + widely used across different fields, such as data compression, communication protocols, and file formats, where you + want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example + in Python, see + [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits). From 2d4ea724c37724193e18aea249c268d31e85cdad Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 27 Sep 2024 15:39:27 -0400 Subject: [PATCH 18/30] Added Required Tests section to README. Removed JSON from tests. 
--- source/bson-binary-vector/tests/README.md | 38 +++++++++++++------ source/bson-binary-vector/tests/float32.json | 9 ++--- source/bson-binary-vector/tests/int8.json | 6 +-- .../bson-binary-vector/tests/packed_bit.json | 9 ++--- 4 files changed, 34 insertions(+), 28 deletions(-) diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md index b217705ee8..05fed67a40 100644 --- a/source/bson-binary-vector/tests/README.md +++ b/source/bson-binary-vector/tests/README.md @@ -3,31 +3,29 @@ The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to the specification. -These tests focus on the roundtrip of the list numbers as input/output, along with their data type and byte padding. +These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding. Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype. -Each test case here pertains to a single vector. The inputs required to create the Binary BSON object are defined, and -when valid, the Canonical BSON and Extended JSON representations are included for comparison. - -## Version - -Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. - ## Format +The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases, +under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the +vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common, +and specified at the top level. + #### Top level keys Each JSON file contains three top-level keys. 
- `description`: human-readable description of what is in the file -- `test_key`: Field name used when decoding/encoding a BSON document containing the single BSON Binary for the test +- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test case. Applies to *every* case. - `tests`: array of test case objects, each of which have the following keys. Valid cases will also contain additional binary and json encoding values. -#### Keys of tests objects +#### Keys of individual tests cases - `description`: string describing the test. - `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input. @@ -36,5 +34,21 @@ Each JSON file contains three top-level keys. - `dtype_alias`: (optional) string defining the data dtype, perhaps as Enum. - `padding`: (optional) integer for byte padding. Defaults to 0. - `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string. -- `canonical_extjson`: (required if valid is true) string containing a Canonical Extended JSON document. Because this is - itself embedded as a *string* inside a JSON document, characters like quote and backslash are escaped. + +## Required tests + +To prove correct in a valid case (`valid: true`), one MUST + +- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match + those provided in the JSON. +- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the + canonical_bson string. + +To prove correct in an invalid case (`valid:false`), one MUST + +- raise an exception when attempting to encode a document from the numeric values, dtype, and padding. + +## FAQ + +- What MongoDB Server version does this apply to? + - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. 
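
Reviewer note, not part of the patch: the "Required tests" steps in the README above can be sketched roughly as follows. This is a minimal illustration that walks the `canonical_bson` hex by hand with the standard library only. It assumes the document holds exactly one Binary field named by `test_key`. The function name `decode_vector_document` and the `DTYPE_*` constants are hypothetical names chosen for this sketch, not the driver's API, and no validation of padding ranges is attempted.

```python
import struct

# Hypothetical constants for illustration; values mirror the dtype table in the spec.
DTYPE_INT8 = 0x03
DTYPE_PACKED_BIT = 0x10
DTYPE_FLOAT32 = 0x27


def decode_vector_document(canonical_bson_hex, test_key):
    """Return (dtype, padding, values) from a one-field BSON document given as hex."""
    data = bytes.fromhex(canonical_bson_hex)

    # int32 total document length (little-endian), trailing 0x00 terminator.
    (doc_len,) = struct.unpack_from("<i", data, 0)
    if doc_len != len(data) or data[-1] != 0:
        raise ValueError("malformed BSON document")

    pos = 4
    if data[pos] != 0x05:  # element type 0x05 is BSON Binary
        raise ValueError("expected a BSON Binary element")
    pos += 1

    # Field name is a NUL-terminated string.
    key_end = data.index(b"\x00", pos)
    if data[pos:key_end].decode("utf-8") != test_key:
        raise ValueError("unexpected field name")
    pos = key_end + 1

    # int32 payload length, then the subtype byte.
    (bin_len,) = struct.unpack_from("<i", data, pos)
    pos += 4
    if data[pos] != 0x09:
        raise ValueError("expected binary subtype 9 (vector)")
    pos += 1

    payload = data[pos:pos + bin_len]
    if len(payload) != bin_len or bin_len < 2:
        raise ValueError("vector payload needs dtype and padding bytes")

    # Two metadata bytes precede the packed elements.
    dtype, padding = payload[0], payload[1]
    body = payload[2:]

    if dtype == DTYPE_INT8:
        values = list(struct.unpack(f"<{len(body)}b", body))
    elif dtype == DTYPE_FLOAT32:
        if len(body) % 4:
            raise ValueError("FLOAT32 data must be a multiple of 4 bytes")
        values = list(struct.unpack(f"<{len(body) // 4}f", body))
    elif dtype == DTYPE_PACKED_BIT:
        values = list(body)  # one uint8 per byte, as in the test files
    else:
        raise ValueError("unknown dtype")
    return dtype, padding, values
```

A test runner along these lines would compare the returned triple with the `dtype_hex`, `padding`, and `vector` keys of each valid case, and would separately encode from those keys and compare against `canonical_bson`.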
diff --git a/source/bson-binary-vector/tests/float32.json b/source/bson-binary-vector/tests/float32.json index 9ec72861d4..bbbe00b758 100644 --- a/source/bson-binary-vector/tests/float32.json +++ b/source/bson-binary-vector/tests/float32.json @@ -9,8 +9,7 @@ "dtype_hex": "0x27", "dtype_alias": "FLOAT32", "padding": 0, - "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}" + "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" }, { "description": "Empty Vector FLOAT32", @@ -19,8 +18,7 @@ "dtype_hex": "0x27", "dtype_alias": "FLOAT32", "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009270000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}" + "canonical_bson": "1400000005766563746F72000200000009270000" }, { "description": "Infinity Vector FLOAT32", @@ -29,8 +27,7 @@ "dtype_hex": "0x27", "dtype_alias": "FLOAT32", "padding": 0, - "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}" + "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00" }, { "description": "FLOAT32 with padding", diff --git a/source/bson-binary-vector/tests/int8.json b/source/bson-binary-vector/tests/int8.json index 92eab609e8..7529721e5e 100644 --- a/source/bson-binary-vector/tests/int8.json +++ b/source/bson-binary-vector/tests/int8.json @@ -9,8 +9,7 @@ "dtype_hex": "0x03", "dtype_alias": "INT8", "padding": 0, - "canonical_bson": "1600000005766563746F7200040000000903007F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}" + "canonical_bson": "1600000005766563746F7200040000000903007F0700" }, { "description": "Empty Vector INT8", @@ -19,8 +18,7 @@ 
"dtype_hex": "0x03", "dtype_alias": "INT8", "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009030000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}" + "canonical_bson": "1400000005766563746F72000200000009030000" }, { "description": "Overflow Vector INT8", diff --git a/source/bson-binary-vector/tests/packed_bit.json b/source/bson-binary-vector/tests/packed_bit.json index de108876a9..a41cd593f5 100644 --- a/source/bson-binary-vector/tests/packed_bit.json +++ b/source/bson-binary-vector/tests/packed_bit.json @@ -9,8 +9,7 @@ "dtype_hex": "0x10", "dtype_alias": "PACKED_BIT", "padding": 0, - "canonical_bson": "1600000005766563746F7200040000000910007F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}" + "canonical_bson": "1600000005766563746F7200040000000910007F0700" }, { "description": "Empty Vector PACKED_BIT", @@ -19,8 +18,7 @@ "dtype_hex": "0x10", "dtype_alias": "PACKED_BIT", "padding": 0, - "canonical_bson": "1400000005766563746F72000200000009100000", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}" + "canonical_bson": "1400000005766563746F72000200000009100000" }, { "description": "PACKED_BIT with padding", @@ -29,8 +27,7 @@ "dtype_hex": "0x10", "dtype_alias": "PACKED_BIT", "padding": 3, - "canonical_bson": "1600000005766563746F7200040000000910037F0700", - "canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}" + "canonical_bson": "1600000005766563746F7200040000000910037F0700" }, { "description": "Overflow Vector PACKED_BIT", From f50b1ccebe809821cf1c15ad42f6adf2646b68cf Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Fri, 27 Sep 2024 17:51:27 -0400 Subject: [PATCH 19/30] Further improvements to PACKED_BIT with padding example. 
--- .../bson-binary-vector/bson-binary-vector.md | 36 ++++++++++++++++--- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 809eda2fc0..276644728b 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -59,14 +59,22 @@ Following the binary subtype `0x09` a two-element byte array of metadata precede - The remainder contains the actual vector elements packed according to dtype. -For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3` would look like this: -`b"\x10\x03\x06\x07'`: 1 byte for dtype, 1 for padding, and 1 for each uint8. +All values use the little-endian format. + +#### Example + +Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`. + +In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8. + +We can visualize the binary representation like so: - + + @@ -82,14 +90,32 @@ For example, a vector `[6, 7]` of dtype PACKED_BIT (`\x10`) with a padding of `3 + + + + + + - + + + + + + + + + +
1st byte: dtype (from list in previous table) 2nd byte: padding (values in [0,7])binary numbers packed according to dtype1st uint8: 2382nd uint8: 224
00 0 0100111 0 1 1...1011100000
-All values use the little-endian format. +Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this! + +| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ## Reference Implementation From d6f160b90125bb653b9d8353f13610b3353ae5ae Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Sat, 28 Sep 2024 12:18:58 -0400 Subject: [PATCH 20/30] Fixed consistency for subtype reference. Follows Tech Design Doc --- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 276644728b..d2e22ead9f 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -50,7 +50,7 @@ final byte that are to be ignored. ### Binary structure -Following the binary subtype `0x09` a two-element byte array of metadata precedes the packed numbers. +Following the binary subtype `\x09` a two-element byte array of metadata precedes the packed numbers. - The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may increase. From 60088d9097014f1206553bcf4bd654633b30e134 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 7 Oct 2024 09:59:52 +0100 Subject: [PATCH 21/30] Addressed Neal's comments. 
--- source/bson-binary-vector/bson-binary-vector.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index d2e22ead9f..6aa065e802 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -46,16 +46,17 @@ some languages, Python for one, do not have an uint8 type, so must be represente As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the -final byte that are to be ignored. +final byte that are to be ignored. It is the least-significant bits that are ignored. ### Binary structure Following the binary subtype `\x09` a two-element byte array of metadata precedes the packed numbers. - The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may - increase. + increase. dtype is an unsigned integer. -- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. +- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative + integer. It must be present, even in cases where it is not applicable, and set to zero. - The remainder contains the actual vector elements packed according to dtype. From a1b87f7c6a78c17a0017cac37f9905a74ad7ee97 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 7 Oct 2024 11:55:07 +0100 Subject: [PATCH 22/30] Made clear that it is the least significant bit that is ignored. 
--- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 6aa065e802..090ff5e78f 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -46,7 +46,7 @@ some languages, Python for one, do not have an uint8 type, so must be represente As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of bytes, a second piece of metadata, the "padding" is included. This instructs the driver of the number of bits in the -final byte that are to be ignored. It is the least-significant bits that are ignored. +final byte that are to be ignored. The least-significant bits are ignored. ### Binary structure From d267b2aa2ac0f007a97c05f2863d228175f51223 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 7 Oct 2024 11:56:41 +0100 Subject: [PATCH 23/30] Additional invalid test cases for PACKED_BIT vectors --- .../bson-binary-vector/tests/packed_bit.json | 48 +++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/source/bson-binary-vector/tests/packed_bit.json b/source/bson-binary-vector/tests/packed_bit.json index a41cd593f5..035776e87f 100644 --- a/source/bson-binary-vector/tests/packed_bit.json +++ b/source/bson-binary-vector/tests/packed_bit.json @@ -2,6 +2,14 @@ "description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT", "test_key": "vector", "tests": [ + { + "description": "Padding specified with no vector data PACKED_BIT", + "valid": false, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 1 + }, { "description": "Simple Vector PACKED_BIT", "valid": true, @@ -44,6 +52,46 @@ "dtype_hex": "0x10", "dtype_alias": "PACKED_BIT", "padding": 0 + }, + { + "description": "Vector with float values PACKED_BIT", + "valid": false, + "vector": 
[127.5], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 + }, + { + "description": "Padding specified with no vector data PACKED_BIT", + "valid": false, + "vector": [], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 1 + }, + { + "description": "Exceeding maximum padding PACKED_BIT", + "valid": false, + "vector": [1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 8 + }, + { + "description": "Negative padding PACKED_BIT", + "valid": false, + "vector": [1], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": -1 + }, + { + "description": "Vector with float values PACKED_BIT", + "valid": false, + "vector": [127.5], + "dtype_hex": "0x10", + "dtype_alias": "PACKED_BIT", + "padding": 0 } ] } From 0b888fb631b860f2fc926c0574978b98bbbdc503 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Mon, 7 Oct 2024 14:38:43 +0100 Subject: [PATCH 24/30] Additional float32 binary vector test cases --- source/bson-binary-vector/tests/README.md | 1 + source/bson-binary-vector/tests/float32.json | 9 +++++++++ 2 files changed, 10 insertions(+) diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md index 05fed67a40..f8a433c011 100644 --- a/source/bson-binary-vector/tests/README.md +++ b/source/bson-binary-vector/tests/README.md @@ -43,6 +43,7 @@ To prove correct in a valid case (`valid: true`), one MUST those provided in the JSON. - encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the canonical_bson string. +- For floating point number types, numerical values need not match exactly. 
To prove correct in an invalid case (`valid:false`), one MUST diff --git a/source/bson-binary-vector/tests/float32.json b/source/bson-binary-vector/tests/float32.json index bbbe00b758..872c435323 100644 --- a/source/bson-binary-vector/tests/float32.json +++ b/source/bson-binary-vector/tests/float32.json @@ -11,6 +11,15 @@ "padding": 0, "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" }, + { + "description": "Vector with decimals and negative value FLOAT32", + "valid": true, + "vector": [127.7, -7.7], + "dtype_hex": "0x27", + "dtype_alias": "FLOAT32", + "padding": 0, + "canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000" + }, { "description": "Empty Vector FLOAT32", "valid": true, From c823174d279d31cecfa1c37bc2c122c025741a28 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 8 Oct 2024 14:14:49 +0100 Subject: [PATCH 25/30] Added API Guidance section --- .../bson-binary-vector/bson-binary-vector.md | 95 ++++++++++++++++++- 1 file changed, 94 insertions(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 090ff5e78f..ed9da8b91e 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -27,7 +27,7 @@ This specification introduces a new BSON binary subtype, the vector, with value Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. -### Data Types +### Data Types (dtypes) Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. 
@@ -118,6 +118,99 @@ Finally, after we remove the last 4 bits of padding, the actual bit vector has a | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +## API Guidance + +Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while +following idioms of the language of the driver. + +### Encoding + +``` +Function from_vector(vector: Iterable, dtype: DtypeEnum, padding: Integer = 0) -> Binary + # Converts a numeric vector into a binary representation based on the specified dtype and padding. + + # :param vector: A sequence or iterable of numbers (either float or int) + # :param dtype: Data type for binary conversion (from DtypeEnum) + # :param padding: Optional integer specifying how many bits to ignore in the final byte + # :return: A binary representation of the vector + + Declare binary_data as Binary + + # Process each number in vector and convert according to dtype + For each number in vector + binary_element = convert_to_binary(number, dtype) + binary_data.append(binary_element) + End For + + # Apply padding to the binary data if needed + If padding > 0 + apply_padding(binary_data, padding) + End If + + Return binary_data +End Function +``` + +### Decoding + +``` +Function as_vector() -> Vector + # Unpacks binary data (BSON or similar) into a Vector structure. + # This process involves extracting numeric values, the data type, and padding information. + + # :return: A BinaryVector containing the unpacked numeric values, dtype, and padding. 
+ + Declare binary_vector as BinaryVector # Struct to hold the unpacked data + + # Extract dtype (data type) from the binary data + binary_vector.dtype = extract_dtype_from_binary() + + # Extract padding from the binary data + binary_vector.padding = extract_padding_from_binary() + + # Unpack the actual numeric values from the binary data according to the dtype + binary_vector.data = unpack_numeric_values(binary_vector.dtype) + + Return binary_vector +End Function +``` + +#### Data Structures + +Drivers MAY find the following structures to represent the dtype and vector structure useful. + +``` +Enum Dtype + # Enum for data types (dtype) + + # FLOAT32: Represents packing of list of floats as float32 + # Value: 0x27 (hexadecimal byte value) + + # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8 + # Value: 0x03 (hexadecimal byte value) + + # PACKED_BIT: Special case where vector values are 0 or 1, packed as unsigned uint8 in range [0, 255] + # Packed into groups of 8 (a byte) + # Value: 0x10 (hexadecimal byte value) + + # Documentation: + # Each value is a byte (length of one), a convenient choice for decoding. 
+End Enum + +Struct Vector + # Numeric vector with metadata for binary interoperability + + # Fields: + # data: Sequence of numeric values (either float or int) + # dtype: Data type of vector (from enum BinaryVectorDtype) + # padding: Number of bits to ignore in the final byte for alignment + + data # Sequence of float or int + dtype # Type: DtypeEnum + padding # Integer: Number of padding bits + End Struct +``` + ## Reference Implementation - PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) From fcc1be5536304138c1e6f95a5a0c903ebbd740e6 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Wed, 16 Oct 2024 12:48:51 -0400 Subject: [PATCH 26/30] Change mention of binary subtype from x09 to 9 --- source/bson-binary-vector/bson-binary-vector.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index ed9da8b91e..8e0f838aea 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -23,7 +23,7 @@ The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SH ## Specification -This specification introduces a new BSON binary subtype, the vector, with value `"\x09"`. +This specification introduces a new BSON binary subtype, the vector, with value `9`. Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. @@ -50,7 +50,7 @@ final byte that are to be ignored. The least-significant bits are ignored. ### Binary structure -Following the binary subtype `\x09` a two-element byte array of metadata precedes the packed numbers. +Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers. - The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may increase. 
dtype is an unsigned integer. From d00541a454a88abc0f70ced2171bc9a9c248981c Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Wed, 16 Oct 2024 12:50:13 -0400 Subject: [PATCH 27/30] Remove github link to pymongo.binary. Reference implementation now simply references JIRA ticket. --- source/bson-binary-vector/bson-binary-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 8e0f838aea..8e75427948 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -213,7 +213,7 @@ Struct Vector ## Reference Implementation -- PYTHON (PYTHON-4577) [pymongo.binary](https://github.com/mongodb/mongo-python-driver/blob/master/bson/binary.py) +- PYTHON (PYTHON-4577) ## Test Plan From b30ed353c33c80c722bc3d6e8ad68fcb785db477 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 22 Oct 2024 09:26:51 -0400 Subject: [PATCH 28/30] Clarification of test requirements for drivers that natively support the floating-point type being tested --- source/bson-binary-vector/tests/README.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md index f8a433c011..bc96452678 100644 --- a/source/bson-binary-vector/tests/README.md +++ b/source/bson-binary-vector/tests/README.md @@ -37,15 +37,18 @@ Each JSON file contains three top-level keys. ## Required tests -To prove correct in a valid case (`valid: true`), one MUST +#### To prove correct in a valid case (`valid: true`), one MUST -- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match - those provided in the JSON. - encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the canonical_bson string. 
-- For floating point number types, numerical values need not match exactly. +- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match + those provided in the JSON. + +Note: For floating point number types, exact numerical matches may not be possible. Drivers that natively support the +floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), +MUST assert that the input float array is the same after encoding and decoding. -To prove correct in an invalid case (`valid:false`), one MUST +#### To prove correct in an invalid case (`valid:false`), one MUST - raise an exception when attempting to encode a document from the numeric values, dtype, and padding. From 0ccc39944d76c0fd0de7cc205641aebd49234015 Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 22 Oct 2024 14:47:57 -0400 Subject: [PATCH 29/30] Adds Validation subsection --- source/bson-binary-vector/bson-binary-vector.md | 13 +++++++++++++ source/bson-binary-vector/tests/README.md | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 8e75427948..3daec6fce9 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -151,6 +151,9 @@ Function from_vector(vector: Iterable, dtype: DtypeEnum, padding: Intege End Function ``` +This pseudococde is suggestive. If a driver chooses to implement a Vector type (or numerous) they MAY decide that +from_vector that has a single argument, a Vector. + ### Decoding ``` @@ -175,6 +178,16 @@ Function as_vector() -> Vector End Function ``` +#### Validation + +Drivers MUST validate vector metadata and raise an error if any invariant is violated: + +- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within \[0, 7\] for PACKED_BIT. 
+- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\]. + +Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking +binary data (BSON or similar) into a Vector structure. + #### Data Structures Drivers MAY find the following structures to represent the dtype and vector structure useful. diff --git a/source/bson-binary-vector/tests/README.md b/source/bson-binary-vector/tests/README.md index bc96452678..aa774cc022 100644 --- a/source/bson-binary-vector/tests/README.md +++ b/source/bson-binary-vector/tests/README.md @@ -44,7 +44,7 @@ Each JSON file contains three top-level keys. - decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match those provided in the JSON. -Note: For floating point number types, exact numerical matches may not be possible. Drivers that natively support the +Note: For floating point number types, exact numerical matches may not be possible. Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding. 
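The validation rules added in this patch, together with the `from_vector` pseudocode from the API Guidance section, can be sketched in Python roughly as follows. This is a minimal sketch, not the reference implementation: the names `Dtype` and `validate` are illustrative assumptions, and the function returns the raw subtype 9 payload (dtype byte, padding byte, packed values) as `bytes` rather than a driver's `Binary` type. It assumes little-endian packing, which is consistent with the canonical_bson strings in the test files.

```python
import struct
from enum import IntEnum


class Dtype(IntEnum):
    # Hypothetical enum name; the byte values come from the spec.
    INT8 = 0x03
    PACKED_BIT = 0x10
    FLOAT32 = 0x27


def validate(vector, dtype, padding):
    """Raise ValueError if the padding invariants are violated."""
    if dtype != Dtype.PACKED_BIT and padding != 0:
        raise ValueError("padding must be 0 unless dtype is PACKED_BIT")
    if not 0 <= padding <= 7:
        raise ValueError("padding must be within [0, 7]")
    if padding > 0 and len(vector) == 0:
        raise ValueError("an empty PACKED_BIT vector cannot have padding")


def from_vector(vector, dtype, padding=0):
    """Return the subtype 9 payload: dtype byte, padding byte, packed values."""
    vector = list(vector)
    dtype = Dtype(dtype)
    validate(vector, dtype, padding)
    if dtype == Dtype.FLOAT32:
        body = struct.pack(f"<{len(vector)}f", *vector)  # little-endian float32
    elif dtype == Dtype.INT8:
        body = struct.pack(f"<{len(vector)}b", *vector)  # signed bytes
    else:
        body = struct.pack(f"<{len(vector)}B", *vector)  # PACKED_BIT: unsigned bytes
    return bytes([dtype, padding]) + body
```

For example, `from_vector([127.0, 7.0], Dtype.FLOAT32).hex()` gives `27000000fe420000e040`, which matches the payload bytes of the "Simple Vector FLOAT32" canonical_bson. In this sketch, non-integer values passed for INT8 or PACKED_BIT surface as `struct.error` rather than `ValueError`; a driver would likely normalize that to its own exception type.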
From ae3242262c1e82f7730230d80b62cba449eca73e Mon Sep 17 00:00:00 2001 From: Casey Clements Date: Tue, 22 Oct 2024 17:14:53 -0400 Subject: [PATCH 30/30] Add note about signature of from_vector, that it be implemented as from_vector(Vector) -> Binary --- source/bson-binary-vector/bson-binary-vector.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/bson-binary-vector/bson-binary-vector.md b/source/bson-binary-vector/bson-binary-vector.md index 3daec6fce9..0d08ff6093 100644 --- a/source/bson-binary-vector/bson-binary-vector.md +++ b/source/bson-binary-vector/bson-binary-vector.md @@ -151,8 +151,8 @@ Function from_vector(vector: Iterable, dtype: DtypeEnum, padding: Intege End Function ``` -This pseudococde is suggestive. If a driver chooses to implement a Vector type (or numerous) they MAY decide that -from_vector that has a single argument, a Vector. +Note: If a driver chooses to implement a `Vector` type (or several), like the one suggested in the Data Structures +subsection below, it MAY define `from_vector` to take a single argument, a `Vector`. ### Decoding