- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.1k
[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 30 commits
245c869
              031cd8c
              8d4e8a2
              2df0d6b
              315a115
              d74314d
              27f13c8
              263f8c7
              f8bcdef
              5435785
              5c4d152
              f86d040
              adcb945
              7986cc5
              26b8398
              28de28a
              68235b8
              e2a1a3c
              bf9758a
              e1590aa
              c4c7af7
              de5a245
              43bcce4
              41ee0bb
              9d52aeb
              0d34464
              1d49656
              2af0ca4
              c93bae1
              f374b5a
              0db9866
              0532803
              3edeef6
              d199597
              abc7cd3
              4550c20
              99d44e1
              001636d
              637c474
              2d511f6
              17e1d33
              ce5f3e3
              edfe972
              8aaa2f6
              dfb322c
              8946daf
              9397129
              913403b
              File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|  | @@ -13,7 +13,10 @@ | |||
| # limitations under the License. | ||||
| from __future__ import annotations | ||||
|  | ||||
| from typing import TYPE_CHECKING, Any, Tuple, Type, Union | ||||
| import struct | ||||
| from dataclasses import dataclass | ||||
| from enum import Enum | ||||
| from typing import TYPE_CHECKING, Any, Optional, Sequence, Tuple, Type, Union | ||||
| from uuid import UUID | ||||
|  | ||||
| """Tools for representing BSON binary data. | ||||
|  | @@ -191,21 +194,76 @@ class UuidRepresentation: | |||
| """ | ||||
|  | ||||
|  | ||||
| VECTOR_SUBTYPE = 9 | ||||
| """BSON binary subtype for densely packed vector data. | ||||
|  | ||||
| .. versionadded:: 4.9 | ||||
| """ | ||||
|  | ||||
|  | ||||
| USER_DEFINED_SUBTYPE = 128 | ||||
| """BSON binary subtype for any user defined structure. | ||||
| """ | ||||
|  | ||||
|  | ||||
| class BinaryVectorDtype(Enum): | ||||
| """Datatypes of vector subtype. | ||||
|  | ||||
| :param FLOAT32: (0x27) Pack list of :class:`float` as float32 | ||||
| :param INT8: (0x03) Pack list of :class:`int` in [-128, 127] as signed int8 | ||||
| :param PACKED_BIT: (0x10) Pack list of :class:`int` in [0, 255] as unsigned uint8 | ||||
|  | ||||
| The `PACKED_BIT` value represents a special case where vector values themselves | ||||
| can only be of two values (0 or 1) but these are packed together into groups of 8, | ||||
| a byte. In Python, these are displayed as ints in range [0, 255] | ||||
|  | ||||
| Each value is of type bytes with a length of one. | ||||
|  | ||||
| .. versionadded:: 4.9 | ||||
| """ | ||||
|  | ||||
| INT8 = b"\x03" | ||||
| FLOAT32 = b"\x27" | ||||
| PACKED_BIT = b"\x10" | ||||
|  | ||||
|  | ||||
| # Map from bytes to enum value, for decoding. | ||||
| DTYPE_FROM_HEX = {key.value: key for key in BinaryVectorDtype} | ||||
|  | ||||
|  | ||||
| @dataclass | ||||
| class BinaryVector: | ||||
| """Vector of numbers along with metadata for binary interoperability. | ||||
|  | ||||
| :param data: Sequence of numbers representing the mathematical vector. | ||||
| :param dtype: The data type stored in binary | ||||
| :param padding: The number of bits in the final byte that are to be ignored | ||||
| when a vector element's size is less than a byte | ||||
| and the length of the vector is not a multiple of 8. | ||||
|  | ||||
| .. versionadded:: 4.9 | ||||
| """ | ||||
|  | ||||
| data: Sequence[float | int] | ||||
|          | ||||
| __slots__ = ("__time", "__inc") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could pass slots=True instead: https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange. And you're cool with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to add them manually instead of using slots=True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class BinaryVector:
    """**(BETA)** Vector of numbers along with metadata for binary interoperability.
    :param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
    :param dtype (:class:`bson.Binary.BinaryVectorDtype`):  The data type stored in binary
    :param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
      when a vector element's size is less than a byte
      and the length of the vector is not a multiple of 8. Default is 0.
    .. versionadded:: 4.10
    """
    __slots__ = ("data", "dtype", "padding")
    def __init__(self, data, dtype, padding=0):
        self.data = data
        self.dtype = dtype
        self.padding = padding
Is this right? @blink1073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't realize __slots__ with dataclass was problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be sorted now
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int] -> int, unless padding=None is considered valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
padding=None is expected when it is not a packed type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? I see the code using padding=0 in that case.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int] -> int
        
          
              
                  blink1073 marked this conversation as resolved.
              
          
            Show resolved
            Hide resolved
        
              
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[bool] -> bool unless uncompressed=None is valid.
        
          
              
                  blink1073 marked this conversation as resolved.
              
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to validate self.subtype == 9 here before attempting to decode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShaneHarvey How do I do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
if self.subtype != VECTOR_SUBTYPE:
   raise ValueError(...)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at as_uuid it should be the same as that validation:
        if self.subtype not in ALL_UUID_SUBTYPES:
            raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use assert for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.
        
          
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add an uncompressed option here at all? It looks like this option is irreversible because from_vector does not support the same option.  Even if it did, is this option useful outside of test code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we should leave out uncompressed, it was a quality of life addition, but is not likely to be standardized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going in, you'd be using something like numpy.packbits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's omit it since it adds unneeded complexity at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| { | ||
| "description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32", | ||
| "test_key": "vector", | ||
| "tests": [ | ||
| { | ||
| "description": "Simple Vector FLOAT32", | ||
| "valid": true, | ||
| "vector": [127.0, 7.0], | ||
| "dtype_hex": "0x27", | ||
| "dtype_alias": "FLOAT32", | ||
| "padding": 0, | ||
| "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" | ||
| }, | ||
| { | ||
| "description": "Empty Vector FLOAT32", | ||
| "valid": true, | ||
| "vector": [], | ||
| "dtype_hex": "0x27", | ||
| "dtype_alias": "FLOAT32", | ||
| "padding": 0, | ||
| "canonical_bson": "1400000005766563746F72000200000009270000" | ||
| }, | ||
| { | ||
| "description": "Infinity Vector FLOAT32", | ||
| "valid": true, | ||
| "vector": ["-inf", 0.0, "inf"], | ||
| "dtype_hex": "0x27", | ||
| "dtype_alias": "FLOAT32", | ||
| "padding": 0, | ||
| "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00" | ||
| }, | ||
| { | ||
| "description": "FLOAT32 with padding", | ||
| "valid": false, | ||
| "vector": [127.0, 7.0], | ||
| "dtype_hex": "0x27", | ||
| "dtype_alias": "FLOAT32", | ||
| "padding": 3 | ||
| } | ||
| ] | ||
| } | ||
|  | 
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| { | ||
| "description": "Tests of Binary subtype 9, Vectors, with dtype INT8", | ||
| "test_key": "vector", | ||
| "tests": [ | ||
| { | ||
| "description": "Simple Vector INT8", | ||
| "valid": true, | ||
| "vector": [127, 7], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 0, | ||
| "canonical_bson": "1600000005766563746F7200040000000903007F0700" | ||
| }, | ||
| { | ||
| "description": "Empty Vector INT8", | ||
| "valid": true, | ||
| "vector": [], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 0, | ||
| "canonical_bson": "1400000005766563746F72000200000009030000" | ||
| }, | ||
| { | ||
| "description": "Overflow Vector INT8", | ||
| "valid": false, | ||
| "vector": [128], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 0 | ||
| }, | ||
| { | ||
| "description": "Underflow Vector INT8", | ||
| "valid": false, | ||
| "vector": [-129], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 0 | ||
| }, | ||
| { | ||
| "description": "INT8 with padding", | ||
| "valid": false, | ||
| "vector": [127, 7], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 3 | ||
| }, | ||
| { | ||
| "description": "INT8 with float inputs", | ||
| "valid": false, | ||
| "vector": [127.77, 7.77], | ||
| "dtype_hex": "0x03", | ||
| "dtype_alias": "INT8", | ||
| "padding": 0 | ||
| } | ||
| ] | ||
| } | ||
|  | 
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| { | ||
| "description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT", | ||
| "test_key": "vector", | ||
| "tests": [ | ||
| { | ||
| "description": "Simple Vector PACKED_BIT", | ||
| "valid": true, | ||
| "vector": [127, 7], | ||
| "dtype_hex": "0x10", | ||
| "dtype_alias": "PACKED_BIT", | ||
| "padding": 0, | ||
| "canonical_bson": "1600000005766563746F7200040000000910007F0700" | ||
| }, | ||
| { | ||
| "description": "Empty Vector PACKED_BIT", | ||
| "valid": true, | ||
| "vector": [], | ||
| "dtype_hex": "0x10", | ||
| "dtype_alias": "PACKED_BIT", | ||
| "padding": 0, | ||
| "canonical_bson": "1400000005766563746F72000200000009100000" | ||
| }, | ||
| { | ||
| "description": "PACKED_BIT with padding", | ||
| "valid": true, | ||
| "vector": [127, 7], | ||
| "dtype_hex": "0x10", | ||
| "dtype_alias": "PACKED_BIT", | ||
| "padding": 3, | ||
| "canonical_bson": "1600000005766563746F7200040000000910037F0700" | ||
| }, | ||
| { | ||
| "description": "Overflow Vector PACKED_BIT", | ||
| "valid": false, | ||
| "vector": [256], | ||
| "dtype_hex": "0x10", | ||
| "dtype_alias": "PACKED_BIT", | ||
| "padding": 0 | ||
| }, | ||
| { | ||
| "description": "Underflow Vector PACKED_BIT", | ||
| "valid": false, | ||
| "vector": [-1], | ||
| "dtype_hex": "0x10", | ||
| "dtype_alias": "PACKED_BIT", | ||
| "padding": 0 | ||
| } | ||
| ] | ||
| } | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enum of bytes is a bit unusual. Can we change it to use ints? eg
INT8 = 3There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.