Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented Feb 7, 2025

This is a nearly a subset of #1752, covering only the front matter: goals, non-goals, terms, and scope.

Minor changes from that PR:

  • "Bit strings are grouped with insignificant whitespace for readability." removed from Meta for now, as it's not needed without examples that use this convention.
  • Conversion APIs upgraded from MAY to SHOULD. I had written MAY for this previously because of the current level of driver support, but SHOULD is more consistent with our goals and avoids making any change.
  • Changed a description from "collection type" to "data format", for consistency with our specification scope.

Other minor notes:

DRIVERS-3031 is a set of clarifications instigated by an incorrect description of bit order. See the ticket, it's a deeper issue than a single description localized to one part of the spec. The fix for the order description necessitates clarifying which portions of the design are intended to be per-type vs. generalized. The original generalized description was incorrect, and I'd suggest we are better served by specifying characteristics of individual formats and avoiding unnecessary generalizations that affect future compatibility.

To analyze possible resolutions to DRIVERS-3031, it would help if the specification could document its own goals and scope. This PR offers new front matter, with very little text removed.

Specific details on removed text:

  • The previous Abstract didn't give a clear motivation for the serialization format itself, and the references to 3rd party libraries didn't serve a clear purpose. This spec doesn't exactly match data types from any of the listed libraries, and the commonalities are limited to those shared by the basic underlying data types. (IEEE float, two's complement)
  • "META" replaced with "Meta" for consistency with other specs.
  • Two sentences about scope were replaced with the new Scope section. The text "Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification." didn't provide clear meaning due to the ambiguous terminology "arrays of numbers".

This PR intends to write down all the relevant goals and background that I am aware of, and invite discussion to refine the list as needed.

To discuss:

There are still unresolved questions about our compatibility goals. Should we intentionally provide named reduced levels of support, like PACKED_BITS with padding=0 only? How do we intend to roll out updates that add additional formats?

Please complete the following before merging:

  • Update changelog.
  • [ ] Test changes in at least one language driver. No changes.
  • [ ] Test these changes against all server versions and topologies (including standalone, replica set, sharded
    clusters, and serverless).
    No changes.

@ghost ghost requested review from caseyclements, kevinAlbs and qingyang-hu February 7, 2025 22:33
@ghost ghost self-requested a review as a code owner February 7, 2025 22:33
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we intentionally provide named reduced levels of support, like PACKED_BITS with padding=0 only?

I expect only the three defined data types are supported. I am unsure if there is much value in noting cases like PACKED_BITS with padding=0 only.

How do we intend to roll out updates that add additional formats?

IIUC the data format was designed in WRITING-20326. I expect future updates might similarly be driven by other teams. I see the purpose of this specification as defining expected driver behavior. Deciding how to update to the data format seems out-of-scope of this spec.

Co-authored-by: Kevin Albertson <kevin.albertson@10gen.com>
@ghost
Copy link
Author

ghost commented Feb 10, 2025

I expect only the three defined data types are supported. I am unsure if there is much value in noting cases like PACKED_BITS with padding=0 only.
I expect future updates might similarly be driven by other teams. I see the purpose of this specification as defining expected driver behavior. Deciding how to update to the data format seems out-of-scope of this spec.

I'm fine with both of these points of view, it's a strict but low-complexity approach. This does imply that we might have some additional driver work to do in response to tests that fully exercise the formats described in the spec. (Making non-divisible-by-8 length packed_bits a fully supported type in all environments, and the additional strictness around unused bits/bytes.)

@ghost ghost requested a review from BorisDog February 10, 2025 22:08
@caseyclements
Copy link
Contributor

@mdbmes Thank you for splitting your original PR into pieces. This will make it easier, I think.

My first request is actually to add a sentence in a paragraph that you didn't edit. :) After "but it is represented in memory as a list of integers in [0, 255]." would you please add something like, "Thus, each number represents 8 values of zeros and ones"?

Copy link
Contributor

@caseyclements caseyclements left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only small changes requested.


This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors
here refer to densely packed arrays of numbers, all of the same type.
This document describes a new *Vector* subtype (9) for BSON Binary items, used to compactly represent ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please drop the 9 in this first sentence, or perhaps change to this?
"This document describes a new Vector subtype (9) of the BSON Binary type (5) ..."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

- Drivers SHOULD provide low-overhead APIs for producing and consuming Vector data in the closest compatible language
types, without conversions more expensive than copying or byte-swapping. These APIs are not standardized across
languages.
- Drivers SHOULD provide facilities for converting between BSON Binary Vector and BSON Array representations. When they
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python doesn't provide a facility to convert between BSON Arrays and Binary representations. We do nothing with BSON Arrays. (That said, everything mentioned in non-goals and motivation above is great.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here is that the Python reference implementation provides substantially similar functionality as far as users are concerned, since Python lists can be used as a close analog to BSON arrays. My preference in the spec and especially the spec tests would be to standardize conversion exactly around BSON arrays (avoiding Pythonisms) but to give individual drivers plenty of freedom to use the interfaces that make the most sense for their language.

As an example, the C implementation has no applicable native container types, so the array-to-vector and vector-to-array conversions make natural additions to the API to fill the same functionality niche that from_vector and as_vector handle in Python. In C, it's important to describe access and conversion distinctly because both are possible and useful.

If discussion reveals that this isn't actually as useful as I thought, I'm happy to downgrade it to MAY. I'd like to standardize these conversions if possible, since at the very least they seem like a natural way to implement language-independent tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the goal that it will provide more robust unified testing, then I am open to the idea. It would be good to get the implementers of each driver together to discuss unified testing, maybe early March?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ready for this as soon as possible, my current project is the C and C++ driver implementation and those PRs have a list of testing improvements that are being deferred pending spec work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value for our users for the BSON Array <-> BSON Binary Vector conversion?
Do we anticipate some common use cases for such conversions?

In C# we did not see the need for the direct conversion yet. BSON Array is mostly used as internal representation by the driver, the end user just uses native types. If needed, the conversion can be easily done via Binary Vector <-> Native types <-> BsonArray transformation. I suspect that would be the case for most drivers. And this is covered by the previous bullet point. ( "closest compatible language types").

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for sure needed as an internal feature for test support, right? The current tests use an ad-hoc format that is scheduled for replacement by DRIVERS-3095. If there's truly no use to users we could define these conversions as for testing only, but in C I'd expect these conversions to be useful for ingesting and serializing data, and in all languages I would have thought they would provide useful migration tools. In designing the C API, I wanted to avoid requiring additional intermediate steps in operations that could be defined as direct conversions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe because that in C# the conversions are pretty trivial, and that's why I struggle to see the value of this additional clause from C# pov, given the previous one.
BSON Array from/to BSON Vector conversion is just one liner in the test runner itself.

I am not opposed to this clause, just trying to understand whether it brings any additional value to other drivers, as it seems not applicable in C#.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdbmes What is the list of testing improvements?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this clause, just trying to understand whether it brings any additional value to other drivers, as it seems not applicable in C#.

Understood. Even if it's simple, my thought is that it's the domain we want to be specifying conversions in. (As this spec already depends on BSON semantics, but should be language independent)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the list of testing improvements?

See DRIVERS-3095, DRIVERS-3097, and the TODOs in mongodb/mongo-c-driver#1868

## Specification

This specification introduces a new BSON binary subtype, the vector, with value `9`.
### Scope
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your intention in adding this scope section of bullet points? Are all of them covered in this pared-down pull-request?

  • Define the interpretation of the data bytes in BSON Binary items of subtype 9.
  • Detail the first two data bytes that form a header.

...

  • ...consuming Vector data in the closest type compatible with the driver's language...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent was to agree on a scope for the specification. The specification should surely make an attempt to describe everything that's in scope, but this is my attempt to also describe the bounds of the scope. My earlier PR did make an attempt to align the content of the document with the claims here, but with the introduction alone it's an aspirational declaration that I'd hope would just let us get on the same page for the rest of this process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of getting onto the same page. I also want each commit on main to be self-standing. The full rewrite, which is basically what #1752 is, if needed, is going to take some time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the code review process is that we're aiming for incremental improvement, and each PR doesn't need to fully resolve all known outstanding problems. I estimated that declaring a scope improves the document even if the rest of the document is known to need further work.

@ghost
Copy link
Author

ghost commented Feb 20, 2025

@mdbmes Thank you for splitting your original PR into pieces. This will make it easier, I think.

My first request is actually to add a sentence in a paragraph that you didn't edit. :) After "but it is represented in memory as a list of integers in [0, 255]." would you please add something like, "Thus, each number represents 8 values of zeros and ones"?

Can we keep this PR on topic? Type descriptions have multiple issues, but I'd prefer to keep that discussion separate. I proposed changes to this section in #1752 which are being deferred so we can discuss piece by piece.

@ghost
Copy link
Author

ghost commented Mar 5, 2025

Extracting this PR from #1752 hasn't improved the review process, and there's no support for this change. Closing.

@ghost ghost closed this Mar 5, 2025
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants