Add a note on serialized form compatibility #162
Closed
ZacBlanco wants to merge 3 commits into apache:master from ZacBlanco:upstream-serialization-versioning
Thank you for volunteering this! However, the statement is a bit too general. Because of differences between languages, there may be sketches in C++, for example, that don't exist in Java yet, or vice versa. So we need a little more precision in the wording. For example:
"All sketches have a serialized form which is able to be deserialized by the same or later version of the same sketch in the same language of the library since the sketch was introduced. Deserialization across languages and across time for compatible sketches is a bit trickier since there may be a time-lag between languages when a specific sketch was introduced."
After a colleague and I looked at this again, I think we can make it even more clear:
Question on this:
Can we broaden this, and say something like this (to show that the converse is also true)?
i.e. do we expect the serialization format to not change over time at all? For example, if we're using these sketches in a system that can save the sketches in a database, can we expect that systems that use newer versions of the library will always serialize the sketches in a format that can be deserialized by systems that use older versions of the library?
Chiming in to add some more motivation to why we want this clarified.
It's common in our world to have many different systems writing and querying data. E.g. Apache Spark, Flink, Presto, etc. Our concern is that some of these systems may use different versions of the datasketches library to generate serialized sketches. We just need to be aware of the guarantees that the library makes on the binary format so that we can guarantee compatibility to the best of our ability, not just on upgrades from version to version of one piece of software, but also so that each different system can potentially understand sketches generated by other systems.
“Prediction is very difficult, especially if it's about the future!” -- Niels Bohr.
I don't know of any software that guarantees forward compatibility forever, meaning that old code can always read structures created by future code. Even international standards bodies don't guarantee that. The Java language doesn't guarantee that with its class version IDs. Incompatible changes can occur for lots of reasons, including changes required for security, obsolescence of language features, or new capabilities that were not imagined when the original code was created.
We recognize the challenge in large system environments, with different languages and different platforms all potentially using different versions of the software. We are trying our best to provide capabilities that at least allow these large environments to interchange serialized sketches across languages and platforms efficiently. We are not aware of any other open-source sketch library that even provides this capability. Cross-version compatibility of software is a challenge that all platforms face in general. It is up to the platform maintainers to keep their software up to date, and this is not new and not different here.
Nonetheless, to put your mind somewhat at ease, realize that we have two levels of versioning in our library (this is true across all of our languages):
Software Version: This is the release version, published via Apache and specified in the POM file or equivalent. It can change relatively frequently based on bug fixes and the introduction of new capabilities. Here, we try very hard to obey the principles of Semantic Versioning as specified by semver.org.
Serialization Version (SerVer): This is a small integer placed in the preamble of the serialized byte array that indicates the version of the serialized structure for the sketch. A single SerVer may represent multiple structures all based on the same sketch when stored in different states (e.g., Single Item, Compact, Updatable, etc.). This SerVer changes VERY rarely, if at all. Of all of our sketches, only three (Theta, KLL, and Sampling) have more than one SerVer. There are and will be many Software Versions of the same sketch that still use the same SerVer. When we are forced (rarely) to update the SerVer, the Software Version that introduces the new SerVer is also able to read the old SerVer and convert it to the new one. This is why our newest Software Versions can still read and interpret older SerVer serialized sketches that go back to when our project was started at Yahoo (2012), before we went open-source (2015).
This means that as long as the SerVer is the same, older Software Versions should be able to read sketch images created by newer software versions. But the APIs may be different, obviously. An older SW version will not be able to take advantage of new features introduced in new SW versions, but it should be able to do what it did before. In other words, there will be no loss of access to the serialized sketch and its older SW version capabilities. As a user, you don't need to worry about or be able to access the SerVer. If a sketch is presented with a new SerVer that it is not compatible with, the sketch should throw an exception and say what the problem is, just like Java does.
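To make the SerVer behavior described above concrete, here is a minimal sketch in Python. It is not the DataSketches implementation: the one-byte preamble layout, the version numbers, and the names `serialize`, `deserialize`, and `UnsupportedSerVerError` are all hypothetical and chosen only for illustration. It shows a reader that accepts any image whose SerVer it knows, regardless of which Software Version wrote it, and raises a descriptive error otherwise.

```python
import struct

# Hypothetical values for illustration only; not the real DataSketches layout.
CURRENT_SER_VER = 3
SUPPORTED_SER_VERS = frozenset({1, 2, 3})


class UnsupportedSerVerError(ValueError):
    """Raised when an image carries a SerVer this reader does not know."""


def serialize(payload: bytes, ser_ver: int = CURRENT_SER_VER) -> bytes:
    # Assumed preamble: a single unsigned byte holding the SerVer,
    # followed directly by the sketch payload.
    return struct.pack("<B", ser_ver) + payload


def deserialize(image: bytes, supported=SUPPORTED_SER_VERS) -> bytes:
    if len(image) < 1:
        raise ValueError("image is too short to contain a preamble")
    (ser_ver,) = struct.unpack_from("<B", image, 0)
    if ser_ver not in supported:
        # Fail loudly and say what the problem is.
        raise UnsupportedSerVerError(
            f"cannot read SerVer {ser_ver}; "
            f"this reader supports {sorted(supported)}"
        )
    return image[1:]


# An older reader that only knows SerVers 1 and 2.
old_reader_supported = frozenset({1, 2})

# A newer Software Version that still writes SerVer 2 is readable.
image_same_ser_ver = serialize(b"sketch-bytes", ser_ver=2)
payload = deserialize(image_same_ser_ver, supported=old_reader_supported)

# An image written with a newer SerVer is rejected with a clear error.
image_new_ser_ver = serialize(b"sketch-bytes", ser_ver=3)
try:
    deserialize(image_new_ser_ver, supported=old_reader_supported)
    rejected = False
except UnsupportedSerVerError:
    rejected = True
```

In this sketch the Software Version never appears in the byte image at all, which is why readers and writers at different release versions can interoperate as long as they agree on the SerVer.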
Thank you for the thorough response @leerho. I agree with you that it's always difficult to predict the future; I just wanted to better understand the intentions of the datasketches maintainers when it comes to serialization compatibility. I think you've answered all my questions though.
For this change I think there are two parts to highlight: (1) compatibility across languages and (2) compatibility across datasketches versions. I've split your suggestions above across bullets pertaining to those points and tried to word them accordingly.