Add a note on serialized form compatibility #162
ZacBlanco wants to merge 3 commits into apache:master
According to the discussion in apache/datasketches-java#454, sketches should be able to be serialized and deserialized by any version of the library which supports the sketch.
* Sketches serialized from C++ or Python can be interpreted by compatible Java sketches and visa versa.
* All sketches have a serialized form which is able to be deserialized by any version of the library since the sketch was introduced.
Thank you for volunteering this! However, the statement is a bit too general. Because of differences between languages, there may be sketches in C++, for example, that don't exist in Java yet, or vice versa. So we need a little more precision in the wording. For example:
"All sketches have a serialized form which is able to be deserialized by the same or later version of the same sketch in the same language of the library since the sketch was introduced. Deserialization across languages and across time for compatible sketches is a bit trickier since there may be a time-lag between languages when a specific sketch was introduced."
After a colleague and I looked at this again, I think we can make it even more clear:
Current library versions can read serialized images of sketches from older library versions within a language. The current version of a serialized image is compatible across languages with the caveat that a new sketch may be introduced in one language before being ported to others. Sketches requiring user-written custom serialize/deserialize code rely on users to port that custom code themselves for cross-language compatibility.
Question on this:
Current library versions can read serialized images of sketches from older library versions within a language.
Can we broaden this, and say something like this (to show that the converse is also true)?
Current library versions can read serialized images of sketches from older and newer library versions within a language.
i.e. do we expect the serialization format to not change over time at all? For example, if we're using these sketches in a system that can save the sketches in a database, can we expect that systems that use newer versions of the library will always serialize the sketches in a format that can be deserialized by systems that use older versions of the library?
Chiming in to add some more motivation to why we want this clarified.
It's common in our world to have many different systems writing and querying data. E.g. Apache Spark, Flink, Presto, etc. Our concern is that some of these systems may use different versions of the datasketches library to generate serialized sketches. We just need to be aware of the guarantees that the library makes on the binary format so that we can guarantee compatibility to the best of our ability, not just on upgrades from version to version of one piece of software, but also so that each different system can potentially understand sketches generated by other systems.
“Prediction is very difficult, especially if it's about the future!” -- Niels Bohr.
I don't know of any software that guarantees forward compatibility forever, meaning that old code can always read structures created by future code. Even international standards bodies don't guarantee that. The Java language doesn't guarantee that with its class version IDs. Incompatible changes can occur for lots of reasons, including changes required for security reasons, obsolescence of language features, or new capabilities that were not imagined when the original code was created.
We recognize the challenge in large system environments, with different languages and different platforms all potentially using different versions of the software. And we are trying our best to provide capabilities that at least allow these large environments to interchange serialized sketches across languages and platforms efficiently. We are not aware of any other open-source sketch library that even provides this capability. Cross-version compatibility of software is a challenge that all platforms face in general. It is up to the platform maintainers to keep their software up-to-date, and this is not new and not different here.
Nonetheless, to put your mind somewhat at ease, realize that we have two levels of versioning in our library (this is true across all of our languages):
- Software Version: This is the release version, published via Apache and specified in the POM file or equivalent. It can change relatively frequently based on bug fixes and the introduction of new capabilities. Here, we try very hard to obey the principles of Semantic Versioning as specified by semver.org.
- Serialization Version (SerVer): This is a small integer placed in the preamble of the serialized byte array that indicates the version of the serialized structure for the sketch. A single SerVer may represent multiple structures, all based on the same sketch when stored in different states (e.g., Single Item, Compact, Updatable, etc.). This SerVer changes VERY rarely, if at all. Of all of our sketches, only 3 (Theta, KLL and Sampling) have more than one SerVer. There are and will be many Software Versions of the same sketch that still use the same SerVer. When we are forced (rarely) to update the SerVer, the Software Version that introduces the new SerVer also provides the ability to read the old SerVer and convert it to the new one. This is why our newest Software Versions can still read and interpret older SerVer serialized sketches that go back to when our project was started at Yahoo (2012), and before we went open-source (2015).
This means that as long as the SerVer is the same, older Software Versions should be able to read sketch images created by newer software versions. But the APIs may be different, obviously. An older SW version will not be able to take advantage of new features introduced in new SW versions, but it should be able to do what it did before. In other words, there will be no loss of access to the serialized sketch and its older SW version capabilities. As a user, you don't need to worry about or be able to access the SerVer. If a sketch is presented with a new SerVer that it is not compatible with, the sketch should throw an exception and say what the problem is, just like Java does.
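As a rough illustration of the two points above, here is a minimal Python sketch of a reader that dispatches on a serialization version byte. It is a toy, not the DataSketches API or its actual preamble layout: the byte layout, the `CURRENT_SER_VER` value, and the names `serialize`, `serialize_v1`, `deserialize`, and `UnsupportedSerVerError` are all hypothetical and chosen only for this example.

```python
import struct

# Hypothetical toy format (NOT the real DataSketches preamble):
#   SerVer 1: 1 byte SerVer, then a 4-byte little-endian unsigned count
#   SerVer 2: 1 byte SerVer, then an 8-byte little-endian unsigned count
CURRENT_SER_VER = 2


class UnsupportedSerVerError(ValueError):
    """Raised when an image carries a SerVer this reader does not know."""


def serialize(count: int) -> bytes:
    """Write the current (SerVer 2) image."""
    return struct.pack("<BQ", CURRENT_SER_VER, count)


def serialize_v1(count: int) -> bytes:
    """Write the legacy (SerVer 1) image, as older software would have."""
    return struct.pack("<BI", 1, count)


def deserialize(image: bytes) -> int:
    """Read any SerVer this reader knows; fail loudly on anything else."""
    if not image:
        raise ValueError("empty image")
    ser_ver = image[0]
    if ser_ver == 2:
        # Current layout: read directly.
        (count,) = struct.unpack_from("<Q", image, 1)
        return count
    if ser_ver == 1:
        # Legacy layout: read the narrower field and widen it, which is
        # the "read and convert old SerVer" step in this toy format.
        (count,) = struct.unpack_from("<I", image, 1)
        return count
    # Unknown (for example, newer) SerVer: report the problem, do not guess.
    raise UnsupportedSerVerError(
        f"SerVer {ser_ver} is not supported; this reader knows 1..{CURRENT_SER_VER}"
    )
```

In this toy, any number of software releases can share `serialize` and `deserialize` unchanged as long as the SerVer byte stays the same, so images flow in both directions between them. Only an image with an unknown SerVer is rejected, and it is rejected with an explicit error.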
Thank you for the thorough response @leerho. I agree with you that it's always difficult to predict the future; I just wanted to better understand the intentions of the datasketches maintainers when it comes to serialization compatibility. I think you've answered all of my questions though.
For this change, I think there are two parts to highlight: (1) compatibility across languages and (2) compatibility across datasketches versions. I've split your suggestions above across bullets pertaining to those points and tried to word them accordingly.
This statement should take into account that Java, C++, etc. are released on the same cadence, but that new sketches are not required to be ported to other languages. This nuance should be captured by the documentation.
Tried to be more clear about language and version compatibility by separating into 2 sections.
@ZacBlanco, Thanks again for your contribution!