algebird-serialization #119

sritchie · 2013-02-14T09:56:22Z

This module should contain thrift mappings for all Algebird case classes. Out of this we'll publish scrooge and java jars of these structs.

avibryant · 2013-06-06T04:45:58Z

@sritchie how do you see this working? I can see defining a bunch of thrift structs that mirror the existing algebird case classes, and making sure all the logic for working with them is moved into the corresponding monoids... or I guess I could see using implicits to make Rich* wrapper classes and put behavior there... but is there some fancier way to use Scrooge that would let it instantiate objects that we could actually define methods on directly?

sritchie · 2013-06-06T19:53:27Z

I don't think so, re: the fancy interactions with Scrooge. My idea here was to define the thrift mappings as you've described, then write Bijections between the case classes and thrift classes.

Using Scrooge and moving logic out of the case classes into the Monoids is a better idea :) That, or we forgo thrift and its versioning woes and go with the Bufferable solution.

avibryant · 2013-06-06T20:02:05Z

Bufferable makes me uneasy, to be honest; feels like a step backwards. Having a good separate spec for the structures and their serialization - that could even be used from other languages, in theory - would be nice. Having looked at it a bit, though: with Thrift we're going to run into problems with the lack of recursive types... we'd do a little better with protobuf (or Avro, probably).

sritchie · 2016-10-24T19:55:00Z

I'm going to wake this issue up - I think if we can get this right (and that probably does not involve using Thrift) we'll remove a good amount of pain from using these data structures in Summingbird.

sritchie · 2016-10-24T20:17:32Z

Looks like more discussion on this happened here as well: #463

isnotinvain · 2016-10-24T22:23:08Z

I'd just like to mention that we are probably better off using thrift / protobuf / avro / etc. than using our own serialization format. Not because we couldn't come up with a good format, but because ensuring that we never accidentally introduce an incompatible change to the format is hard, and these libraries do a pretty good job of giving you rules for what you can / can't do if you want to maintain long term compatibility.

That said, it's a burden on pretty much everyone, whichever one of the above we do pick. If your company is using avro and we pick protobuf, now you've got to use both avro and protobuf. I don't know if there's any alternatives that don't involve code generation libraries, but I think if we can use some sort of serialization library that thinks explicitly about compatibility over time that'd be helpful.

If not, then I thin we'll need some very strict tests around compatibility and we'll need to keep old serialized copies of these objects around for unit tests. I think it's difficult to get right unfortunately.

johnynek · 2016-10-24T22:26:02Z

the problem with picking something else is that then we have version hell, and not everyone uses that other thing.

If we just have a tested system with golden-data committed we can handle the conversion to-and-from bytes fairly easily.

The main thing is that we would then have tests to see if we changed anything.

If you want thrift, you all have it: just define an injection from algebird -> thrift and never change your thrift.

isnotinvain · 2016-10-24T22:56:44Z

It's not the correctness of the bijection, it's the correctness over time as the schema changes (if it changes). It's pretty easy to miss an unexpected corner case. And introducing even one incompatible change causes a lot of headache for users because they often don't find out until months later, when they realize a chunk of their old data isn't compatible anymore.

I think if we build our own serialization format, we'll need some very strict tests. We will also have to make sure that we check in new golden data every time we make a change to the relevant code. Or we can do what other libraries do and introduce versioned data formats, where we write what version something was serialized with into the output. There's definitely options but none of them are super straightforward.

Maybe something we could try is a shaded copy of protobuf / thrift / avro -- keep that detail entirely private but use their schema compatibility features.

sritchie · 2016-10-24T23:47:00Z

yes, the shaded copy of protobuf is what I was going to suggest for one approach. I do agree with @johnynek though that it could be hellish.

I think we should very slowly opt data structures into this system. We can be fairly strict - if we ever change the representation, we have to make a new data structure.

johnynek · 2016-10-24T23:48:34Z

The thing is, I don't think is solves the problem. You still need the same testing. You can just as easily break backwards compatibility by changing the shape of your protobuf.

isnotinvain · 2016-10-24T23:49:22Z

Yeah, but protobuf has itself been tested. You only need to test that you follow the right rules for changing the protobuf schema.

johnynek · 2016-10-24T23:50:25Z

yeah, but unless we trust humans to do that, we still need to build the golden data sets.

isnotinvain · 2016-10-24T23:50:30Z

libraires like Lucene use versioned serialization, and support reading / writing old versions in all newer versions. That's a little bit of a separate problem though.

johnynek · 2016-10-24T23:51:54Z

actually, I don't think serialization itself is that hard (at all). The challenge here is testing and not breaking.

People will make mistakes, so if we do this project, I don't see a ton of value unless we have strong tests on it.

isnotinvain · 2016-10-24T23:51:59Z

@johnynek you actually only need a test that the protobuf schema hasn't changed incorrectly. So instead of a golden data set, you just need a copy of the old schema

johnynek · 2016-10-24T23:52:24Z

if you build a schema migration checker.

isnotinvain · 2016-10-24T23:52:37Z

yes exactly

sritchie · 2016-10-24T23:53:24Z

Whatever we decide, I should state quickly my reason for waking this up!

I felt a few years ago, and I still feel now, that serialization is the primary blocker for folks leaning in to using these more advanced data structures. Adding Kryo was a HUGE help for Cascalog/Scalding.... unfortunately folks have to come back down to earth after their initial exploration and figure out how to store data The Right Way.

We should be providing really tight serializations for all of the data structures that we're all using in production.

The reason I don't really like using the existing frameworks is that it's hard to publish/share the schemas. Better to just rely on algebird to serialize/deserialize and maintain compatibility here. If you really want to store it your way, great! Write your own serializer. Totally fine. But we want all of the powerful data structures we've built to work out-of-the-box with excellent defaults when used with Summingbird / Scalding.

johnynek · 2016-10-24T23:54:20Z

I'm concerned the translation to an intermediate data structure will be a pretty high perf cost given that the real issue is data stability (which I think requires the golden data to guard against any accidental changes to the underlying schemas/library even in the other case).

MansurAshraf · 2017-04-26T01:09:46Z

I would like to wake this issue again as this is primarily the biggest impediment to introducing algebird to a larger audience at my current $work. Data scientists(read python programmers/matlab ninjas) hesitate to reach for some of the data structure as they inevitably run into serialization/persistence issue. One of the reason HLL is so widely used is because it comes with toByte/fromByte methods. Can we at least provide examples on how to serialize CMS QTree etc in any framework (thrift, avro etc) so that people can translate to their in house serialization framework? Right now they dont even know where to start.

johnynek · 2017-04-26T01:12:34Z

@MansurAshraf I think everyone likes the idea. I think we are all just trying to get the other one to do it. :)

MansurAshraf · 2017-04-26T01:13:52Z

@johnynek algebird-internal already has all the good stuff, we just need to convince @isnotinvain to open source it.

johnynek · 2017-04-26T01:14:37Z

yeah, or maybe @benpence @piyushnarang @sriramkrishnan etc...

isnotinvain · 2017-04-26T02:53:14Z

Open sourcing the algebird internal stuff should not be hard. I'm +1 to that.

On Tue, Apr 25, 2017 at 9:14 PM P. Oscar Boykin ***@***.***> wrote: yeah, or maybe @benpence <https://github.com/benpence> @piyushnarang <https://github.com/piyushnarang> @sriramkrishnan <https://github.com/sriramkrishnan> etc... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#119 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAWw6dMAc6hk_yD9fPof4nMJpTBNlxDNks5rzpp_gaJpZM4AbiuC> .

-- Sent from my phone

johnynek · 2017-04-26T03:27:35Z

Great! This could be something useful indeed. On Tue, Apr 25, 2017 at 4:53 PM Alex Levenson <[email protected]> wrote:

…

Open sourcing the algebird internal stuff should not be hard. I'm +1 to that. On Tue, Apr 25, 2017 at 9:14 PM P. Oscar Boykin ***@***.***> wrote: > yeah, or maybe @benpence <https://github.com/benpence> @piyushnarang > <https://github.com/piyushnarang> @sriramkrishnan > <https://github.com/sriramkrishnan> etc... > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#119 (comment)>, > or mute the thread > < https://github.com/notifications/unsubscribe-auth/AAWw6dMAc6hk_yD9fPof4nMJpTBNlxDNks5rzpp_gaJpZM4AbiuC > > . > -- Sent from my phone — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#119 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEJdobEkXPwbfqCosJDJThe2PYA5U9iks5rzrGcgaJpZM4AbiuC> .

MansurAshraf · 2017-04-27T20:09:57Z

Thanks @isnotinvain, any idea on timelines?

isnotinvain · 2017-04-28T16:14:04Z

I'm asking CDL team what they think (I switched teams), but I think we should be able to do this.

isnotinvain · 2017-04-28T16:20:24Z

There's also good discussion in this thread from before. We never really picked how we wanted to handle this. We can just make our thrift stuff available, but it's going to add a thrift dependency to algebird, isn't going to help folks that wanted to use protobuf or avro, and w/o a schema compatibility checker, it'd be good to have some more tests too. There was also talk of shading in this thread as well. Any thoughts on those issues?

MansurAshraf · 2017-04-28T18:26:05Z

my 2cents are that you guys should provide the existing thrift serializer as reference implementations so that you are not responsible for schema evolution etc. however this give other people a way to port it to their own binary format such as avro, proto etc. In my case we dont use thrift so I will have to implement the same serializer in Avro & Proto for different teams, but if I have a battle tested thrift implementation to based my work off of, i would be much more confident that i implemented the serializers correctly

Let me know what you think

isnotinvain · 2017-04-28T18:47:43Z

Yeah, that seems fine. Ideally I guess we'd have all three (thrift, avro, proto) in algebird sub-modules if / when they get contributed. (don't know if your $work will let you do that).

AnthonyTruchet · 2018-08-16T21:43:32Z

Hello all,

I'm working on a new project that uses CMS and I choose to seize this opportunity to introduce algebird in the project. I'm very impressed by this project and how powerful this is one the initial abstraction level is integrated.

I reckon I'm still having to level-up on the bijection stuff and how they relate to serialization though. Posting here after following up #463 and #172 as I need to serialize CMSInstance to protobuf (I could do with thrift too) so as to read them back from another language.

I'm willing to contribute any development upstream but for now I'm missing the right entry point on how leverage bijections to do that (instead of a hand written protbuf/thrift/... serializer deserializer).

Can you provide any example or guidance or preliminary design so that my contribution is aligned with what you have in mind ?

MansurAshraf · 2018-08-17T01:22:21Z

So this is the Algebird CMSInstance class

case class CMSInstance[K](
  countsTable: CMSInstance.CountsTable[K],
  override val totalCount: Long,
  override val params: CMSParams[K]) extends CMS[K](params)

1 - create an avro/thirft/proto schema that closely matches this case class
2 - use your favorite scala library to generate case class from your schema
3 - map CMSInstance to your generated case class
4 - serialize your case class using avro/proto/thrift etc. I think Bijection has serializers for thirft & avro. Not sure about protobuf

Things that will trip you:

As you can see that CMSInstance[K] has a type parameter K. This is used to implicitly inject a Hasher instance which makes serialization tricky. One way you can solve that is by fixing K to Array[Byte]. There is already a Hasher instance for Byte array and it has a contramap method that you can use to map from your type to array byte

AnthonyTruchet · 2018-08-17T16:30:31Z

Thank you for this quick answer. I did 1) but I used the std protobuf code generation for Java instead of 2). Then I’m doing 3) starting from the CMSHash, then CMSParams then… manually. And I’m wondering if I have to do that of if this tedious systematic process is what a proper use of Bijections should enable me to do. Would that be more automatic if I had generated a case classes based Scala API for my schema ? Anyway, I’m writing this piece of code so that it is decoupled from my specific use: do you have any interest in merging a example of CMS Serialization in the project ?

…

On Aug 16, 2018, at 6:22 PM, Mansur Ashraf ***@***.***> wrote: So this is the Algebird CMSInstance class case class CMSInstance[K]( countsTable: CMSInstance.CountsTable[K], override val totalCount: Long, override val params: CMSParams[K]) extends CMS[K](params) 1 - create an avro/thirft/proto schema that closely matches this case class 2 - use your favorite scala library to generate case class from your schema 3 - map CMSInstance to your generated case class 4 - serialize your case class using avro/proto/thrift etc. I think Bijection has serializers for thirft & avro. Not sure about protobuf Things that will trip you: As you can see that CMSInstance[K] has a type parameter K. This is used to implicitly inject a Hasher instance which makes serialization tricky. One way you can solve that is by fixing K to Array[Byte]. There is already a Hasher instance for Byte array and it has a contramap method that you can use to map from your type to array byte — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#119 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACHVE5uk_uJrJfpq7L6N_Mkl_426K56Yks5uRhrSgaJpZM4AbiuC>.

MansurAshraf · 2018-08-17T17:12:25Z

You can use tools like https://github.com/sksamuel/avro4s/ to generate schema from existing case classes. There maybe something like this for proto. As for accepting PR for CMS Serialization please ping Oscar Boyken On Fri, Aug 17, 2018 at 9:30 AM Anthony Truchet <[email protected]> wrote:

…

Thank you for this quick answer. I did 1) but I used the std protobuf code generation for Java instead of 2). Then I’m doing 3) starting from the CMSHash, then CMSParams then… manually. And I’m wondering if I have to do that of if this tedious systematic process is what a proper use of Bijections should enable me to do. Would that be more automatic if I had generated a case classes based Scala API for my schema ? Anyway, I’m writing this piece of code so that it is decoupled from my specific use: do you have any interest in merging a example of CMS Serialization in the project ? > On Aug 16, 2018, at 6:22 PM, Mansur Ashraf ***@***.***> wrote: > > So this is the Algebird CMSInstance class > > case class CMSInstance[K]( > countsTable: CMSInstance.CountsTable[K], > override val totalCount: Long, > override val params: CMSParams[K]) extends CMS[K](params) > 1 - create an avro/thirft/proto schema that closely matches this case class > 2 - use your favorite scala library to generate case class from your schema > 3 - map CMSInstance to your generated case class > 4 - serialize your case class using avro/proto/thrift etc. I think Bijection has serializers for thirft & avro. Not sure about protobuf > > Things that will trip you: > > As you can see that CMSInstance[K] has a type parameter K. This is used to implicitly inject a Hasher instance which makes serialization tricky. One way you can solve that is by fixing K to Array[Byte]. There is already a Hasher instance for Byte array and it has a contramap method that you can use to map from your type to array byte > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub < #119 (comment)>, or mute the thread < https://github.com/notifications/unsubscribe-auth/ACHVE5uk_uJrJfpq7L6N_Mkl_426K56Yks5uRhrSgaJpZM4AbiuC >. > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#119 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAgqaLozM8JHvUKSGTudXeaoJn5cBn2Oks5uRu-rgaJpZM4AbiuC> .

sritchie changed the title ~~algebird-thrift~~ algebird-serialization Oct 24, 2016

sritchie removed this from the 0.3.0 milestone Oct 24, 2016

sritchie mentioned this issue Oct 24, 2016

need serializers for bloomfilter, CMS, etc.... #172

Closed

sritchie mentioned this issue Oct 24, 2016

long-term serialization format for commonly used structures #463

Closed

MWYang mentioned this issue Dec 10, 2019

Unary estimator for detecting names and transforming to gender salesforce/TransmogrifAI#445

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

algebird-serialization #119

algebird-serialization #119

sritchie commented Feb 14, 2013

avibryant commented Jun 6, 2013

sritchie commented Jun 6, 2013

avibryant commented Jun 6, 2013

sritchie commented Oct 24, 2016

sritchie commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

sritchie commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

sritchie commented Oct 24, 2016

johnynek commented Oct 24, 2016

MansurAshraf commented Apr 26, 2017

johnynek commented Apr 26, 2017

MansurAshraf commented Apr 26, 2017

johnynek commented Apr 26, 2017

isnotinvain commented Apr 26, 2017 via email

johnynek commented Apr 26, 2017 via email

MansurAshraf commented Apr 27, 2017

isnotinvain commented Apr 28, 2017

isnotinvain commented Apr 28, 2017

MansurAshraf commented Apr 28, 2017

isnotinvain commented Apr 28, 2017

AnthonyTruchet commented Aug 16, 2018

MansurAshraf commented Aug 17, 2018

AnthonyTruchet commented Aug 17, 2018 via email

MansurAshraf commented Aug 17, 2018 via email

algebird-serialization #119

algebird-serialization #119

Comments

sritchie commented Feb 14, 2013

avibryant commented Jun 6, 2013

sritchie commented Jun 6, 2013

avibryant commented Jun 6, 2013

sritchie commented Oct 24, 2016

sritchie commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

sritchie commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

johnynek commented Oct 24, 2016

isnotinvain commented Oct 24, 2016

sritchie commented Oct 24, 2016

johnynek commented Oct 24, 2016

MansurAshraf commented Apr 26, 2017

johnynek commented Apr 26, 2017

MansurAshraf commented Apr 26, 2017

johnynek commented Apr 26, 2017

isnotinvain commented Apr 26, 2017 via email

johnynek commented Apr 26, 2017 via email

MansurAshraf commented Apr 27, 2017

isnotinvain commented Apr 28, 2017

isnotinvain commented Apr 28, 2017

MansurAshraf commented Apr 28, 2017

isnotinvain commented Apr 28, 2017

AnthonyTruchet commented Aug 16, 2018

MansurAshraf commented Aug 17, 2018

AnthonyTruchet commented Aug 17, 2018 via email

MansurAshraf commented Aug 17, 2018 via email