Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add script to re-namespace .proto files for internal use in public libraries #207

Merged
merged 3 commits into from
Mar 1, 2023

Conversation

jvanstraten
Copy link
Contributor

This adds a script that can rewrite the protobuf files with a namespace prefix, such as arrow.substrait. When public libraries generate their own internal bindings and don't do this, users of said libraries can get nasty surprises from the protobuf library (segfaults, incomprehensible exceptions, or just wrong behavior) if there is even the slightest difference between different instances of substrait's message types, even if those differences are otherwise just non-breaking changes. I can't say I understand completely when or why protobuf's global message type registry is shared between libraries, but I'd say it's better to be safe than sorry and to strongly recommend that public libraries only define proto messages within their own namespace. AFAIK protoc can't do this on its own, so having this script makes it relatively easy to do so.

As an example of just how inane protobuf's error messages are, copypasting a Python exception @gforsyth got:

In [1]: import duckdb

In [2]: from ibis_substrait.compiler.core import SubstraitCompiler
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 from ibis_substrait.compiler.core import SubstraitCompiler

File .../lib/python3.10/site-packages/ibis_substrait/compiler/core.py:12, in <module>
      9 import ibis.expr.datatypes as dt
     10 import ibis.expr.types as ir
---> 12 from ..proto.substrait import algebra_pb2 as stalg
     13 from ..proto.substrait import plan_pb2 as stp
     14 from ..proto.substrait.extensions import extensions_pb2 as ste

File .../lib/python3.10/site-packages/ibis_substrait/proto/substrait/algebra_pb2.py:9, in <module>
      7 from google.protobuf import symbol_database as _symbol_database
      8 _sym_db = _symbol_database.Default()
----> 9 from google.protobuf import any_pb2 as google_dot_protobuf_dot_any__pb2
     10 from google.protobuf import empty_pb2 as google_dot_protobuf_dot_empty__pb2
     11 from ..substrait.extensions import extensions_pb2 as substrait_dot_extensions_dot_extensions__pb2

File .../lib/python3.10/site-packages/google/protobuf/any_pb2.py:21, in <module>
     12 _sym_db = _symbol_database.Default()
     17 DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x19google/protobuf/any.proto\x12\x0fgoogle.protobuf\"&\n\x03\x41ny\x12\x10\n\x08type_url\x18\x01 \x01(\t\x12\r\n\x05value\x18\x02 \x01(\x0c\x42v\n\x13\x63om.google.protobufB\x08\x41nyProtoP\x01Z,google.golang.org/protobuf/types/known/anypb\xa2\x02\x03GPB\xaa\x02\x1eGoogle.Protobuf.WellKnownTypesb\x06proto3')
---> 21 _ANY = DESCRIPTOR.message_types_by_name['Any']
     22 Any = _reflection.GeneratedProtocolMessageType('Any', (_message.Message,), {
     23   'DESCRIPTOR' : _ANY,
     24   '__module__' : 'google.protobuf.any_pb2'
     25   # @@protoc_insertion_point(class_scope:google.protobuf.Any)
     26   })
     27 _sym_db.RegisterMessage(Any)

KeyError: 'Any'

and apparently, swapping the import order "fixes" the error (I'm sure it's still broken, it just doesn't throw).

@westonpace
Copy link
Member

Hmm....the producer and consumer have to agree on the message content and for google.protobuf.Any I think this content includes the namespace. Do you think this approach will work in a situation where Any is used?

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main thought is, if this is useful can we do the following:

  • move it to a tools folder as opposed to being in the proto folder
  • add a pre-commit test that ensures that it runs (and proto/buf runs successfully and as expected afterwards).

It blows my mind that this is the best option for some languages...

@jvanstraten
Copy link
Contributor Author

jvanstraten commented May 24, 2022

Hmm....the producer and consumer have to agree on the message content and for google.protobuf.Any I think this content includes the namespace. Do you think this approach will work in a situation where Any is used?

Yeah, I realized after submitting this that, based on the error message, this is probably caused by different versions of the protobuf library itself in a single link. If that's the case, unfortunately there's bugger-all we can do about it other than recommend against using the official protobuf library...

Regardless of this particular error though, message type conflicts between different Substrait versions would give similar errors. Even if the conflict is between two completely different, incompatible Substrait versions, a user still shouldn't be getting errors from protobuf until they try to deserialize a plan generated by an incompatible version. The primary use case here being to support multiple incompatible versions of Substrait at once, or simply just linking a bunch of libraries that happen to link to Substrait under the hood without actually using Substrait.

It blows my mind that this is the best option for some languages...

Unless you mean protobuf itself, AFAIK this applies to every programming language, whenever the official protobuf C library is used under the hood. That includes at least C++ and Python; I wouldn't know about the other ones. I would love to be wrong about this, because it's the most egregious case of "my library is special and should therefore dictate your entire workflow" that I know of.

if this is useful can we do the following: [...]

Okay, will update tomorrow. Done

@jvanstraten jvanstraten force-pushed the proto-prefix-py branch 2 times, most recently from 8b2800d to a8b6fd6 Compare May 25, 2022 15:04
@jacques-n
Copy link
Contributor

@westonpace , you good on this change?

@westonpace
Copy link
Member

I'm not really for or against this change. I have no need to import multiple Substrait versions at the same time and, perhaps the validator needs that, but I'm not sure what other use cases there are.

I'm still concerned that this approach will not be able of handling google.protobuf.Any which encodes the namespace name as part of the message.

As for the general oddities of protobuf, I know we have encountered this with Flight. I think we've avoided this in Arrow with:

  • Don't externally expose (e.g. public header file / DLL export) any generated types.
  • Compile proto files as part of the project (e.g. don't have a substrait-cpp static/dynamic library)
    • In C++ this is also needed because the version of protoc must match, exactly (even down to the patch number), the runtime libprotobuf library.
  • Statically link compiled files

Even then, I think users can still run into trouble if they have different versions of Arrow, so I don't know that things are entirely solved.

Also, I've been seriously wanting to play around with nanopb to see if it helps with all of this.

@jvanstraten
Copy link
Contributor Author

I'm still concerned that this approach will not be able of handling google.protobuf.Any which encodes the namespace name as part of the message.

Ah, sorry, I think I misunderstood you before. To paraphrase, you're questioning whether a protobuf Any for, for instance, substrait.Plan would still work when someone re-namespaces substrait to e.g. arrow.substrait, I think? If you were to treat the URL part of an Any strictly as something of the form substrait.Plan or <url>/substrait.Plan I could see that being an issue, but it'd be easily worked around by using PackFrom and UnpackTo to specify the message type you want to convert to or from manually, or by first converting the type string. The re-namespaced bindings are supposed to be private, so the sole user should know full well how to do those things. (note: there was a lot of discussion about protobuf type URLs in the resolved discussion of #197, and some actual docs here)

Anyway, I'm personally also not 100% sure this is actually needed. I've tried to break libprotobuf here to confirm; it builds a dummy 0.2.0 producer lib, a dummy 0.3.0 consumer lib, and then links them together. However, of course it works fine in those laboratory conditions.

Honestly, I'm pretty clueless about what does and doesn't break libprotobuf at this point. All I know is that it's a royal PITA because, every single time I've used libprotobuf in practice, it found a way to break itself in creative ways practically the second I start feeling somewhat confident that I implemented things right for once. The only instance where protobuf has worked for me consistently was for the Rust part of the validator, i.e. when I wasn't using the official library. At this point I'm just trying to make sacrifices to the protobuf gods in the hopes of not being smote by them every other day.

I don't know. Maybe we should just close and/or leave this PR be until I or someone else finds a use case where this actually resolves a problem. I'm not sure if it did for @gforsyth; I'd imagine it'd take some refactoring to put it to use, and I think he worked around the problem by forcing Python protobuf to use the pure-python version of the library via the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python env var (because of course that's a thing).

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to put this is a request changes state since it isn't clear whether we should merge it. Let's keep in a PR for now and see what requests/demands we see. Since it is basically independent, people can use it even if it is not merged.

If people are using this, please comment here so we can keep track of demand.

@gforsyth
Copy link
Member

I'm going to bump this up for continued discussion.
I spent an entire day last week debugging a problem that this would've resolved (I think).

We updated the bundled version of substrait in ibis-substrait to 0.8 but the new message types weren't available in the TYPE descriptor pool. It turns out this was because duckdb==0.3.4 was being pulled in via a transitive dependency (and some less-than-optimal behavior from pip's dependency resolver) and because duckdb was imported before ibis-substrait, the older version of substrait that is bundled with duckdb 0.3.4 was the one that defined TYPE in the descriptor pool.

I can imagine a very reasonable scenario in which producer A has substrait version 0.8 and serializes a Plan that is simple (say TPC-H query 1) that makes use of message types that were all present in substrait version 0.3.
A consumer (in the same process) has substrait version 0.3 and they are unable to deserialize the message because of the newer types defined in the more recent substrait protos, despite those newer message types being unused.

We probably can't break google.protobuf.Any but if there's a viable workaround for that that will allow namespacing the .proto files I am a big +1 on that.

@jvanstraten jvanstraten changed the title Add script to re-namespace .proto files for internal use in public libraries feat: add script to re-namespace .proto files for internal use in public libraries Jul 26, 2022
@CLAassistant
Copy link

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@westonpace
Copy link
Member

@gforsyth what's the latest on this in the ibis-substrait world? I seem to recall we did end up breaking google.protobuf.Any but were able to work around it. What's your verdict on this script in light of your experiences there?

@gforsyth
Copy link
Member

gforsyth commented Dec 6, 2022

Hey @westonpace -- I'm pretty happy with it. I've had no further headaches around colliding protobuf definitions. I'm generally a +1 on packages namespacing their substrait protos and this script will do that for you.

@westonpace
Copy link
Member

@jvanstraten I think we should merge this in. Do you want to address the conflicts? If not, I think I can do it.

Adds a script that can rewrite the protobuf files with a namespace
prefix, such as arrow.substrait. When public libraries generate their
own internal bindings and don't do this, users of said libraries can
get nasty surprises from the protobuf library (segfaults,
incomprehensible exceptions, or just wrong behavior) if there is even
the slightest difference between different instances of the substrait
namespace, even if those differences are otherwise just non-breaking
changes.
@jvanstraten
Copy link
Contributor Author

@westonpace Should be ready to be reviewed & merged again.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n this will need your review to proceed since you put it in a request changes state. I am content to merge this and would like to start closing up some of these outstanding PRs. I think this sort of renaming has seen proven usage in both ibis-substrait and the validator at this point which is enough, in my mind, to at least tacitly recommend the approach by including it in a tools folder.

@westonpace
Copy link
Member

In this interest of closing this long outstanding issue I am going to bypass github's protections and merge. I believe we have satisfied the requirements in the governance policy given this is not a change in the spec so I think we're just being blocked by Github annoyance at this point.

@github-actions
Copy link

github-actions bot commented Mar 1, 2023

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@westonpace westonpace merged commit a6f24db into substrait-io:main Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants