Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bigquery: infer schema from protoc generated struct #1119

Open
simpajj opened this issue Aug 24, 2018 · 17 comments
Open

bigquery: infer schema from protoc generated struct #1119

simpajj opened this issue Aug 24, 2018 · 17 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@simpajj
Copy link

simpajj commented Aug 24, 2018

Client

BigQuery

Describe Your Environment

macOS 10.13.6

Expected Behavior

bigquery.InferSchema() works for structs generated using protoc.

Actual Behavior

googleapi: Error 400: Field EventTime.XXX_NoUnkeyedLiteral is type RECORD but has no schema, invalid

The protobuf compiler generates the following field when generating a Go struct:

XXX_NoUnkeyedLiteral struct{}             `json:"-"`

This field prevents me from inferring the BigQuery schema from the struct. It would be nice to be able to infer the schema from structs generated by the protobuf compiler. Otherwise I'd have to manually specify the schema, which means that I also need to keep it in sync with any changes made to the protobuf message.

@jeanbza
Copy link
Contributor

jeanbza commented Aug 24, 2018

Out of curiousity, what version of protoc are you using, and are you using the latest protoc-go-gen?

@jeanbza jeanbza added the needs more info This issue needs more information from the customer to proceed. label Aug 24, 2018
@simpajj
Copy link
Author

simpajj commented Aug 24, 2018

Oh, sorry. I'm using proto3 with libprotoc 3.6.0 and the latest version of protoc-go-gen (1.2.0 I suppose, since go get will fetch the latest version or master even?).

@jeanbza jeanbza added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed needs more info This issue needs more information from the customer to proceed. labels Aug 24, 2018
@jeanbza jeanbza self-assigned this Aug 24, 2018
@jeanbza
Copy link
Contributor

jeanbza commented Aug 24, 2018

@pongad Any idea what's up with this?

@jeanbza jeanbza added the api: bigquery Issues related to the BigQuery API. label Aug 30, 2018
@simpajj
Copy link
Author

simpajj commented Sep 3, 2018

Hi, any news on this?

@jeanbza
Copy link
Contributor

jeanbza commented Sep 4, 2018

Hi @simpajj - sorry, no news yet. We've had a bout of out of office activity these last two weeks. We're catching up on issues this week and next, and hope to get to this soon.

@simpajj
Copy link
Author

simpajj commented Sep 4, 2018

@jadekler Alright :) Thanks for the info!

@jba jba added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Sep 13, 2018
@jba
Copy link
Contributor

jba commented Sep 13, 2018

Unfortunately, InferSchema is working as it should. Its contract is to turn the exported fields of a struct into schema fields, and it's doing that. (Hence my change from bug to FR.)

We could add special logic that says if the struct implements proto.Message, then ignore any fields beginning XXX_. @dsnet, do you think that's a reasonable heuristic? Any plans to add extra fields to proto messages that do not have that prefix?

@simpajj
Copy link
Author

simpajj commented Sep 14, 2018

That would certainly help us out, if it is deemed as a good enough heuristic.

I suppose another alternative would be for the golang protobuf compiler plugin to provide an option for skipping the generation of the XXX_ fields, similar to how gogo/protobuf does it with protoc-gen-gogofaster. However, this might be the wrong place for such a discussion.

@dsnet
Copy link
Contributor

dsnet commented Sep 20, 2018

In the future, the XXX fields will disappear entirely (golang/protobuf#276). Adding options to make them go away at the cost of reduced protobuf functionality is just a temporarily hack.

We could add special logic that says if the struct implements proto.Message, then ignore any fields beginning XXX_. @dsnet, do you think that's a reasonable heuristic? Any plans to add extra fields to proto messages that do not have that prefix?

Well, there are discussions for an "opaque" API that has no exported fields, where everything is accessed through getter/setter methods. Also, we plan on having official support for dynamic messages, which also would not have any exported fields.

It seems to me that if you wanted to specially treat protobuf messages, you should go all the way and support them fully. With the upcoming v2 API, you can iterate over all the protobuf fields and derive your schema from that. This approach seems to be a lot more robust.

@jba
Copy link
Contributor

jba commented Sep 20, 2018

SGTM. We will wait for proto API v2, then use that to implement this feature.

@jeanbza jeanbza assigned enocom and unassigned jeanbza Oct 29, 2018
@enocom
Copy link
Member

enocom commented Oct 29, 2018

@dsnet Are there any plans for a stable release of the v2 API in the near future?

@dsnet
Copy link
Contributor

dsnet commented Oct 29, 2018

We're aiming for a stable v2 release hopefully at end of Q1 of 2019.

@enocom
Copy link
Member

enocom commented Oct 29, 2018

Thanks. We eagerly await the new API and will wait on this issue until its release.

@enocom enocom removed the priority: p2 Moderately-important priority. Fix may not be included in next release. label Oct 29, 2018
@enocom enocom removed their assignment Nov 2, 2018
@odeke-em odeke-em changed the title BigQuery infer schema from protoc generated struct bigquery: infer schema from protoc generated struct Jul 14, 2019
@jeanbza
Copy link
Contributor

jeanbza commented Oct 10, 2019

cc @shollyman

@shollyman
Copy link
Contributor

Looks like we're still not yet there on the proto v2 work, https://github.com/golang/protobuf/issues is tracking a variety of blocking issues.

@odsod
Copy link

odsod commented Apr 19, 2020

@shollyman @jadekler @jba Apologies for the broad ping, just want to check in on this issue!

Now that the v2 API is out and the XXX fields are gone, it would be absolutely superb if the schema inference understood well-known-types (timestamps most of all) but possibly also the google.type messages from https://github.com/googleapis/googleapis.

At my current shop we're using a hand-rolled protoc plugin to generate BigQuery conversion functions that correctly handle types like google.protobuf.Timestamp and google.type.Date. I believe if support for these types were added to the core BigQuery library, we could do away with this plugin altogether.

From a wider ecosystem perspective, I think supporting the google.protobuf and google.type messages would also promote usage of protobuf in general, since long-term warehousing of protobuf messages in BigQuery would become next to trivial. I can only assume this is how it already works inside Google with Dremel?

Curious to hear your thoughts here!

@odsod
Copy link

odsod commented Dec 1, 2020

Update: My team have started putting together a BigQuery encoding for protobufs here, using the v2 reflection APIs:

https://github.com/einride/protobuf-bigquery-go

It's still under development and not complete, but we're already getting some mileage out of it.

I do think that functionality like this could live in the main bigquery package (all the dependencies are already there, and I think reading/writing protobufs to BigQuery is a common use case).

For example as a new ValueSaver implementation (to avoid breaking changes to existing code) that takes protobuf semantics into account (converting google.protobuf.Timestamp to TIMESTAMP, and so forth).

Until then, we're happy to accept pull requests on the encoding above, to make it more complete and robust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

7 participants