Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

Closed
jjhuff opened this issue Sep 22, 2020 · 21 comments
Closed
Milestone

Comments

@jjhuff
Copy link

jjhuff commented Sep 22, 2020

What version of protobuf and what language are you using?
Version: 1.24.0 (golang)

What did you do?
We're using protojson to encode our internal protobufs for indexing in ElasticSearch. For interop with other users, we encode enums as strings. At some point, we added a new enum value, which has caused older golang consumers to fail with the following error:
invalid value for enum type: "CLASS_RESULTS_PRE_CLINICAL"

What did you expect to see?
We set DiscardUnknown in UnmarshalOptions, so I would have expected the value to be parsed as the default enum value.

What did you see instead?
invalid value for enum type: "CLASS_RESULTS_PRE_CLINICAL"

@dsnet
Copy link
Member

dsnet commented Sep 22, 2020

\cc @cybrcodr, do you happen to remember what the behavior for equivalent functionality in C++ is?

@dsnet dsnet changed the title protojson DiscardUnknown doesn't work for unknown enum strings encoding/protojson: DiscardUnknown doesn't work for unknown enum strings Sep 22, 2020
@dsnet
Copy link
Member

dsnet commented Sep 22, 2020

I didn't check C++, but I'm going to guess that it doesn't handle this case since there's not an obvious behavior to perform in the case of repeated fields.

@cybrcodr
Copy link
Contributor

Sorry for the late response. I just checked using libprotoc 3.13.0, and with JsonParseOptions.ignore_unknown_fields = true, C++ lib does ignore unknown enum string value.

For a singular field, it "ignores" the field, treating it as not set. For repeated fields, it "ignores" unknown string value by dropping it and hence changing the size of the repeated field.

When I added the DiscardUnknown option, I opted only for unknown fields.

For "undefined/unknown" integer enum values, C++ will treat it as an unknown field in proto2, but recognize it in proto3. Go does accept integer values that are not defined regardless of the DiscardUnknown option. I could be wrong on this, but I think treatment of integer enum values varies across different languages.

For unknown enum string values, it may seem like this may be common across language implementations though and hence I'm open to adding support for it.

Slight related, C++ also has the option for case_insensitive_enum_parsing which has the following warning:

WARNING: This option exists only to preserve legacy behavior. Avoid using this option.
If your enum needs to support different casing, consider using allow_alias instead.

While this is not the same as the original request, just want to voice out ahead that I don't think this is a feature we want to support.

@dsnet
Copy link
Member

dsnet commented Sep 22, 2020

"ignores" unknown string value by dropping it and hence changing the size of the repeated field.

Wow. That is surprising behavior, but okay.

@puellanivis
Copy link
Collaborator

puellanivis commented Sep 25, 2020

Yeah, I think it’s kind of hard to say what appropriate behavior should be here. Arguably, if you add an enum and a service/client gets that value and doesn’t know which integer value that corresponds to‌… I mean, setting it to the default value is probably not a good idea, unless everyone has followed good procedure and defined their default/zero-value to be ”INVALID VALUE”.

Imagine perhaps, something using googleapis/rpc/code.Code with strings, and a new code gets added, and a service returns that error value as an enum string, and then the client receives it, and goes “I don’t know what this value should be, so it’s just the zero-value” which is‌… OK.

I understand my example is contrived, but it’s just supposed to be an example of how treating the value as an unknown field, or a default value could cause semantics bugs in code.

P.S.: I think if a valid optional support were to be added, it should be explicitly to turn this behavior on, and not conflating unknown fields with unknown enum values. Users of the code should know exactly what behaviors they are enabling/disabling.

@jjhuff
Copy link
Author

jjhuff commented Oct 19, 2020

Decided to reread this issue since I bumped into the problem again this morning. Upon re-reading, I realized that the conversion has centered around repeated enum -- which is not actually my situation. I'm mostly looking for parity with C++ (and similar semantics as proto-encoding). It sound like there's support for that, so I'll do a PR today or tomorrow.

@dsnet
Copy link
Member

dsnet commented Oct 19, 2020

@cybrcodr mailed out a change for this already. I agree that something should be changed, but it's not clear what should be changed. It's known that C++'s approach is problematic, and sometimes even considered a bug. It's not clear that we should emulate them in this exact behavior.

@cybrcodr
Copy link
Contributor

The Language Guide (proto2 and proto3) have no mention on how to handle this for JSON. The section on JSON options simply mention about

Ignore unknown fields: Proto3 JSON parser should reject unknown fields by default but may provide an option to ignore unknown fields in parsing.

And hence we added DiscardUnknown option to ignore unknown fields, but not unknown enum values.

Btw, I just realized that the expectation on the original post here is to have the unknown value be treated as the default enum value. That is not what the C++ implementation does as I mentioned above, i.e. it does not replace the value with the default value, it really ignores the value by not using them. This may be OK for proto3 singular field since 0 is treated as the default enum value, but it is not the same in proto2, where there is no default enum value. And for repeated (or map) field, it drops the value. We have to account for consistency on how to deal with unknown enum values, and these appear in singular, repeated and map fields.

protocolbuffers/protobuf#6355 is a similar issue raised in the main protobuf project. Perhaps you may want to follow up on that issue on that project first or open another issue if you think it's unrelated.

@jjhuff
Copy link
Author

jjhuff commented Oct 19, 2020

Thanks @cybrcodr

Sorry about the confusion re 'ignored field' and 'default value'.

So, where are we leaving this? Try to get by-off from the main protobuf project? or go for parity with existing behavior as in #1208 (comment)

@cybrcodr
Copy link
Contributor

Yes, getting approval from the main protobuf project is always the right approach for getting consistency across the proto specification and across different languages.

@jjhuff
Copy link
Author

jjhuff commented Oct 19, 2020

Ok, I commented over there. I'm a bit worried that the issue is stale (mid-2019)...but at least 2019 was a better year that 2020.

Thanks all!

@dsnet dsnet added this to the unplanned milestone Mar 29, 2021
@noahdietz
Copy link

Hi folks, I've opened a CL to fix this. My team has engaged with the protobuf team to get this worked out across languages. The CL is in WIP until we have a concrete decision. Just wanted to indicate that there is some movement on this issue.

@cybrcodr
Copy link
Contributor

cybrcodr commented Sep 17, 2021

Hi @noahdietz,

Thanks for engaging the Protobuf team on this.

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide with regards to what the expected behavior should be across all languages for this issue.

I'm guessing the discussion was done just within the teams and not via https://groups.google.com/g/protobuf or some other channels?

@noahdietz
Copy link

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide

Definitely! I'll pass that along :)

I'm guessing the discussion was done just within the teams

Yeah, this is being communicated via direct contact.

@cybrcodr
Copy link
Contributor

@noahdietz

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide

Definitely! I'll pass that along :)

Sorry, I should be more clear on this. I'd like to request that the Protobuf Language Guide be updated with regards to this because the current behavior when deserializing enum strings already vary across language implementations. Changing it based on some internal discussion won't provide for a valid reference especially if other users have different expectations.

Thanks!

@vchudnov-g
Copy link

Hi! I'm on the same team as Noah and I've been the one engaging with the protobuf team on this issue, motivated by the immediate use case we're running into. I totally agree that the Language Guide should be updated once a decision is reached. One of the motivations driving the protobuf team's feedback is achieving more consistency across languages, so that is good news. I would expect there will a formal approval once our current, somewhat informal, discussions settle on an approach. Also happy to forward more info internally.

@cybrcodr
Copy link
Contributor

cybrcodr commented Sep 21, 2021

It would be a lot better if the discussion is being done externally, preferably on the Protobuf issue tracker. It'll at least allow for contributors and users of different implementations to provide feedback, or at the very least understand the rationale behind a decision. We do have active non-Googler contributors in the Go Protobuf open source project, and I'm sure they can provide valuable feedback. A solution for one immediate use case may present issues in other use cases. It'll also make reviewing the CL a lot easier if the reviewers have insight to the discussion.

Given the lack of external discussion, I'll go ahead and present my feedback on the proposed solution as being implemented in https://golang.org/cl/350469 here.

Even if this was only implemented with the DiscardUnknown option enabled, a JSON value with an unknown integer enum would result in a different deserialized value from a JSON value with an unknown string enum. This can be problematic.

https://developers.google.com/protocol-buffers/docs/proto3#enum

In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation.

While the current parser can accept integer value outside the specified range, it intentionally rejects unknown string value in order to avoid producing a different enum value.

A possible solution would be to present a different option for this special treatment of enum values outside of specified range regardless of whether the value is an integer or a string. This would keep the DiscardUnknown option as discarding unknown fields only.

While one may think we should simply just implement this in the Go implementation, we have learned through experience that even optional behavior should be specified in the Protobuf spec/guide for consistency as this does present issues if not.

@vchudnov-g
Copy link

Those are all good points. The current discussions are motivated by internal use cases, which is why they started internally. I would hope once we have good proposal(s), we can move the discussion to a public forum (but that may not be entirely under my control). As to the cross-language consistency, that is a major point that all parties seem to be agreeing on: the intent was not to submit the Go changes until we have a cross-language consistent plan (which may not even wind up being this one, after all the discussions are done).

@jhump
Copy link
Contributor

jhump commented Feb 17, 2023

BTW, new conformance test cases were added specifically around unknown enum values in the JSON format: protocolbuffers/protobuf#9534

gopherbot pushed a commit to protocolbuffers/protobuf-go that referenced this issue Feb 22, 2023
This updates all generated code to match the contents of the latest
v22.0 release of Protobuf.

This involved a couple of changes to the script that does the sync'ing:
  1. The new Protobuf version no longer includes autoconf configuration
     and instead requires using bazel to build things.
  2. The new Protobuf release does not have an artifact named
     "protobuf-all-${VERSION}.tar.gz", but the one named
     "protobuf-${VERSION}.tar.gz" has all of the sources and was
     sufficient for the regenerate.bash script to complete.

This change does NOT regenerate the protos related to benchmarks.
The Protobuf repo no longer includes benchmarks. The CL removing them
says they are superceded by google/fleetbench:
    protocolbuffers/protobuf@83c499d
But that project's proto benchmark files are very different:
    https://github.com/google/fleetbench/tree/main/fleetbench/proto
So I commented out those steps in the generation code since the benchmarks
will need some work to reconcile with fleetbench.

This code adds known failing tests for conformance. New test cases were
added in protocolbuffers/protobuf#9534, but the
Go protojson package does not behave according to the new tests. There
is an existing issue in GitHub about this:
    golang/protobuf#1208

Change-Id: Iad796ec7889bc2a74b58db5224facf850cd1a1cd
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/469255
Reviewed-by: Michael Stapelberg <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
@jhump
Copy link
Contributor

jhump commented Feb 22, 2023

These three of the new conformance tests in the main protobuf repo fail with Go's protojson implementation:
protocolbuffers/protobuf-go@bc1253a#diff-cadbc1c67a15bd6fc4334382feaa7a783f938482f783b9cb017c949ad26877a2R1-R3

I suspect updating protojson so these test cases pass will be enough to resolve and close this issue.

@tohyongcheng
Copy link

Seems like the conformance tests are now officialised. How can we expedite @cybrcodr 's PR ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants