encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

jjhuff · 2020-09-22T17:59:45Z

What version of protobuf and what language are you using?
Version: 1.24.0 (golang)

What did you do?
We're using protojson to encode our internal protobufs for indexing in ElasticSearch. For interop with other users, we encode enums as strings. At some point, we added a new enum value, which has caused older golang consumers to fail with the following error:
invalid value for enum type: "CLASS_RESULTS_PRE_CLINICAL"

What did you expect to see?
We set DiscardUnknown in UnmarshalOptions, so I would have expected the value to be parsed as the default enum value.

What did you see instead?
invalid value for enum type: "CLASS_RESULTS_PRE_CLINICAL"

The text was updated successfully, but these errors were encountered:

dsnet · 2020-09-22T18:10:11Z

\cc @cybrcodr, do you happen to remember what the behavior for equivalent functionality in C++ is?

dsnet · 2020-09-22T20:36:19Z

I didn't check C++, but I'm going to guess that it doesn't handle this case since there's not an obvious behavior to perform in the case of repeated fields.

cybrcodr · 2020-09-22T23:02:30Z

Sorry for the late response. I just checked using libprotoc 3.13.0, and with JsonParseOptions.ignore_unknown_fields = true, C++ lib does ignore unknown enum string value.

For a singular field, it "ignores" the field, treating it as not set. For repeated fields, it "ignores" unknown string value by dropping it and hence changing the size of the repeated field.

When I added the DiscardUnknown option, I opted only for unknown fields.

For "undefined/unknown" integer enum values, C++ will treat it as an unknown field in proto2, but recognize it in proto3. Go does accept integer values that are not defined regardless of the DiscardUnknown option. I could be wrong on this, but I think treatment of integer enum values varies across different languages.

For unknown enum string values, it may seem like this may be common across language implementations though and hence I'm open to adding support for it.

Slight related, C++ also has the option for case_insensitive_enum_parsing which has the following warning:

WARNING: This option exists only to preserve legacy behavior. Avoid using this option.
If your enum needs to support different casing, consider using allow_alias instead.

While this is not the same as the original request, just want to voice out ahead that I don't think this is a feature we want to support.

dsnet · 2020-09-22T23:27:06Z

"ignores" unknown string value by dropping it and hence changing the size of the repeated field.

Wow. That is surprising behavior, but okay.

puellanivis · 2020-09-25T09:51:57Z

Yeah, I think it’s kind of hard to say what appropriate behavior should be here. Arguably, if you add an enum and a service/client gets that value and doesn’t know which integer value that corresponds to‌… I mean, setting it to the default value is probably not a good idea, unless everyone has followed good procedure and defined their default/zero-value to be ”INVALID VALUE”.

Imagine perhaps, something using googleapis/rpc/code.Code with strings, and a new code gets added, and a service returns that error value as an enum string, and then the client receives it, and goes “I don’t know what this value should be, so it’s just the zero-value” which is‌… OK.

I understand my example is contrived, but it’s just supposed to be an example of how treating the value as an unknown field, or a default value could cause semantics bugs in code.

P.S.: I think if a valid optional support were to be added, it should be explicitly to turn this behavior on, and not conflating unknown fields with unknown enum values. Users of the code should know exactly what behaviors they are enabling/disabling.

jjhuff · 2020-10-19T16:29:44Z

Decided to reread this issue since I bumped into the problem again this morning. Upon re-reading, I realized that the conversion has centered around repeated enum -- which is not actually my situation. I'm mostly looking for parity with C++ (and similar semantics as proto-encoding). It sound like there's support for that, so I'll do a PR today or tomorrow.

dsnet · 2020-10-19T18:14:06Z

@cybrcodr mailed out a change for this already. I agree that something should be changed, but it's not clear what should be changed. It's known that C++'s approach is problematic, and sometimes even considered a bug. It's not clear that we should emulate them in this exact behavior.

cybrcodr · 2020-10-19T20:08:50Z

The Language Guide (proto2 and proto3) have no mention on how to handle this for JSON. The section on JSON options simply mention about

Ignore unknown fields: Proto3 JSON parser should reject unknown fields by default but may provide an option to ignore unknown fields in parsing.

And hence we added DiscardUnknown option to ignore unknown fields, but not unknown enum values.

Btw, I just realized that the expectation on the original post here is to have the unknown value be treated as the default enum value. That is not what the C++ implementation does as I mentioned above, i.e. it does not replace the value with the default value, it really ignores the value by not using them. This may be OK for proto3 singular field since 0 is treated as the default enum value, but it is not the same in proto2, where there is no default enum value. And for repeated (or map) field, it drops the value. We have to account for consistency on how to deal with unknown enum values, and these appear in singular, repeated and map fields.

protocolbuffers/protobuf#6355 is a similar issue raised in the main protobuf project. Perhaps you may want to follow up on that issue on that project first or open another issue if you think it's unrelated.

jjhuff · 2020-10-19T20:21:04Z

Thanks @cybrcodr

Sorry about the confusion re 'ignored field' and 'default value'.

So, where are we leaving this? Try to get by-off from the main protobuf project? or go for parity with existing behavior as in #1208 (comment)

cybrcodr · 2020-10-19T20:25:48Z

Yes, getting approval from the main protobuf project is always the right approach for getting consistency across the proto specification and across different languages.

jjhuff · 2020-10-19T20:31:08Z

Ok, I commented over there. I'm a bit worried that the issue is stale (mid-2019)...but at least 2019 was a better year that 2020.

Thanks all!

noahdietz · 2021-09-16T23:07:59Z

Hi folks, I've opened a CL to fix this. My team has engaged with the protobuf team to get this worked out across languages. The CL is in WIP until we have a concrete decision. Just wanted to indicate that there is some movement on this issue.

cybrcodr · 2021-09-17T02:22:13Z

Hi @noahdietz,

Thanks for engaging the Protobuf team on this.

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide with regards to what the expected behavior should be across all languages for this issue.

I'm guessing the discussion was done just within the teams and not via https://groups.google.com/g/protobuf or some other channels?

noahdietz · 2021-09-17T16:28:10Z

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide

Definitely! I'll pass that along :)

I'm guessing the discussion was done just within the teams

Yeah, this is being communicated via direct contact.

cybrcodr · 2021-09-18T05:22:56Z

@noahdietz

I'd appreciate it if your team or the Protobuf team can update the Protobuf Language Guide

Definitely! I'll pass that along :)

Sorry, I should be more clear on this. I'd like to request that the Protobuf Language Guide be updated with regards to this because the current behavior when deserializing enum strings already vary across language implementations. Changing it based on some internal discussion won't provide for a valid reference especially if other users have different expectations.

Thanks!

vchudnov-g · 2021-09-20T19:56:23Z

Hi! I'm on the same team as Noah and I've been the one engaging with the protobuf team on this issue, motivated by the immediate use case we're running into. I totally agree that the Language Guide should be updated once a decision is reached. One of the motivations driving the protobuf team's feedback is achieving more consistency across languages, so that is good news. I would expect there will a formal approval once our current, somewhat informal, discussions settle on an approach. Also happy to forward more info internally.

cybrcodr · 2021-09-21T00:44:25Z

It would be a lot better if the discussion is being done externally, preferably on the Protobuf issue tracker. It'll at least allow for contributors and users of different implementations to provide feedback, or at the very least understand the rationale behind a decision. We do have active non-Googler contributors in the Go Protobuf open source project, and I'm sure they can provide valuable feedback. A solution for one immediate use case may present issues in other use cases. It'll also make reviewing the CL a lot easier if the reviewers have insight to the discussion.

Given the lack of external discussion, I'll go ahead and present my feedback on the proposed solution as being implemented in https://golang.org/cl/350469 here.

Even if this was only implemented with the DiscardUnknown option enabled, a JSON value with an unknown integer enum would result in a different deserialized value from a JSON value with an unknown string enum. This can be problematic.

https://developers.google.com/protocol-buffers/docs/proto3#enum

In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation.

While the current parser can accept integer value outside the specified range, it intentionally rejects unknown string value in order to avoid producing a different enum value.

A possible solution would be to present a different option for this special treatment of enum values outside of specified range regardless of whether the value is an integer or a string. This would keep the DiscardUnknown option as discarding unknown fields only.

While one may think we should simply just implement this in the Go implementation, we have learned through experience that even optional behavior should be specified in the Protobuf spec/guide for consistency as this does present issues if not.

vchudnov-g · 2021-09-21T19:27:11Z

Those are all good points. The current discussions are motivated by internal use cases, which is why they started internally. I would hope once we have good proposal(s), we can move the discussion to a public forum (but that may not be entirely under my control). As to the cross-language consistency, that is a major point that all parties seem to be agreeing on: the intent was not to submit the Go changes until we have a cross-language consistent plan (which may not even wind up being this one, after all the discussions are done).

jhump · 2023-02-17T18:58:31Z

BTW, new conformance test cases were added specifically around unknown enum values in the JSON format: protocolbuffers/protobuf#9534

This updates all generated code to match the contents of the latest v22.0 release of Protobuf. This involved a couple of changes to the script that does the sync'ing: 1. The new Protobuf version no longer includes autoconf configuration and instead requires using bazel to build things. 2. The new Protobuf release does not have an artifact named "protobuf-all-${VERSION}.tar.gz", but the one named "protobuf-${VERSION}.tar.gz" has all of the sources and was sufficient for the regenerate.bash script to complete. This change does NOT regenerate the protos related to benchmarks. The Protobuf repo no longer includes benchmarks. The CL removing them says they are superceded by google/fleetbench: protocolbuffers/protobuf@83c499d But that project's proto benchmark files are very different: https://github.com/google/fleetbench/tree/main/fleetbench/proto So I commented out those steps in the generation code since the benchmarks will need some work to reconcile with fleetbench. This code adds known failing tests for conformance. New test cases were added in protocolbuffers/protobuf#9534, but the Go protojson package does not behave according to the new tests. There is an existing issue in GitHub about this: golang/protobuf#1208 Change-Id: Iad796ec7889bc2a74b58db5224facf850cd1a1cd Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/469255 Reviewed-by: Michael Stapelberg <[email protected]> Reviewed-by: Damien Neil <[email protected]>

jhump · 2023-02-22T15:12:08Z

These three of the new conformance tests in the main protobuf repo fail with Go's protojson implementation:
protocolbuffers/protobuf-go@bc1253a#diff-cadbc1c67a15bd6fc4334382feaa7a783f938482f783b9cb017c949ad26877a2R1-R3

I suspect updating protojson so these test cases pass will be enough to resolve and close this issue.

tohyongcheng · 2023-07-11T02:24:16Z

Seems like the conformance tests are now officialised. How can we expedite @cybrcodr 's PR ?

dsnet changed the title ~~protojson DiscardUnknown doesn't work for unknown enum strings~~ encoding/protojson: DiscardUnknown doesn't work for unknown enum strings Sep 22, 2020

dsnet added this to the unplanned milestone Mar 29, 2021

gopherbot closed this as completed in protocolbuffers/protobuf-go@70db1e1 Sep 8, 2023

djgilcrease mentioned this issue Aug 13, 2024

feat: make it so you can optionally have DiscardUnknown not apply to enums protocolbuffers/protobuf-go#46

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

jjhuff commented Sep 22, 2020

dsnet commented Sep 22, 2020

dsnet commented Sep 22, 2020

cybrcodr commented Sep 22, 2020

dsnet commented Sep 22, 2020

puellanivis commented Sep 25, 2020 •

edited

Loading

jjhuff commented Oct 19, 2020

dsnet commented Oct 19, 2020 •

edited

Loading

cybrcodr commented Oct 19, 2020

jjhuff commented Oct 19, 2020

cybrcodr commented Oct 19, 2020

jjhuff commented Oct 19, 2020

noahdietz commented Sep 16, 2021

cybrcodr commented Sep 17, 2021 •

edited

Loading

noahdietz commented Sep 17, 2021

cybrcodr commented Sep 18, 2021

vchudnov-g commented Sep 20, 2021

cybrcodr commented Sep 21, 2021 •

edited

Loading

vchudnov-g commented Sep 21, 2021

jhump commented Feb 17, 2023

jhump commented Feb 22, 2023

tohyongcheng commented Jul 11, 2023

encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

encoding/protojson: DiscardUnknown doesn't work for unknown enum strings #1208

Comments

jjhuff commented Sep 22, 2020

dsnet commented Sep 22, 2020

dsnet commented Sep 22, 2020

cybrcodr commented Sep 22, 2020

dsnet commented Sep 22, 2020

puellanivis commented Sep 25, 2020 • edited Loading

jjhuff commented Oct 19, 2020

dsnet commented Oct 19, 2020 • edited Loading

cybrcodr commented Oct 19, 2020

jjhuff commented Oct 19, 2020

cybrcodr commented Oct 19, 2020

jjhuff commented Oct 19, 2020

noahdietz commented Sep 16, 2021

cybrcodr commented Sep 17, 2021 • edited Loading

noahdietz commented Sep 17, 2021

cybrcodr commented Sep 18, 2021

vchudnov-g commented Sep 20, 2021

cybrcodr commented Sep 21, 2021 • edited Loading

vchudnov-g commented Sep 21, 2021

jhump commented Feb 17, 2023

jhump commented Feb 22, 2023

tohyongcheng commented Jul 11, 2023

puellanivis commented Sep 25, 2020 •

edited

Loading

dsnet commented Oct 19, 2020 •

edited

Loading

cybrcodr commented Sep 17, 2021 •

edited

Loading

cybrcodr commented Sep 21, 2021 •

edited

Loading