Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose ADR 027: Deterministic Protobuf Serialization #6979

Merged
merged 14 commits into from
Aug 19, 2020

Conversation

webmaster128
Copy link
Member

@webmaster128 webmaster128 commented Aug 7, 2020

Description

During the development of ARD-020 it became clear that we need a deterministic encoding of a SignDoc. The canonicalization from https://github.com/regen-network/canonical-proto3 was explicitely dropped in the process for various rasons. This is an attempt to get the necessary specification from those canonicaliztation rules while removing all decoding or JSON aspects. Motivation, implementation and a test vector were added.

I want to use this as a discussion base, which is probably more productive than long Github threads.
Also feel free to propose a better name it you don't like it.

If this spec is accepted, the change from #6949 is not necessary anymore. However, it would still be possible to do it and increase the chance that developers get something working quicker. So this is not an attempt to close PR #6949. As Aaron mentioned, it would be nice to get more people's view on the 0/1 question.

Kudos to @aaronc who brough big chunks of this document to the table.


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (see CONTRIBUTING.md)
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the module structure standards.
  • Wrote unit and integration tests
  • Updated relevant documentation (docs/) or specification (x/<module>/spec/)
  • Added relevant godoc comments.
  • Added a relevant changelog entry to the Unreleased section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer
  • Review Codecov Report in the comment section below once CI passes

@webmaster128
Copy link
Member Author

ping @ethanfrey

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @webmaster128 !

This is a really good write up.

I do want to include an explanation for why omitting default values is logical. It is not at all arbitrary and I suggested some text to explain.

Can we please find some other name besides Regencode ? It's interesting but also a little awkward 🤪 . Maybe canonical proto3 or proto3 CER?

docs/architecture/adr-026-protobuf-regencode.md Outdated Show resolved Hide resolved
Comment on lines 56 to 59
Omitting empty fields is a valid option because the parser must assign the
default value to fields missing in the serialization<sup>2</sup>. Requiring to
serialize them would be equally valid. No good reason is known to the author for
preferring one of those options over the other.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Omitting empty fields is a valid option because the parser must assign the
default value to fields missing in the serialization<sup>2</sup>. Requiring to
serialize them would be equally valid. No good reason is known to the author for
preferring one of those options over the other.
Omitting empty fields is the logical choice for a canonical protobuf serialization because it allows for some amount of forward compatibility. That means that users of newer versions of a protobuf schema will produce the same serialization as users of older versions if they do not set newer fields. Effectively omitting newer fields causes the newer clients to fall back to an older version of the protocol. This would not be possible if empty fields were serialized by default.

docs/architecture/adr-026-protobuf-regencode.md Outdated Show resolved Hide resolved
docs/architecture/adr-026-protobuf-regencode.md Outdated Show resolved Hide resolved
docs/architecture/adr-026-protobuf-regencode.md Outdated Show resolved Hide resolved
docs/architecture/adr-026-protobuf-regencode.md Outdated Show resolved Hide resolved
@webmaster128
Copy link
Member Author

webmaster128 commented Aug 7, 2020

Thank you @aaronc!

I do want to include an explanation for why omitting default values is logical. It is not at all arbitrary and I suggested some text to explain.

Thanks for the suggested explanation. I agree that you have a point there, which was not explained well any anything I read before. I don't fully agree to consider this path more "logical" or natural than the other one. The feature it brings was never a requirement. But I agree this is a plus. Will merge our texts accordingly.

Can we please find some other name besides Regencode ? It's interesting but also a little awkward 🤪 . Maybe canonical proto3 or proto3 CER?

haha, yes it is. See it as a way to express attribution. But let's find something better. I'll think about it …

@webmaster128
Copy link
Member Author

Maybe canonical proto3 or proto3 CER?

I would like to have something a little bit context-specific. We cannot encode all proto3 documents (and don't need to). Also the requirements are very specific to our space.

@webmaster128
Copy link
Member Author

Let's just forget about a name and call it ADR26, just like we have bip39 and slip0010.

@webmaster128 webmaster128 changed the title Propose ADR 026: Protocol Buffer Regencode Propose ADR 026: Protocol Buffer Deterministic Encoding Aug 7, 2020
@webmaster128 webmaster128 changed the title Propose ADR 026: Protocol Buffer Deterministic Encoding Propose ADR 026: Protocol Buffer Deterministic Serialization Aug 7, 2020
@webmaster128 webmaster128 changed the title Propose ADR 026: Protocol Buffer Deterministic Serialization Propose ADR 026: Deterministic Protobuf Serialization Aug 7, 2020
@webmaster128
Copy link
Member Author

@aaronc the last two commits address the reasoning behind omitting empty values and the name

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Thanks @webmaster128 🎉 !

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually one more small issue. There's an existing PR for ADR 026 (#6922). Let's make this 027.

Copy link
Contributor

@ethanfrey ethanfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Explicit with test vector and clear reasoning.

It codifies the same behavior we have in Go currently, so no sdk code changed needed.

One question, it only states that the SignDoc must be encoded via ADR026, but it is left unclear of the rest of the tx. I believe this MAY be ADR026, but also MAY be any other valid proto3 encoding. Is that correct @aaronc ?

@aaronc
Copy link
Member

aaronc commented Aug 7, 2020

One question, it only states that the SignDoc must be encoded via ADR026, but it is left unclear of the rest of the tx. I believe this MAY be ADR026, but also MAY be any other valid proto3 encoding. Is that correct @aaronc ?

Exactly. We have made efforts to make the surface area where determinism needed as small as possible. Everywhere else, the only requirement is a valid proto3 encoding. That has the side effect of reduced malleability.

I'll also note that unknown fields pretty much anywhere will cause the tx to get rejected. Unknown "non-critical" fields are allowed on TxBody only with SIGN_MODE_DIRECT. So if you have a weird proto impl that adds random fields would be problematic... But yeah in general any proto impl should work everywhere else.

@webmaster128
Copy link
Member Author

Thanks for pointing that out @aaronc. ADR 027 it is.

Thank you for the review, @ethanfrey. Typo fixed.

One question, it only states that the SignDoc must be encoded via ADR026, but it is left unclear of the rest of the tx. I believe this MAY be ADR026, but also MAY be any other valid proto3 encoding.

Right, all documents lower than SignDoc are embedded as raw bytes. Any valid protobuf serialization is fine for those.

@webmaster128 webmaster128 changed the title Propose ADR 026: Deterministic Protobuf Serialization Propose ADR 027: Deterministic Protobuf Serialization Aug 7, 2020
@aaronc aaronc added the T: ADR An issue or PR relating to an architectural decision record label Aug 7, 2020
Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I'm reading this https://gist.github.com/kchristidis/39c8b310fd9da43d515c4394c3cd9510, and I found 2 additions to would make this ADR more exhaustive.

  1. For packed repeated, we should only allow one "pack" (i.e. 1 length prefix, followed by all the elements in the array).
  2. it goes without saying, but it will go even better if we say it: we don't allow encoding the same field twice. (iiuc, proto3 does allow that).

Without these 2 rules, I can produce 2 different serializations of the same message that both adhere to all rules described in this ADR.

(edit: there might be other edge cases which produce undeterministic serializations, but I guess we can amend this ADR as we discover them)

@aaronc aaronc mentioned this pull request Aug 11, 2020
27 tasks
@aaronc
Copy link
Member

aaronc commented Aug 17, 2020

Looks good.

I'm reading this https://gist.github.com/kchristidis/39c8b310fd9da43d515c4394c3cd9510, and I found 2 additions to would make this ADR more exhaustive.

  1. For packed repeated, we should only allow one "pack" (i.e. 1 length prefix, followed by all the elements in the array).
  2. it goes without saying, but it will go even better if we say it: we don't allow encoding the same field twice. (iiuc, proto3 does allow that).

Without these 2 rules, I can produce 2 different serializations of the same message that both adhere to all rules described in this ADR.

(edit: there might be other edge cases which produce undeterministic serializations, but I guess we can amend this ADR as we discover them)

Is this still pending? Would be good to wrap this up and get it merged.

Can any other maintainers take a quick look at this @alexanderbez @alessio @jgimeno ?

@webmaster128
Copy link
Member Author

Is this still pending? Would be good to wrap this up and get it merged.

Yes it is. I did not yet get a chance to look into this, especially sice I don't know what a packed repeated is (yet). I'll try to get it done this week.

Thanks @amaurymartiny by the way for the valuable input.

@ethanfrey
Copy link
Contributor

ethanfrey commented Aug 18, 2020

Packed repeated is a compressed form of handling repeated fields when the data is scalar (int, float) not structs

In proto3, repeated fields of scalar numeric types use packed encoding by default.: https://developers.google.com/protocol-buffers/docs/proto3#specifying_field_rules

I think we can just refer to the default behavior in the spec if we need to mention it

@ethanfrey
Copy link
Contributor

it goes without saying, but it will go even better if we say it: we don't allow encoding the same field twice. (iiuc, proto3 does allow that).

I thought that was mentioned in the spec, but if not, definitely should be added

@aaronc
Copy link
Member

aaronc commented Aug 18, 2020

@amaurymartiny could you suggest some changes for @webmaster128 to address the packed repeated issue? Once we have that taken care of, I think we can move this forward.

@webmaster128
Copy link
Member Author

Okay, updated. There are 3 new commits:

  1. The additional rules suggested by @amaurymartiny: f389d31
  2. An additional rule regarding variant encoding: 0ad5600
  3. An emoji in the test vector ensuring UTF-8 encoding: 7299e7f

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

@aaronc aaronc added the A:automerge Automatically merge PR once all prerequisites pass. label Aug 19, 2020
Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, this looks good.

@mergify mergify bot merged commit 1b9f144 into cosmos:master Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. T: ADR An issue or PR relating to an architectural decision record
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants