Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOLT 04: Add failure code for invalid payload. #627

Merged
merged 2 commits into from
Sep 3, 2019

Conversation

t-bast
Copy link
Collaborator

@t-bast t-bast commented Jul 8, 2019

The specification currently doesn't specify the case where the onion per-hop
payload can't be correctly decoded.

This is somewhat fine with the fixed frames because every field of the payload
can always be interpreted as a numeric value from the input bytes, so it leads
to application errors in upper layers when those values are actually
interpreted (and we realize that for instance we have an invalid
short_channel_id value).

With variable-length tlv streams in the onion payloads, we will encounter
decoding errors (duplicate tlv types, invalid ordering, etc) and the spec
should define the failure code to use in that case.

@t-bast t-bast requested a review from cdecker July 8, 2019 09:51
@@ -724,6 +724,12 @@ General permanent failure of the processing node.

The processing node has a required feature which was not in this onion.

1. type: BADONION|PERM (`invalid_onion_payload`)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I treat this as the default BADONION error, but maybe it would make more sense to assign it a specific bit (e.g. BADONION|PERM|1 or BADONION|PERM|7).
Let me know what you think would be best.

@t-bast t-bast changed the title Bolt04: Add failure code for invalid payload. BOLT 04: Add failure code for invalid payload. Jul 8, 2019
@t-bast t-bast force-pushed the b4-malformed-onion-failure-message branch from 490bfd2 to 9f8d44e Compare July 9, 2019 07:34
@t-bast t-bast added the Meeting Discussion Raise at next meeting label Jul 12, 2019
@t-bast
Copy link
Collaborator Author

t-bast commented Jul 12, 2019

I'm curious to get your feedback on this @cdecker @Roasbeef.
It feels to me that we really need to define this failure code otherwise our implementations will use different errors for that scenario.

@rustyrussell
Copy link
Collaborator

Hmm, we used to have a scheme where the offset of the error was in the error message, but I can't find it so maybe it was in my imagination! Would be nice to have, to offer a debugging hint as to what field you disliked.

This shouldn't be BADONION I think; that's for when we literally can't make a reply due to parsing errors. This is a normal error we can generate ourselves, so that bit probably shouldn't be used.

@niftynei niftynei removed the Meeting Discussion Raise at next meeting label Jul 22, 2019
@t-bast
Copy link
Collaborator Author

t-bast commented Jul 23, 2019

This shouldn't be BADONION I think

Got it, it makes sense to reserve BADONION for cryptographic failures and general onion support (onion version check).

Then what about using PERM | 22 (invalid_onion_payload)? I don't think it really makes sense to include the onion's sha256 since it wasn't a BADONION error.

Should I include some data in the error (should we define an enum for TLV error codes, containing things like UnknownEvenType, DuplicateType, InvalidOrdering, etc)?

We should note as well that another reason why this could fail is that onion payload might be missing some required information (amount, expiry or channel_id). We can't really provide a byte offset of the error in that case so I don't know if it's worth adding.

@t-bast
Copy link
Collaborator Author

t-bast commented Jul 31, 2019

Gentle ping @cfromknecht @rustyrussell, I'm curious to know what error code your implementation uses in that case? It would be nice to align before merging variable-onion to implementations' master branches.

@t-bast t-bast added the Meeting Discussion Raise at next meeting label Aug 5, 2019
@joostjager
Copy link
Collaborator

@t-bast is it right that this new message would be just for debugging? The downside of another message is that it may increase the node fingerprinting surface. Especially if the contents of it is not very strict. It may make it easier to probe the network and find out what implementations and (potentially vulnerable) versions are running.

@t-bast
Copy link
Collaborator Author

t-bast commented Aug 5, 2019

I thought it could be useful for debugging (especially at the beginning since we're introducing both tlv and variable-length onion to the network).
However, if you think this doesn't provide enough benefits, I can retract this and simply send a PERM error without any information (which is good to avoid finger-printing).

The specification currently doesn't specify the case where the onion per-hop
payload can't be correctly decoded.

This is somewhat fine with the fixed frames because every field of the payload
can always be interpreted as a numeric value from the input bytes, so it leads
to application errors in upper layers when those values are actually
interpreted (and we realize that for instance we have an invalid
short_channel_id` value).

With variable-length tlv streams in the onion payloads, we will encounter
decoding errors (duplicate tlv types, invalid ordering, etc) and the spec
should define the failure code to use in that case.
@t-bast t-bast force-pushed the b4-malformed-onion-failure-message branch from e8498b1 to 60c88fa Compare August 22, 2019 09:21
@joostjager
Copy link
Collaborator

One comment along a different axis:

Do we want to move to tlv for the failure message? If yes, we could now define the last failure code we ever need. Its value would be a tlv stream. Any new failure codes like the one proposed in this pr can then be defined as tlv types in that stream.

This also opens up the possibility to send back types from the custom range. It could be an option to use those types for (temporary) debugging if needed.

It addition to that, it could carry (custom) application-specific failures, like for example "unknown deposit address" (to be used in tandem with a custom record in UpdateAddHtlc).

@t-bast
Copy link
Collaborator Author

t-bast commented Aug 22, 2019

Do we want to move to tlv for the failure message?

Good feedback, I considered this too.
This is a bigger change though and can be done in many ways (and I expect a lot of debate about it) which is why I chose to do a smaller and simpler change to address a short-term whole in the spec.

However, I could use a TLV stream in the data part of invalid_onion_payload, that could make sense and pave the way for the future.

@joostjager
Copy link
Collaborator

If you then embed into the stream a 0-length type that signals invalid_onion_payload, we only have to rename it at a later stage to fail_generic :)

@t-bast
Copy link
Collaborator Author

t-bast commented Sep 3, 2019

Merging as decided in the IRC meeting: http://www.erisian.com.au/meetbot/lightning-dev/2019/lightning-dev.2019-09-02-20.06.log.html
Exploring the use of TLV in onion failure messages is deferred to a later PR.

@t-bast t-bast merged commit db92932 into master Sep 3, 2019
@t-bast t-bast deleted the b4-malformed-onion-failure-message branch September 3, 2019 06:54

The decrypted onion per-hop payload was not understood by the processing node
or is incomplete. If the failure can be narrowed down to a specific tlv type in
the payload, the erring node may include that `type` and its byte `offset` in
Copy link
Collaborator

@cfromknecht cfromknecht Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i missed the last spec meeting, so forgive me if i'm missing something, but atm this seems underspecified.

if we couldn't parse a record's type, what should type be? all possible values of varint correspond to valid types, so how can we distinguish this case?

also, in most cases i would think that offset is redundant information, as if the sender knows the invalid type they can inspect the payload they sent to figure out where the error is.

the main thing tho, is that it doesn't seem clear from this tho how the sender should choose between interpreting the offset or the type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the failure can be narrowed down to a specific tlv type

So if you can't parse a record's type, just leave the default value here (0).
That's where the offset might be useful: you can indicate the starting bit where your parser failed to even extract a valid type.

I don't think sender should dynamically interpret this and take an action based on this: this error code indicates a bug in one of the nodes in the route. This is only meant to be logged for further investigation.

Does that make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my two cents since I wasn't in attendance.

There are five classes of failures I can think of:

  1. stream not canonical
  2. unknown required type
  3. unable to decode field (wrong encoding by sender or receiver)
  4. field was omitted or included improperly
  5. unable to parse type

IMO 1 should never happen, we have unit tests and test vectors to ensure this, plus it would be easy to decode your own payload and verify that it's not canonical. Either way, we only need the type to figure out where we went wrong.

For 2, 3, and 4 we already know the type, and that should be enough for the sender to narrow down where the error occurred.

If you squint tho, 5 is really just a variant of 3. The reason a type would fail to parse is likely due to the preceding record being read short/long. The only reason it wouldn't is if our varint encoding is messed up, which also shouldn't happen since we have comprehensive test vectors for BigSize as well.

Is it not sufficient to send the previously parsed type (or 0 if it occurs on the first type)? The sender can look for the next type in the stream >= to that value that is where the failure occurred. This works even if the type being parsed is 0 or 1.

I don't see much value being added by the index, given that we can identify the record in all of these cases just from the type and the payload that the sender included in the onion. If we are sorely missing it, we can always extend the message later and include it.

That being said, I think we could better specify as in the above which types to send back under which failure conditions. Is that something people are interested in? It'd be nice to have some sort of agreement as what the receiver should place in event of each failure, otherwise everyone needs to be aware of the nuisances of how other implementations pass back these debug values. Just adding them for the sake of logging isn't compelling as they're likely to get lost, however we could have actionable responses if we have some standardization, e.g. try to locate a newer node announcement or explain to the user that the node might not support a certain feature.

On a higher level, I am somewhat concerned in general about including the type at all, as it will allow people to fingerprint the set of required types a node supports assuming they themselves are properly encoding their payloads. An empty failure message would be preferable imo until the privacy implications have been considered more thoroughly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consensus was honestly to not over-engineer this since the contents (type and offset are only optionally included in the error). This error shouldn't come up at all between valid implementations.
I agree that if your node supports required types that aren't in the spec that may allow probing that your node runs a specific version of a specific implementation. But feature bits likely let you do that probing as well so you're not learning anything new!

@@ -875,6 +875,16 @@ The channel from the processing node has been disabled.

The CLTV expiry in the HTLC is too far in the future.

1. type: PERM|22 (`invalid_onion_payload`)
2. data:
* [`varint`:`type`]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the first place in the wire protocol where we use this BigSize, is that intentional/what we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum yeah that's intentional since this is supposed to be the tlv type, can you detail why that wouldn't be what we want?


The decrypted onion per-hop payload was not understood by the processing node
or is incomplete. If the failure can be narrowed down to a specific tlv type in
the payload, the erring node may include that `type` and its byte `offset` in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my two cents since I wasn't in attendance.

There are five classes of failures I can think of:

  1. stream not canonical
  2. unknown required type
  3. unable to decode field (wrong encoding by sender or receiver)
  4. field was omitted or included improperly
  5. unable to parse type

IMO 1 should never happen, we have unit tests and test vectors to ensure this, plus it would be easy to decode your own payload and verify that it's not canonical. Either way, we only need the type to figure out where we went wrong.

For 2, 3, and 4 we already know the type, and that should be enough for the sender to narrow down where the error occurred.

If you squint tho, 5 is really just a variant of 3. The reason a type would fail to parse is likely due to the preceding record being read short/long. The only reason it wouldn't is if our varint encoding is messed up, which also shouldn't happen since we have comprehensive test vectors for BigSize as well.

Is it not sufficient to send the previously parsed type (or 0 if it occurs on the first type)? The sender can look for the next type in the stream >= to that value that is where the failure occurred. This works even if the type being parsed is 0 or 1.

I don't see much value being added by the index, given that we can identify the record in all of these cases just from the type and the payload that the sender included in the onion. If we are sorely missing it, we can always extend the message later and include it.

That being said, I think we could better specify as in the above which types to send back under which failure conditions. Is that something people are interested in? It'd be nice to have some sort of agreement as what the receiver should place in event of each failure, otherwise everyone needs to be aware of the nuisances of how other implementations pass back these debug values. Just adding them for the sake of logging isn't compelling as they're likely to get lost, however we could have actionable responses if we have some standardization, e.g. try to locate a newer node announcement or explain to the user that the node might not support a certain feature.

On a higher level, I am somewhat concerned in general about including the type at all, as it will allow people to fingerprint the set of required types a node supports assuming they themselves are properly encoding their payloads. An empty failure message would be preferable imo until the privacy implications have been considered more thoroughly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Meeting Discussion Raise at next meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants