Slash approval voters on approving invalid blocks - dynamically #635

eskimor · 2023-05-15T12:45:34Z

Currently we punt on slashing approval voters approving invalid blocks. This is sound from a gambler's ruin perspective, as the backers are the ones to be supposed to having skin in the game. We will always slash at least those backers.

While the security properties are maintained, there is a slight problem: Without slashing approval voters, there is no incentive left to actually do the validation. Approving without having checked is risk free, for the individual. Of course not for the network: Blindly approving puts the whole chain at risk, given that validators are heavily invested, they should have an interest in maintaining the network's security. Nevertheless there is a tragedy of the commons situation: "If it is just me not validating - I'll save a few resources and as long as everybody else is checking, everything is fine."

Right now this should be a very low risk as running modified source code is neither effort less nor risk free and also for now resources are mostly spent elsewhere. The actual validation is relatively cheap, nevertheless longer term this situation should be improved.

There are at least two possibilities:

Somehow let the validators prove that they did the validation.
Make approving no longer risk free - bring back the slash.

While 1 would be a more solid solution: It would just not be possible to not doing the validation without getting slashed/losing out on rewards. On the flip-side it is not even clear how this should work, there are some ideas how this could work, but any proposed solution is far from trivial and would add a lot of complexity.

Which leaves us with option 2. So, why did we remove that slash in the first place? The reason was an attack vector: If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators: Thus the candidate would be deemed invalid and the attacker got up to a third of the validators slashed 100%, by only controlling as little as two backers. It is worth mentioning, that this attack is not resolved by time disputes as time disputes only resolve disputes concluding for the candidate and charging backers is also no threat if they are willing to get slashed.

So why is this a problem? This would clearly be an unjustified slash, governance could refund them. True, but slashing also means:

Validators will get disabled.
Validators will be removed from the active set.

Both of these would gain an attacker an advantage: Knocking out up to a third of all validators. With byzantine assumptions the attacker already controls up to a third. So with this attack, he would control half the validators!

Proposed Solution

First, we have to understand that this attack is an amplification attack: By only controlling two validators, the attacker can knock out up to a third. This attack only makes any sense if the attacker is able to knock out more honest nodes than he is sacrificing malicious nodes!

But this can easily be avoided: If we changed the slashing strategy such that not each approval voter gets slashed 100% of their stash, but instead that 100% gets distributed among all approving approval checkers:

One approving approval checker: Gets slashed 100%.
Two approving approval checkers: Each gets slashed 50%.
10 approving approval checkers: Each gets slashed 10%.
...

Or simply: Individual slash per approval voter: 100%/k ... where k is the number of approving checkers.

What does this gain us? A lot:

We have a strong incentive to actually do the validation work, because approving blindly is still risking a 100% slash! We are as good as in the initial design!
We require two backers for a candidate getting backed. This means on each attack the attacker is losing 2 of his nodes, while only taking down 1 honest node on average: The attack no longer makes any sense!

Prerequisite: Good disabling strategy, which only fully disables nodes that got slashed 100%.

Tasks

Give feedback

Implement
Document reasoning in guide/code relevant reasoning also here.
Options

burdges · 2023-05-15T14:04:40Z

Yes, approval checkers receiving some lesser slash always made sense, but yeah %/k ensures this stays below the backer slash, and solves the issue. I think 50%/k or 10%/k work fine too, while also being robust if we ever lower the backer count to one.

eskimor · 2023-05-15T15:05:17Z

Yes, approval checkers receiving some lesser slash always made sense, but yeah %/k ensures this stays below the backer slash, and solves the issue. I think 50%/k or 10%/k work fine too, while also being robust if we ever lower the backer count to one.

We should be able to make this a function of the number of required backers.

burdges · 2023-05-15T19:24:55Z

Why bother if 50% / k works, no?

Sophia-Gold · 2023-05-16T23:06:22Z

I like this! 50% is probably marginally better than 100% if we also want to make it so this doesn't result in disabling.

eskimor · 2023-05-26T14:57:34Z

Why bother if 50% / k works, no?

Yeah. I just like to maximize the slash to whatever is possible in a security preserving way. Going with simply 50% is also a good option, especially when going with 100% would complicate things a lot - calculating from the number of required backers seems pretty easy though.

burdges · 2023-07-18T17:44:53Z

As an aside, we occasionally discuss not signing approval votes, which directly conflicts here. We'd need almost a complete graph before that works, which makes not signing hard anyways. If we've enough connectivity, then could invalid approval votes be punished? In disputes, we'd need votes about who voted approve, which adds some bitfield into dispute votes, and worse likely demands repeated votes. You could then slash whoever approval voted according to a 2/3rd majority. It'd be a mess but..

It's possible worse issues exist too, ala paritytech/polkadot#7482 (comment)

burdges · 2023-07-20T12:30:17Z

We'd need to give up on this slash if paritytech/polkadot#7482 lands I think, which brings various knock on effects.

ordian · 2023-07-26T14:26:34Z

Do we really need to distinguish approval checkers from explicit "for invalid" dispute votes? If not, that would make the implementation much easier. And I don't see why we would need to distinguish.

burdges · 2023-07-26T19:45:55Z

A dispute is a dispute. It shouldn't matter if it comes from an approval checker or another validator.

Overkillus · 2023-09-04T14:51:30Z

If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators

This combined with disabling for minor slashes could have very problematic consequences as attackers could force out a lot of honest nodes by sacrificing fewer of their own.

One solution to it was to implement a proportional disabling in case of minor offences (when there is already a lot of validators disabled above some threshold).

If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators: Thus the candidate would be deemed invalid and the attacker got up to a third of the validators slashed 100%, by only controlling as little as two backers. It is worth mentioning, that this attack is not resolved by time disputes as time disputes only resolve disputes concluding for the candidate and charging backers is also no threat if they are willing to get slashed.

I generally believe that we would already be safe from those attacks thanks to time disputes/time overrun changes. They should also cover cases against the candidate and fully prevent this attack scenario.

Example:

malicious backer makes a block that 2/3 will say is invalid due to timeout and 1/3 will say valid hoping he would slash and disable some honest nodes
some unlucky slow nodes open a dispute after they say invalid
in dispute everyone executes and we get reliable execution time statements (median of them is reliable at least)
if median is above approval checker timeout it definitely is significantly above backer timeout so we can assume backer is malicious
backer technically would win the dispute but due to the execution time manipulations he is forced to loose, block gets rejected and only the backer gets slashed and disabled

Question now is... is there any other problem with slashing valid on invalid? Was it the only reason why it was initially disabled? If there isn't anything else this should not strictly depend on proportional disabling.

eskimor · 2023-09-04T17:33:07Z

Yes, with time disputes we should already be safe when it comes to time nondeterminism. The problem is that PVF execution is complex and we do want to support multiple implementations (and versions of those) and different machine architectures. It seems impossible to "guarantee" realistically that there will never be an exploitable non-determinism. I would be very happy if we at least eliminated all the conceptual ones: Time disputes being one of them, native stack size limits might be solvable by something similar, but different implementations might always behave slightly different on edge cases - which could be exploited.

Therefore I would consider defense in depth here not just nice to have, but actually an important requirement to have secure parachain consensus under realistic conditions.

What could be an alternative to proportional disabling is to stop disabling any non 100% slashes (assuming non accumulating slashes here), once a certain threshold of disabled validators is reached. This seems fine and avoids nasty flip-flopping of disabled state. Accumulating slashes should then really be avoided though, because then validators could get kicked out longer term (too little stake at the next election).

burdges · 2023-09-04T21:39:40Z

once a certain threshold of disabled validators is reached

Do we need any disabling, except when 100% slashed? We should list the miss-behavior types to figure this out, but at least here I think the time overrun fees and 50% / k trick permits never disables.

Overkillus · 2023-09-04T21:44:09Z

We need either disabling for non 100% slashes or accumulating slashes to deter repeating minor offences. Both have roughly similar security properties and deter repeated attacks.

The benefit of disabling for minor offences is in cases of generally honest node operators that for some reason had a minor fault and they repeatedly and rapidly commit this minor offence. In slash accumulating this would drain their stake. In disabling this disables them and gives time to react with minimal losses.

Overkillus · 2023-09-04T21:47:21Z

We should list the miss-behavior types to figure this out

I actually started doing it recently and I think have a list of different misbehaviour types, how we deal with them now. I still need to check some equivocation cases Im not sure about and I'm listing some future changes like time overruns / time disputes to see the direction we are going and what we'll be covering going ahead.

burdges · 2023-09-05T00:19:03Z

We'd still have this 50% / k rule there, so the adversary loses more nodes than he takes down, so no security concern here, right? We've changed these details enough, not sure our story from the retreat, but really the details matter here enormously..

At some point, there were once four+ voting modes in time disputes, "valid with time" which could mean under or over time, and "invalid by reason", where one reason was really exhausting the clock. A 2/3rd "invalid" vote means 100% slashing, regardless of the reasons. If not 2/3rd invalid then we do not slash per se, but we charge backers by the overrun median time.

We admit disputes that concludes 2/3rd "invalid" while some honest nodes vote "valid", but we think those honest nodes claim times far beyond the backing timeout (or else we've some other execution bug).

Q1. Could we apportion the 50%/k slash based upon how slow the nodes votes?

If yes then approval checkers could always vote "valid but slow".
If no then dispute voters could always vote invalid due to timeout.

Q2. Can 2/3rd "valid but overtime" votes trigger reversion?

If no, then adversaries could've extremely long running valid blocks, which rivaling the invalidity timeout, so we land in the situation you're discussing, right? I think Q2 no but Q1 yes looks insecure.
If yes given a median overrun time >t, then adversaries could cause reversions without escalations, so the backers fees must cover reversion costs by time t. We avoid any sharp line though, so Q1 looks less critical.

At first blush Q1 no but Q2 yes looks most robust..

Co-authored-by: Alistair Singh <[email protected]>

@ordian

Initial implementation for the plan discussed here: #701 Built on top of #1178 v0: paritytech/polkadot#7554, ## Overall idea When approval-voting checks a candidate and is ready to advertise the approval, defer it in a per-relay chain block until we either have MAX_APPROVAL_COALESCE_COUNT candidates to sign or a candidate has stayed MAX_APPROVALS_COALESCE_TICKS in the queue, in both cases we sign what candidates we have available. This should allow us to reduce the number of approvals messages we have to create/send/verify. The parameters are configurable, so we should find some values that balance: - Security of the network: Delaying broadcasting of an approval shouldn't but the finality at risk and to make sure that never happens we won't delay sending a vote if we are past 2/3 from the no-show time. - Scalability of the network: MAX_APPROVAL_COALESCE_COUNT = 1 & MAX_APPROVALS_COALESCE_TICKS =0, is what we have now and we know from the measurements we did on versi, it bottlenecks approval-distribution/approval-voting when increase significantly the number of validators and parachains - Block storage: In case of disputes we have to import this votes on chain and that increase the necessary storage with MAX_APPROVAL_COALESCE_COUNT * CandidateHash per vote. Given that disputes are not the normal way of the network functioning and we will limit MAX_APPROVAL_COALESCE_COUNT in the single digits numbers, this should be good enough. Alternatively, we could try to create a better way to store this on-chain through indirection, if that's needed. ## Other fixes: - Fixed the fact that we were sending random assignments to non-validators, that was wrong because those won't do anything with it and they won't gossip it either because they do not have a grid topology set, so we would waste the random assignments. - Added metrics to be able to debug potential no-shows and mis-processing of approvals/assignments. ## TODO: - [x] Get feedback, that this is moving in the right direction. @ordian @sandreim @eskimor @burdges, let me know what you think. - [x] More and more testing. - [x] Test in versi. - [x] Make MAX_APPROVAL_COALESCE_COUNT & MAX_APPROVAL_COALESCE_WAIT_MILLIS a parachain host configuration. - [x] Make sure the backwards compatibility works correctly - [x] Make sure this direction is compatible with other streams of work: #635 & #742 - [x] Final versi burn-in before merging --------- Signed-off-by: Alexandru Gheorghe <[email protected]>

burdges · 2024-04-05T17:46:16Z

If we've trouble having access to k then I think 2% provides a sensible approximation to 50%/k.

We've 2% > 1.67% = 50%/needed_approvals and 2% < 2.8% = 100%/needed_approvals, with the adversary needing needed_approvals guys during an attack.

If an adversary can split an escalation optimally then we're slashing 2% * num_validators/3 = 0.67% of most stake, not so terrible. In this case, our real problem comes from disabling 1/3rd anyways.

eskimor mentioned this issue Aug 24, 2023

disputes: implement validator disabling #784

Open

18 tasks

alexggh mentioned this issue Jul 10, 2023

[DNM] approval-voting: approvals signature checks optimizations paritytech/polkadot#7482

Open

4 tasks

ordian self-assigned this Jul 24, 2023

burdges mentioned this issue Jul 24, 2023

Approval votes by hash inversion vs faster assignment VRFs #593

Open

alexggh mentioned this issue Aug 24, 2023

approval-voting: coalesce multiple approvals in a single signed message #701

Closed

alexggh mentioned this issue Jul 28, 2023

Approve multiple candidates with a single signature paritytech/polkadot#7554

Open

6 tasks

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

alexggh mentioned this issue Aug 27, 2023

Approve multiple candidates with a single signature #1191

Merged

7 tasks

Overkillus mentioned this issue Sep 6, 2023

High Level PVF Nondeterminism #1434

Open

burdges mentioned this issue Sep 6, 2023

Handling of disabled validators in backing subsystem #1259

Merged

tdimitrov mentioned this issue Oct 20, 2023

Disabling - feature PR #1955

Closed

4 tasks

eskimor mentioned this issue Nov 22, 2023

Retry timeout #2438

Closed

eskimor mentioned this issue Nov 30, 2023

Guide Update with all considerations #1961

Closed

claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023

Fix parsing of bitfields (paritytech#635)

e342ca3

Co-authored-by: Alistair Singh <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slash approval voters on approving invalid blocks - dynamically #635

Slash approval voters on approving invalid blocks - dynamically #635

eskimor commented May 15, 2023 •

edited

Loading

Tasks

burdges commented May 15, 2023

eskimor commented May 15, 2023

burdges commented May 15, 2023

Sophia-Gold commented May 16, 2023

eskimor commented May 26, 2023

burdges commented Jul 18, 2023 •

edited

Loading

burdges commented Jul 20, 2023 •

edited

Loading

ordian commented Jul 26, 2023

burdges commented Jul 26, 2023

Overkillus commented Sep 4, 2023 •

edited

Loading

eskimor commented Sep 4, 2023 •

edited

Loading

burdges commented Sep 4, 2023

Overkillus commented Sep 4, 2023

Overkillus commented Sep 4, 2023

burdges commented Sep 5, 2023

burdges commented Apr 5, 2024 •

edited

Loading

Slash approval voters on approving invalid blocks - dynamically #635

Slash approval voters on approving invalid blocks - dynamically #635

Comments

eskimor commented May 15, 2023 • edited Loading

Proposed Solution

Tasks

burdges commented May 15, 2023

eskimor commented May 15, 2023

burdges commented May 15, 2023

Sophia-Gold commented May 16, 2023

eskimor commented May 26, 2023

burdges commented Jul 18, 2023 • edited Loading

burdges commented Jul 20, 2023 • edited Loading

ordian commented Jul 26, 2023

burdges commented Jul 26, 2023

Overkillus commented Sep 4, 2023 • edited Loading

eskimor commented Sep 4, 2023 • edited Loading

burdges commented Sep 4, 2023

Overkillus commented Sep 4, 2023

Overkillus commented Sep 4, 2023

burdges commented Sep 5, 2023

burdges commented Apr 5, 2024 • edited Loading

eskimor commented May 15, 2023 •

edited

Loading

burdges commented Jul 18, 2023 •

edited

Loading

burdges commented Jul 20, 2023 •

edited

Loading

Overkillus commented Sep 4, 2023 •

edited

Loading

eskimor commented Sep 4, 2023 •

edited

Loading

burdges commented Apr 5, 2024 •

edited

Loading