Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slash approval voters on approving invalid blocks - dynamically #635

Open
2 tasks
eskimor opened this issue May 15, 2023 · 16 comments
Open
2 tasks

Slash approval voters on approving invalid blocks - dynamically #635

eskimor opened this issue May 15, 2023 · 16 comments
Assignees

Comments

@eskimor
Copy link
Member

eskimor commented May 15, 2023

Currently we punt on slashing approval voters approving invalid blocks. This is sound from a gambler's ruin perspective, as the backers are the ones to be supposed to having skin in the game. We will always slash at least those backers.

While the security properties are maintained, there is a slight problem: Without slashing approval voters, there is no incentive left to actually do the validation. Approving without having checked is risk free, for the individual. Of course not for the network: Blindly approving puts the whole chain at risk, given that validators are heavily invested, they should have an interest in maintaining the network's security. Nevertheless there is a tragedy of the commons situation: "If it is just me not validating - I'll save a few resources and as long as everybody else is checking, everything is fine."

Right now this should be a very low risk as running modified source code is neither effort less nor risk free and also for now resources are mostly spent elsewhere. The actual validation is relatively cheap, nevertheless longer term this situation should be improved.

There are at least two possibilities:

  1. Somehow let the validators prove that they did the validation.
  2. Make approving no longer risk free - bring back the slash.

While 1 would be a more solid solution: It would just not be possible to not doing the validation without getting slashed/losing out on rewards. On the flip-side it is not even clear how this should work, there are some ideas how this could work, but any proposed solution is far from trivial and would add a lot of complexity.

Which leaves us with option 2. So, why did we remove that slash in the first place? The reason was an attack vector: If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators: Thus the candidate would be deemed invalid and the attacker got up to a third of the validators slashed 100%, by only controlling as little as two backers. It is worth mentioning, that this attack is not resolved by time disputes as time disputes only resolve disputes concluding for the candidate and charging backers is also no threat if they are willing to get slashed.

So why is this a problem? This would clearly be an unjustified slash, governance could refund them. True, but slashing also means:

  1. Validators will get disabled.
  2. Validators will be removed from the active set.

Both of these would gain an attacker an advantage: Knocking out up to a third of all validators. With byzantine assumptions the attacker already controls up to a third. So with this attack, he would control half the validators!

Proposed Solution

First, we have to understand that this attack is an amplification attack: By only controlling two validators, the attacker can knock out up to a third. This attack only makes any sense if the attacker is able to knock out more honest nodes than he is sacrificing malicious nodes!

But this can easily be avoided: If we changed the slashing strategy such that not each approval voter gets slashed 100% of their stash, but instead that 100% gets distributed among all approving approval checkers:

  • One approving approval checker: Gets slashed 100%.
  • Two approving approval checkers: Each gets slashed 50%.
  • 10 approving approval checkers: Each gets slashed 10%.
  • ...

Or simply: Individual slash per approval voter: 100%/k ... where k is the number of approving checkers.

What does this gain us? A lot:

  1. We have a strong incentive to actually do the validation work, because approving blindly is still risking a 100% slash! We are as good as in the initial design!
  2. We require two backers for a candidate getting backed. This means on each attack the attacker is losing 2 of his nodes, while only taking down 1 honest node on average: The attack no longer makes any sense!

Prerequisite: Good disabling strategy, which only fully disables nodes that got slashed 100%.

Tasks

@burdges
Copy link

burdges commented May 15, 2023

Yes, approval checkers receiving some lesser slash always made sense, but yeah %/k ensures this stays below the backer slash, and solves the issue. I think 50%/k or 10%/k work fine too, while also being robust if we ever lower the backer count to one.

@eskimor
Copy link
Member Author

eskimor commented May 15, 2023

Yes, approval checkers receiving some lesser slash always made sense, but yeah %/k ensures this stays below the backer slash, and solves the issue. I think 50%/k or 10%/k work fine too, while also being robust if we ever lower the backer count to one.

We should be able to make this a function of the number of required backers.

@burdges
Copy link

burdges commented May 15, 2023

Why bother if 50% / k works, no?

@Sophia-Gold
Copy link
Contributor

I like this! 50% is probably marginally better than 100% if we also want to make it so this doesn't result in disabling.

@eskimor
Copy link
Member Author

eskimor commented May 26, 2023

Why bother if 50% / k works, no?

Yeah. I just like to maximize the slash to whatever is possible in a security preserving way. Going with simply 50% is also a good option, especially when going with 100% would complicate things a lot - calculating from the number of required backers seems pretty easy though.

@burdges
Copy link

burdges commented Jul 18, 2023

As an aside, we occasionally discuss not signing approval votes, which directly conflicts here. We'd need almost a complete graph before that works, which makes not signing hard anyways. If we've enough connectivity, then could invalid approval votes be punished? In disputes, we'd need votes about who voted approve, which adds some bitfield into dispute votes, and worse likely demands repeated votes. You could then slash whoever approval voted according to a 2/3rd majority. It'd be a mess but..

It's possible worse issues exist too, ala paritytech/polkadot#7482 (comment)

@burdges
Copy link

burdges commented Jul 20, 2023

We'd need to give up on this slash if paritytech/polkadot#7482 lands I think, which brings various knock on effects.

@ordian
Copy link
Member

ordian commented Jul 26, 2023

Do we really need to distinguish approval checkers from explicit "for invalid" dispute votes? If not, that would make the implementation much easier. And I don't see why we would need to distinguish.

@burdges
Copy link

burdges commented Jul 26, 2023

A dispute is a dispute. It shouldn't matter if it comes from an approval checker or another validator.

@Overkillus
Copy link
Contributor

Overkillus commented Sep 4, 2023

If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators

This combined with disabling for minor slashes could have very problematic consequences as attackers could force out a lot of honest nodes by sacrificing fewer of their own.

One solution to it was to implement a proportional disabling in case of minor offences (when there is already a lot of validators disabled above some threshold).

If someone hacked the backing validators/was willing to sacrifice them and would not care about getting them slashed, he could craft a candidate timing out on two thirds of the validators: Thus the candidate would be deemed invalid and the attacker got up to a third of the validators slashed 100%, by only controlling as little as two backers. It is worth mentioning, that this attack is not resolved by time disputes as time disputes only resolve disputes concluding for the candidate and charging backers is also no threat if they are willing to get slashed.

I generally believe that we would already be safe from those attacks thanks to time disputes/time overrun changes. They should also cover cases against the candidate and fully prevent this attack scenario.

Example:

  • malicious backer makes a block that 2/3 will say is invalid due to timeout and 1/3 will say valid hoping he would slash and disable some honest nodes
  • some unlucky slow nodes open a dispute after they say invalid
  • in dispute everyone executes and we get reliable execution time statements (median of them is reliable at least)
  • if median is above approval checker timeout it definitely is significantly above backer timeout so we can assume backer is malicious
  • backer technically would win the dispute but due to the execution time manipulations he is forced to loose, block gets rejected and only the backer gets slashed and disabled

Question now is... is there any other problem with slashing valid on invalid? Was it the only reason why it was initially disabled? If there isn't anything else this should not strictly depend on proportional disabling.

@eskimor
Copy link
Member Author

eskimor commented Sep 4, 2023

Yes, with time disputes we should already be safe when it comes to time nondeterminism. The problem is that PVF execution is complex and we do want to support multiple implementations (and versions of those) and different machine architectures. It seems impossible to "guarantee" realistically that there will never be an exploitable non-determinism. I would be very happy if we at least eliminated all the conceptual ones: Time disputes being one of them, native stack size limits might be solvable by something similar, but different implementations might always behave slightly different on edge cases - which could be exploited.

Therefore I would consider defense in depth here not just nice to have, but actually an important requirement to have secure parachain consensus under realistic conditions.

What could be an alternative to proportional disabling is to stop disabling any non 100% slashes (assuming non accumulating slashes here), once a certain threshold of disabled validators is reached. This seems fine and avoids nasty flip-flopping of disabled state. Accumulating slashes should then really be avoided though, because then validators could get kicked out longer term (too little stake at the next election).

@burdges
Copy link

burdges commented Sep 4, 2023

once a certain threshold of disabled validators is reached

Do we need any disabling, except when 100% slashed? We should list the miss-behavior types to figure this out, but at least here I think the time overrun fees and 50% / k trick permits never disables.

@Overkillus
Copy link
Contributor

We need either disabling for non 100% slashes or accumulating slashes to deter repeating minor offences. Both have roughly similar security properties and deter repeated attacks.

The benefit of disabling for minor offences is in cases of generally honest node operators that for some reason had a minor fault and they repeatedly and rapidly commit this minor offence. In slash accumulating this would drain their stake. In disabling this disables them and gives time to react with minimal losses.

@Overkillus
Copy link
Contributor

We should list the miss-behavior types to figure this out

I actually started doing it recently and I think have a list of different misbehaviour types, how we deal with them now. I still need to check some equivocation cases Im not sure about and I'm listing some future changes like time overruns / time disputes to see the direction we are going and what we'll be covering going ahead.

@burdges
Copy link

burdges commented Sep 5, 2023

We'd still have this 50% / k rule there, so the adversary loses more nodes than he takes down, so no security concern here, right? We've changed these details enough, not sure our story from the retreat, but really the details matter here enormously..

At some point, there were once four+ voting modes in time disputes, "valid with time" which could mean under or over time, and "invalid by reason", where one reason was really exhausting the clock. A 2/3rd "invalid" vote means 100% slashing, regardless of the reasons. If not 2/3rd invalid then we do not slash per se, but we charge backers by the overrun median time.

We admit disputes that concludes 2/3rd "invalid" while some honest nodes vote "valid", but we think those honest nodes claim times far beyond the backing timeout (or else we've some other execution bug).

Q1. Could we apportion the 50%/k slash based upon how slow the nodes votes?

  • If yes then approval checkers could always vote "valid but slow".
  • If no then dispute voters could always vote invalid due to timeout.

Q2. Can 2/3rd "valid but overtime" votes trigger reversion?

  • If no, then adversaries could've extremely long running valid blocks, which rivaling the invalidity timeout, so we land in the situation you're discussing, right? I think Q2 no but Q1 yes looks insecure.
  • If yes given a median overrun time >t, then adversaries could cause reversions without escalations, so the backers fees must cover reversion costs by time t. We avoid any sharp line though, so Q1 looks less critical.

At first blush Q1 no but Q2 yes looks most robust..

@tdimitrov tdimitrov mentioned this issue Oct 20, 2023
4 tasks
@eskimor eskimor mentioned this issue Nov 22, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
alexggh added a commit that referenced this issue Dec 13, 2023
Initial implementation for the plan discussed here: #701
Built on top of #1178
v0: paritytech/polkadot#7554,

## Overall idea

When approval-voting checks a candidate and is ready to advertise the
approval, defer it in a per-relay chain block until we either have
MAX_APPROVAL_COALESCE_COUNT candidates to sign or a candidate has stayed
MAX_APPROVALS_COALESCE_TICKS in the queue, in both cases we sign what
candidates we have available.

This should allow us to reduce the number of approvals messages we have
to create/send/verify. The parameters are configurable, so we should
find some values that balance:

- Security of the network: Delaying broadcasting of an approval
shouldn't but the finality at risk and to make sure that never happens
we won't delay sending a vote if we are past 2/3 from the no-show time.
- Scalability of the network: MAX_APPROVAL_COALESCE_COUNT = 1 &
MAX_APPROVALS_COALESCE_TICKS =0, is what we have now and we know from
the measurements we did on versi, it bottlenecks
approval-distribution/approval-voting when increase significantly the
number of validators and parachains
- Block storage: In case of disputes we have to import this votes on
chain and that increase the necessary storage with
MAX_APPROVAL_COALESCE_COUNT * CandidateHash per vote. Given that
disputes are not the normal way of the network functioning and we will
limit MAX_APPROVAL_COALESCE_COUNT in the single digits numbers, this
should be good enough. Alternatively, we could try to create a better
way to store this on-chain through indirection, if that's needed.

## Other fixes:
- Fixed the fact that we were sending random assignments to
non-validators, that was wrong because those won't do anything with it
and they won't gossip it either because they do not have a grid topology
set, so we would waste the random assignments.
- Added metrics to be able to debug potential no-shows and
mis-processing of approvals/assignments.

## TODO:
- [x] Get feedback, that this is moving in the right direction. @ordian
@sandreim @eskimor @burdges, let me know what you think.
- [x] More and more testing.
- [x]  Test in versi.
- [x] Make MAX_APPROVAL_COALESCE_COUNT &
MAX_APPROVAL_COALESCE_WAIT_MILLIS a parachain host configuration.
- [x] Make sure the backwards compatibility works correctly
- [x] Make sure this direction is compatible with other streams of work:
#635 &
#742
- [x] Final versi burn-in before merging

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
@burdges
Copy link

burdges commented Apr 5, 2024

If we've trouble having access to k then I think 2% provides a sensible approximation to 50%/k.

We've 2% > 1.67% = 50%/needed_approvals and 2% < 2.8% = 100%/needed_approvals, with the adversary needing needed_approvals guys during an attack.

If an adversary can split an escalation optimally then we're slashing 2% * num_validators/3 = 0.67% of most stake, not so terrible. In this case, our real problem comes from disabling 1/3rd anyways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

5 participants