Cross-Shard Congestion Control #539

jakmeier · 2024-03-22T13:59:33Z

NEP Status (Updated by NEP Moderators)

Status: Approved

Meeting Recording:
https://www.youtube.com/watch?v=O1MOBmxKqhI

Protocol Work Group voting indications (❔ | 👍 | 👎 ):

it's time to get first feedback by engineers outside the focus group

jakmeier · 2024-03-22T15:59:31Z

And a first draft of "the story behind" is also available: https://github.com/near/nearcore/blob/master/docs/architecture/how/receipt-congestion.md

While the NEP focusses on specifying the proposed changes, the story behind explains our thought process why these changes lead to the desired consequences.

akhi3030

Generally looks good. Some high level thoughts.

A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

If so, then my question is about fairness and relatedly load balancing. The two cases that I am thinking of are:

1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?

jakmeier · 2024-03-25T15:45:32Z

> A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

Yes, that sounds exactly right.

> 1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?

We don't give any guarantees about fairness. We hope that backpressure measures reduce incoming transactions sharply enough that congestion resolves quickly and everyone can send again. But yes, linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
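As a rough illustration of the linear interpolation mentioned above, the following sketch maps a receiver's congestion level to the gas a sender may forward to it in one chunk. The function name and the `min_gas` / `max_gas` defaults are placeholders chosen for this example; they are not taken from this thread.

```python
PGAS = 10**15  # one petagas, expressed in gas units


def outgoing_gas_limit(congestion: float,
                       min_gas: int = 1 * PGAS,
                       max_gas: int = 300 * PGAS) -> int:
    """Gas a sender may forward to one receiver in this chunk.

    `congestion` is the receiver's advertised congestion in [0, 1].
    The default bounds are illustrative assumptions only.
    """
    # Clamp so out-of-range inputs cannot produce a limit outside the bounds.
    level = min(1.0, max(0.0, congestion))
    # Linear interpolation: no congestion -> max_gas, full congestion -> min_gas.
    return round(max_gas - level * (max_gas - min_gas))
```

Because every sender applies the same formula to the same advertised congestion level, each sender gets the same allowance, which is why newly freed queue space tends to be shared evenly.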

> 2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?

There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make.

wacban

a few nits, typos and such

akhi3030 · 2024-03-25T17:46:23Z

> > A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?
>
> Yes, that sounds exactly right.
>
> > 1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
>
> We don't give any guarantees about fairness. We hope that backpressure measures are reducing incoming transactions sharp enough that congestion resolves quickly and everyone can send again. But yes, linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
>
> > 2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?
>
> There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make.

Generally happy with your responses here. One other approach I have seen (and implemented in the past) to guarantee fairness is some sort of credit-based queuing. This lets a receiving entity decide at a fine-grained level how much of its queue it wants to dedicate to each sender. It is natural to use this mechanism to implement fair sharing or arbitrary types of prioritisation as well (e.g. one shard is able to send 2x more than another). The drawback, of course, is more state tracking and a more complex implementation. So I'm happy with the proposed approach.
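For readers unfamiliar with credit-based queuing, here is a minimal sketch of the idea. It is not part of the NEP; the class, its fields, and the weight-based split are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CreditReceiver:
    """Receiver that splits its queue capacity into per-sender credits."""
    capacity: int
    weights: dict[str, int]  # relative share per sender, e.g. {"B": 2, "C": 1}
    credits: dict[str, int] = field(init=False)

    def __post_init__(self) -> None:
        total = sum(self.weights.values())
        # Each sender gets capacity proportional to its weight (rounded down).
        self.credits = {s: self.capacity * w // total
                        for s, w in self.weights.items()}

    def try_send(self, sender: str, size: int) -> bool:
        """Accept a message only if the sender has enough credits left."""
        if self.credits.get(sender, 0) < size:
            return False
        self.credits[sender] -= size
        return True

    def on_processed(self, sender: str, size: int) -> None:
        """Hand credits back once the message has left the queue."""
        self.credits[sender] += size
```

With weights `{"B": 2, "C": 1}`, sender B can occupy twice as much of the queue as C, and an idle C never blocks B beyond B's own share. The per-sender bookkeeping is the extra state tracking mentioned above.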

saketh-are · 2024-03-26T00:56:23Z

neps/nep-0539.md

+
+We store the outgoing buffered receipts in the trie, similar to delayed receipts
+but in their own separate column. Therefore we add a trie column
+`BUFFERED_RECEIPT_OR_INDICES: u8 = 13;`. Then we read and write analogue to the


Minor detail: I know we use this pattern for DELAYED_RECEIPT_OR_INDICES, but it seems to be that way for historical reasons (see commit message here).

For this new queue it would be clearer to have separate BUFFERED_RECEIPT and BUFFERED_RECEIPT_INDICES columns.

akhi3030 · 2024-03-26T12:40:19Z

Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, it is the case that without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard).

Does the proposed solution handle such scenarios? Is the filter operation defined going to apply to the receipts created above?

jakmeier · 2024-03-27T07:40:24Z

> Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, it is the case that without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard).
>
> Does the proposed solution handle such scenarios? Is the filter operation defined going to apply to the receipts created above?

The filter operation is only applicable to transactions, not to receipts. Once receipts are created, we commit to execute them.

The described situation is indeed problematic. Of course, that's exactly what backpressure is for.

If shard 3 becomes congested, shard 1 and 2 can still create receipts for shard 3 but they are forced to keep them in their outgoing buffer before forwarding. This way, shard 3 is protected from additional inflow. Eventually, shards 1 and 2 may also become congested and the backpressure spreads further out to all shards trying to send something to them. Eventually all shards are congested and no more new transactions anywhere are accepted.
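The buffering step described above can be sketched as follows. This is a simplified model, not the nearcore implementation: receipts are plain `(id, gas)` tuples, the buffer is drained in FIFO order, and `gas_allowance` stands for whatever limit the sender derives from the receiver's advertised congestion. All of these are assumptions made for the example.

```python
from collections import deque


def forward_receipts(buffer: deque, gas_allowance: int) -> list:
    """Forward receipts from the front of an outgoing buffer.

    Stops at the first receipt that no longer fits into the remaining
    allowance; everything behind it stays buffered for a later chunk.
    """
    forwarded = []
    while buffer and buffer[0][1] <= gas_allowance:
        receipt = buffer.popleft()
        gas_allowance -= receipt[1]  # receipt[1] is the receipt's gas
        forwarded.append(receipt)
    return forwarded
```

When the receiver is congested the allowance shrinks, fewer receipts leave the buffer per chunk, and the buffer of the sending shard grows. That growth is what eventually pushes backpressure one hop further out.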

Unfortunately, it is still not handled perfectly. We only apply backpressure based on incoming congestion, to avoid deadlocks. But if we are able to handle incoming receipts quickly, it is possible that shard 1 keeps filling its outgoing buffer for shard 2, growing it faster than it can forward the receipts in it. But because the incoming queue is always empty, it does not apply backpressure. (cc @wacban we should probably simulate with the latest changes that decouple incoming and outgoing congestion to see how bad this can become.)

akhi3030 · 2024-03-27T11:01:54Z

I think I understand the high level explanation. The drawback is that in the worst case, due to one shard not keeping up, it is possible that the entire network has to stop accepting new transactions. I am still happy with this solution and see this as a very good next step to build. Once built, I can imagine further refinements where we can address such cases as well.

wacban · 2024-03-27T12:12:59Z

@akhi3030

> This lets a receiving entity decide in fine grain how much of its queue it wants to dedicate to each sender.

If I understand correctly, this could be implemented by splitting the delayed receipts queue into one queue per sending shard and then implementing some fair way to pull receipts from this set of queues. This makes sense, but I would rather keep this NEP in its current, simpler form and build on top of it in follow-ups. The good news is that, as far as I can tell, the current proposal should be easily extendable to what you're suggesting.

> A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

That is correct, just to add a detail to it, each shard will advertise two numbers, one representing the fullness of the "outgoing queues" and one representing the fullness of the "incoming queue". Those two types of congestion are treated differently which allows us to better adapt the measures to the specific workload that the network is under.

akhi3030 · 2024-03-27T13:58:39Z

@wacban: perfect, sounds like a solid plan to me. I am always happy to build incrementally.

This document describes a few fundamental congestion control problems and ideas to solve them. The added page serves as a secondary document to [NEP](near/NEPs#539) to summarise the thought process behind the most important design decisions. But it is generally applicable to congestion in Near Protocol's receipt execution system as it works today. It can even serve as documentation for how congestion can occur today.

The document includes 8 graphs generated using [graphviz](https://graphviz.org/). To regenerate after modifying the `*.dot` files, install the graphviz toolbox (on systems with apt: `sudo apt install graphviz`) and then run `dot -Tsvg img_name.dot > img_name.svg`.

Co-authored-by: wacban <[email protected]>

- The formulas in the pseudo code were opposite to the description, fixing it by swapping incoming and general congestion.
- "General" congestion is a bad name. Changing it to "Memory" congestion.
- Add a sentence of motivation to the pseudo code snippets for extra explanation
- Add TODO for unbounded queue problem

wacban · 2024-04-05T12:20:05Z

I implemented the model of the strategy proposed in the NEP. I am now analysing different workloads to make sure that the strategy can handle them well. I will be sharing results and suggestions here as I progress.

AllToOne workload.

In this workload all shards send direct transactions to a single shard that becomes congested.

The strategy does a rather bad job at dealing with this workload as the outgoing buffers grow in gas without a reasonable limit. The memory limit is never exceeded because the receipts are small but the number and gas of receipts grows beyond acceptable values.

The reason is that the current proposal does not take the gas accumulated in outgoing buffers into account.

My suggestion would be to replace memory congestion with general_congestion as follows:

    ShardChunkHeaderInnerV3 {
      // as is
      incoming_congestion: u16,
      // memory -> general
      general_congestion: u16,
    }

    // Same as in NEP
    MAX_CONGESTION_MEMORY_CONSUMPTION = 500 MB
    memory_consumption = 0
    memory_consumption += sum([receipt.size() for receipt in delayed_receipts_queue])
    memory_consumption += sum([receipt.size() for receipt in postponed_receipts_queue])
    memory_consumption += sum([receipt.size() for receipt in outgoing_receipts_buffer])

    memory_congestion = memory_consumption / MAX_CONGESTION_MEMORY_CONSUMPTION
    memory_congestion = min(1.0, memory_congestion)

    // New
    // Similar to memory but summing up gas instead of size
    MAX_CONGESTION_GAS_BACKLOG = 100 PG
    gas_backlog = 0
    gas_backlog += sum([receipt.gas() for receipt in delayed_receipts_queue])
    gas_backlog += sum([receipt.gas() for receipt in postponed_receipts_queue])
    gas_backlog += sum([receipt.gas() for receipt in outgoing_receipts_buffer])

    gas_congestion = gas_backlog / MAX_CONGESTION_GAS_BACKLOG
    // Cap the gas term itself, not the memory term
    gas_congestion = min(1.0, gas_congestion)

    // New
    general_congestion = max(memory_congestion, gas_congestion)

I implemented the suggestion in the model and the results are quite good - both the incoming queue and outgoing buffers display bounded, periodic behaviour.
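For anyone who wants to experiment with the formulas, here is a runnable Python rendering of the suggested calculation. It is a sketch, not the model code: the `Receipt` class is a stand-in, 500 MB is read as 500 * 10^6 bytes and 100 PG as 100 * 10^15 gas, and the gas term is capped using `gas_congestion` itself.

```python
from dataclasses import dataclass

MAX_CONGESTION_MEMORY_CONSUMPTION = 500 * 10**6  # 500 MB (assumed decimal megabytes)
MAX_CONGESTION_GAS_BACKLOG = 100 * 10**15        # 100 PG (petagas)


@dataclass
class Receipt:
    size: int  # serialized size in bytes
    gas: int   # gas attached to the receipt


def general_congestion(delayed, postponed, outgoing) -> float:
    """Return max(memory_congestion, gas_congestion), each capped at 1.0."""
    receipts = [*delayed, *postponed, *outgoing]
    memory_consumption = sum(r.size for r in receipts)
    gas_backlog = sum(r.gas for r in receipts)
    memory_congestion = min(1.0, memory_consumption / MAX_CONGESTION_MEMORY_CONSUMPTION)
    gas_congestion = min(1.0, gas_backlog / MAX_CONGESTION_GAS_BACKLOG)
    return max(memory_congestion, gas_congestion)
```

In the AllToOne case the receipts are small but carry a lot of gas, so `memory_congestion` stays near zero while `gas_congestion` rises; taking the maximum lets the gas backlog alone trigger congestion handling.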

In the picture below, each period is characterized by four phases:

- phase 1 - rapid growth
  - incoming gas grows to 150PG
  - outgoing gas grows to 100PG
  - shards send plenty of load to the loaded shard (0)
- phase 2 - incoming decline
  - incoming gas drops to 100PG
  - outgoing gas stays at 100PG
- phase 3 - outgoing decline
  - incoming gas stays at 100PG
  - outgoing gas drops to 0PG
- phase 4 - incoming decline
  - incoming gas drops to 50PG - the threshold for accepting transactions
  - outgoing gas stays at 0PG

We can probably smooth it out further by replacing the hard incoming congestion threshold with linear interpolation. It's not a priority right now so I'll leave it as is.

jakmeier · 2024-05-15T22:04:17Z

Thanks a lot to @Akashin and @robin-near for taking the time to read through our proposal and giving valuable feedback! I really appreciate your expertise to ensure we end up with the best possible solution to move congestion control one step forward.

Sorry about the subpar quality of the grammar, and of the text in general. I thought we had the NEP cleaned up much better, otherwise I wouldn't have asked for SME reviews. I think we rushed a bit too much then, as we wanted to get the NEP process started as soon as possible.

I have tried my best to fix it up now and added a new section about important concepts. Please, @robin-near, can you take another look? Let me know if something is still not well defined or not written clearly.

jakmeier · 2024-05-15T22:16:55Z

Oh and in the time since the last changes, we added "missed chunks congestion" as an additional indicator. I have added it to the concepts section and to the "Changes to chunk execution" section.

It's a bit of a last minute change, not something we initially wanted to address. But for stateless validation, Near Protocol needs a way to limit incoming receipts even when chunks are missed. This NEP introduces all the required tools to solve that problem, so it seemed worth it to include. But if preferred by the working group, we could also separate it out as its own NEP that builds on top of congestion control.

@wacban, since you spear-headed and implemented this, can you please double-check that I got the details around missed chunk congestion right?

birchmd

As a working group member, I lean towards approving this proposal. While Near aims to scale such that it can handle the load users place on the network, it is still critical that Near remains usable under all loads. This congestion handling protocol accomplishes this goal while leaving room for transaction prioritization in the future.

One note I would like to make is that front-ends may need to update their retry logic to specifically handle the "transaction rejected due to congestion" error. This should be communicated clearly along with the protocol change which includes congestion control.

bowenwang1996 · 2024-05-21T22:12:47Z

As a working group member, I lean towards approving this NEP. It is a major step towards addressing congestion related stability issues and improving the user experience of NEAR.

mfornet

As a working group member I lean toward approving this proposal.

One observation. Say that an account on shard A wants to interact with a contract on shard B. Shard B is congested, so the transaction will be rejected. A (not so simple) alternative for the user is to route their transaction through a collocated contract on shard C (user -> A -> C -> B). The receipt between C and B will be delayed, but the transaction got in anyway (is there any advantage to this?). Eventually, if everyone keeps doing this, C will get congested, and given there is nothing special about C, if these routing contracts are collocated on every shard and users do this, all shards will eventually get congested due to one app on shard B.

The situation described above exists beyond this proposal. I'm highlighting it since it will continue to exist.

jakmeier · 2024-05-22T13:52:12Z

@robin-near You wrote that you want to take another look. Note that a WG meeting and the voting on the NEP is scheduled for this Friday. If you have any concerns about the proposal, please raise them as early as possible so they can be incorporated in the decision.

mm-near · 2024-05-24T14:46:08Z

As a working group member I lean toward approving this proposal.

I have two meta comments:

- Similar to @mfornet's comment above: with this change, it will suddenly matter on which shard your account is located (if you happen to be collocated with some popular contract, more of your transactions will fail, etc.). This should be clearly stated in the documentation.
- I'd suggest that the shard congestion_level info is clearly visible in the explorers, so that regular users can quickly see what's going on (and that not all of the system is under load).

jakmeier · 2024-05-24T17:53:00Z

High-level overview slides from today's WG call: https://docs.google.com/presentation/d/1zm0zZKnJpfGsj8-yo9tePqxd9CRhicKPcr1dDnePyVk/edit?usp=sharing

Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type `InvalidTxError` and called `ShardCongested` with a single field `shard_id` referencing the congested shard.

## Details

With [cross-shard congestion control](near/NEPs#539) being stabilized soon, we must deal with the case when a shard rejects new transactions. We want to reject new transactions as early as possible when the receiver shard is already overloaded with traffic.

On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected.

This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either it is executed within a bounded time (at most 20 blocks after inclusion) or it will be rejected immediately.

But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error `ShardCongested` differently from existing errors. The key difference is that the same signed transaction can be sent again and will be accepted if congestion has gone down.
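A wallet-side retry loop for this error could look roughly like the sketch below. Only the error name `ShardCongested` and its `shard_id` field come from the description above. The exception class, the `send_tx` callable, and the backoff parameters are hypothetical stand-ins for whatever a real client library provides.

```python
import time


class ShardCongested(Exception):
    """Stand-in for the RPC error; real clients will surface it differently."""

    def __init__(self, shard_id: int):
        super().__init__(f"shard {shard_id} is congested")
        self.shard_id = shard_id


def send_with_retry(send_tx, signed_tx, max_attempts=5, base_delay=1.0,
                    sleep=time.sleep):
    """Resend the same signed transaction while the shard is congested.

    `send_tx` is any callable that submits the transaction and raises
    ShardCongested on rejection. Other errors are not retried.
    """
    for attempt in range(max_attempts):
        try:
            return send_tx(signed_tx)
        except ShardCongested:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller inform the user
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
```

The transaction is not re-signed between attempts, which relies on the property stated above that the same signed transaction is accepted once congestion has gone down. The caught exception carries `shard_id`, so a wallet could also show the user which shard is busy.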

# Feature to stabilize

This PR stabilizes the Congestion Control and Stateless Validation protocol features. They are assigned separate protocol features and the protocol upgrades should be scheduled separately.

# Context

* near/NEPs#539
* near/NEPs#509

# Testing and QA

Those features are well covered in unit, integration and end to end tests and were extensively tested in forknet and statelessnet.

# Checklist

- [x] Link to nightly nayduck run (`./scripts/nayduck.py`, [docs](https://github.com/near/nearcore/blob/master/nightly/README.md#scheduling-a-run)): https://nayduck.nearone.org/
- [x] Update CHANGELOG.md to include this protocol feature in the `Unreleased` section.

jakmeier added 3 commits March 22, 2024 11:39

first nep draft for sharing with project team

81530e6

clean up a first reasonably presentable draft

9262ddb

it's time to get first feedback by engineers outside the focus group

assign nep number from pr

262d576

bowenwang1996 reviewed Mar 22, 2024

staffik reviewed Mar 22, 2024

This was referenced Mar 25, 2024

docs: Describe congestion fundamentals near/nearcore#10873

Merged

[ProjectTracking]: congestion control near/near-one-project-tracking#48

Open

congestion control: fine-tune parametrs based on expected real-world traffic near/nearcore#10874

Open

akhi3030 reviewed Mar 25, 2024

wacban reviewed Mar 25, 2024

Apply suggestions from code review

fc56f4c

Co-authored-by: wacban <[email protected]>

saketh-are reviewed Mar 26, 2024

jakmeier added 2 commits April 2, 2024 11:15

formatting

315952a

wacban reviewed Apr 2, 2024

jakmeier and others added 4 commits April 2, 2024 17:31

Update neps/nep-0539.md

7af5edb

Co-authored-by: wacban <[email protected]>

first draft of section "Reference Implementation"

4c96103

No link to the actual reference implementation, yet. Just some clarifying text and in-place code.

add link to reference impl PR

5b018a9

simplify delayed receipts gas tracking

690fef1

I think it's better to keep it simple. While it could be useful in the future to look at guaranteed to be burnt and attached gas separately for congestion, our current strategy does not look at it.

describe buffered receipts queue in more details

99ec21f

jakmeier added 2 commits May 15, 2024 18:11

Merge pull request #3 from robin-near/congestion

1e85668

Correct some typos, grammar issues, and clarify some text.

clean up and add section on concepts

fbb674c

Address various comments by SME reviewers. - Fix various grammar errors. - Remove old names and use only the correct names for variables - Start the specification section by introducing important concepts

jakmeier requested a review from robin-near May 15, 2024 22:04

markdown lints

8be43a3

This was referenced May 20, 2024

Use cross-shard load information to limit the number of accepted receipts near/nearcore#9228

Closed

Implement test for global congestion control near/nearcore#9227

Closed

birchmd reviewed May 21, 2024

mfornet reviewed May 22, 2024

jakmeier and others added 2 commits May 22, 2024 15:42

Apply suggestions from code review

8d1f88c

Co-authored-by: Marcelo Fornet <[email protected]>

Apply suggestions from code review

dc27a23

Co-authored-by: Michael Birch <[email protected]>

jakmeier mentioned this pull request May 29, 2024

feat(congestion): reject new transactions on RPC level near/nearcore#11419

Merged

wacban mentioned this pull request Jul 2, 2024

stabilize congestion control and stateless validation near/nearcore#11701

Merged

Merge branch 'master' into congestion-control

53c9d45

flmel added the S-approved label (a NEP that was approved by a working group) and removed the S-review/needs-sme-review label (a NEP in the REVIEW stage is waiting for Subject Matter Expert review) on Oct 1, 2024

flmel approved these changes Oct 1, 2024

flmel merged commit 62fb58f into near:master Oct 1, 2024
3 checks passed
