Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply transaction batches in periodic intervals. #4504

Merged
merged 2 commits into from
Sep 11, 2023

Conversation

mtrippled
Copy link
Collaborator

@mtrippled mtrippled commented Apr 21, 2023

Add new transaction submission API field, "sync_mode", which determines behavior of the server while submitting transactions: 1) sync (default): Process transactions in a batch immediately,
and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
interval and return immediately.
3) wait: Put transaction into the batch for the next processing
interval and return only after it is processed.

This PR is related to 2 others that, when combined, increase throughput significantly:
#4503
#4505

High Level Overview of Change

This improves transaction throughput. For background, transactions are applied to the open ledger in batches. Only one batch is applied at a time. As they are received from either a client or a peer, transactions are added to a batch. If no batches are being processed, then the current batch is processed immediately. Otherwise, it will be processed once the current batch completes. Batches are applied continuously until no more transactions are queued this way. This pattern optimizes both throughput and latency, but only if applying batches does not contend with other activities.

However, the problem is that batch application contends with numerous other activities for the MasterLock and the LedgerMaster lock. As transaction volume increases, so does lock contention. Analysis under heavy transaction load shows that a large amount of time is spent in each transaction batch setting up the open ledger for modification. However, each individual transaction takes a very small amount of time. More importantly, duration preparing to modify the ledger is not affected by the size of the batch! Instead, this duration is related to the amount of transactions in the current open ledger--as transaction volume increases, so does the time it takes to apply each batch.

For example, assume it takes 5ms to prepare the ledger for each batch, and 50us per transaction. Minimizing wall clock time, and therefore lock contention, means minimizing the number of batches. To put in perspective, single transactions submitted just under 5ms apart would consume all available wall clock time, with the vast majority simply preparing the open ledger for modification! That's not a problem if our only workload is applying transactions. But other things need to use the lock, also.

The solution implemented instead attempts to apply batches approximately every 100ms, or 10 times per second. To contrast with the above example, reducing the number of batches this way would reduce open ledger preparation time from nearly 1 full second to only 50ms, while actual transaction processing is a trivial 10ms. That's a 94% reduction in lock contention! That's a contrived example, of course, but it plays out in testing--as transaction volume increases, lock contention decreases and the server is able to process significantly higher volume.

Along with this fix is an enhancement to the submission API. Namely, the existing behavior is to immediately process transactions as they are submitted by the client (but not the peer). However, this tends to diminish the effectiveness of the fix as volume submitted directly to the server increases. This is because more batches are being applied. The problem exhibits itself only under very high volume, but if not addressed will cause problems as livenet volume increases. The API enhancement creates an optional new field called "sync_mode" with the following possible settings:
1) sync (default): Process transactions in a batch immediately,
and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
interval and return immediately. If successful, return a new code: terSUBMITTED.
3) wait: Put transaction into the batch for the next processing
interval and return only after it is processed.

Trade-offs for each option are as follows:

  • sync: This is identical to existing behavior, so no users will be suprised by this as long as this stays the default. However, it will lead to instability of the server as described above.
  • async: This returns immediately, and is actually faster for the client. However, transaction submission errors that occur only once the transaction is applied will be unknown to the client. This should be used for performance testing.
  • wait: This is the slowest for the client, but it allows the server to keep up with the network as if the transaction had been submitted "async".

Context of Change

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)

@ximinez
Copy link
Collaborator

ximinez commented Apr 24, 2023

Roughly how much slower is it for the client if they submit with wait?

@mtrippled
Copy link
Collaborator Author

Roughly how much slower is it for the client if they submit with wait?

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

@ximinez
Copy link
Collaborator

ximinez commented Apr 25, 2023

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

Yeah, duh. I should have realized that. But anyway, the reason I ask is that I suspect most clients aren't going to notice an extra 50ms (or maybe less because they're already paying the overhead of preparing the open ledger). Why not make the default behavior to wait? Or is that something planned for the future once we have data about how well it works in the real world?

And while I'm talking about ideas for the future, it might be worthwhile to make the default configurable, and even to disallow certain modes for non-admin connections. I'm thinking about public nodes who might want to reduce their load by not letting anybody submit with sync.

@mtrippled
Copy link
Collaborator Author

mtrippled commented Apr 25, 2023

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

Yeah, duh. I should have realized that. But anyway, the reason I ask is that I suspect most clients aren't going to notice an extra 50ms (or maybe less because they're already paying the overhead of preparing the open ledger). Why not make the default behavior to wait? Or is that something planned for the future once we have data about how well it works in the real world?

And while I'm talking about ideas for the future, it might be worthwhile to make the default configurable, and even to disallow certain modes for non-admin connections. I'm thinking about public nodes who might want to reduce their load by not letting anybody submit with sync.

I want to keep "sync" the default for now to not surprise anybody. The necessity to phase this out and to either use "wait" or "async" will happen as volumes increase to well over 2000/s based on the testing I've done. Consider that if a current tx submission takes 10ms from the client's perspective, then adding 50ms to that will decrease throughput from 100/s to about 17/s. That's for somebody doing a lot of transaction submission.

I think the idea of allowing administrators to disable certain modes is a good idea--I can see somebody like XRPLF wanting to do that, as well as Ripple. But in practice would also mean updating clients to be wise to this change. Maybe refinements such as this can be something the broad community can debate also?

src/ripple/app/misc/NetworkOPs.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NetworkOPs.cpp Show resolved Hide resolved
src/ripple/rpc/impl/RPCHelpers.cpp Outdated Show resolved Hide resolved
@@ -562,6 +562,7 @@ JSS(sub_index); // in: LedgerEntry
JSS(subcommand); // in: PathFind
JSS(success); // rpc
JSS(supported); // out: AmendmentTableImpl
JSS(sync); // in: Submit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am correct that this becomes part of our client-facing API, we should make sure this is the name we want. I find myself wondering if another name, such as submit or maybe submit_mode would be better. I'm not the right person to answer this question. I'm simply trying to make this decision more visible. If sync is the best name, I'm fine with that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@intelliot what's the best way to define this new API field? I like it as is, since the behavior has to do with being synchronous. But I also don't care enough either way. How should we handle this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you (re)explain the purpose and meaning of the field?
  2. Why does it need to be user-facing?
  3. How should users decide what value to set in the field?

Once we have simplified and accurate answers to the above, then we should collect feedback from API users like @justinr1234, @mvadari, @mDuo13 to get a recommendation for how to define and introduce the field.

Copy link
Collaborator Author

@mtrippled mtrippled May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mDuo13 can you please review the proposed API change, described at the top of this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ximinez what about "sync_mode"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sync mode sounds confusing to me. I prefer mode or submit_mode.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see the value in sync_mode. Like I said earlier, we already know that it's related to submit, but what happens if down the road there's some other processing option that needs a separate mode. I have no idea what that would be, but you never know. If we do sync_mode now, then it'll be less confusing to add foo_mode later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sync_mode" means that the particular mode defines the behavior having to do with synchronicity.
"mode" can be anything and "submit_mode" is redundant. On the other hand, the interface is something that strikes everybody differently. What do you guys think at this point? @justinr1234 @ximinez

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync_mode works for me.

Copy link
Contributor

@HowardHinnant HowardHinnant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but no blockers.

Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left several suggestions, but I like the changes overall.

src/ripple/app/misc/NetworkOPs.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NetworkOPs.cpp Show resolved Hide resolved
src/ripple/app/misc/NetworkOPs.h Outdated Show resolved Hide resolved
src/ripple/core/Config.h Outdated Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved
src/ripple/rpc/handlers/SubmitMultiSigned.cpp Outdated Show resolved Hide resolved
@@ -562,6 +562,7 @@ JSS(sub_index); // in: LedgerEntry
JSS(subcommand); // in: PathFind
JSS(success); // rpc
JSS(supported); // out: AmendmentTableImpl
JSS(sync); // in: Submit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like submit_mode or maybe just mode (since it would be a param to a "submit" command, it would be a "submit mode" either way).

src/test/jtx/Env_test.cpp Outdated Show resolved Hide resolved
src/test/app/Transaction_ordering_test.cpp Outdated Show resolved Hide resolved
src/test/app/Transaction_ordering_test.cpp Outdated Show resolved Hide resolved
@justinr1234
Copy link
Collaborator

async: This returns immediately, and is actually faster for the client. However, transaction submission errors that occur only once the transaction is applied will be unknown to the client. This should be used for performance testing.

Is it possible to asynchronously send an error message to the websocket connection that the client program can still handle later?

src/ripple/basics/SubmitSync.h Show resolved Hide resolved
src/ripple/app/ledger/impl/LedgerMaster.cpp Outdated Show resolved Hide resolved
src/ripple/app/misc/NetworkOPs.cpp Outdated Show resolved Hide resolved
src/test/jtx/Env_test.cpp Outdated Show resolved Hide resolved
@mtrippled
Copy link
Collaborator Author

I created an issue for this so it can be considered in a future project: #4587

@intelliot
Copy link
Collaborator

note: blocked until #4505 is ready to merge

@intelliot intelliot merged commit 002893f into XRPLF:develop Sep 11, 2023
30 checks passed
@mDuo13
Copy link
Collaborator

mDuo13 commented Sep 15, 2023

Proposed release notes blurb:

  • Changed transaction processing so that the initial processing happens in batches every 100ms, which allows for much higher transaction throughput. However, this adds latency of up to 100ms (50ms on average) to processing of transactions from the peer-to-peer network. If you are connected to multiple servers and sending transactions less than 100ms apart, be sure to track your sequence numbers manually; otherwise, autofilling may use the same sequence number for different transactions, resulting in one of those transactions never being confirmed by consensus. (When this happens, you may get the result code tefPAST_SEQ from submitting transactions, but even transactions with an initial result of tesSUCCESS may fail to reach a consensus due to a conflict of sequence numbers.)
  • Added a new sync_mode parameter to the submit command to control batch processing of submitted transactions. The default value, sync, is the same as current behavior, to process each transaction individually. Other values allow transactions to be processed in batches every 100ms. Use async to return immediately with a terSUBMITTED result code, or wait to return only after the transaction batch has been processed.

@mtrippled
Copy link
Collaborator Author

Proposed release notes blurb:

  • Changed transaction processing so that the initial processing happens in batches every 100ms, which allows for much higher transaction throughput. However, this adds latency of up to 100ms (50ms on average) to processing of transactions from the peer-to-peer network. If you are connected to multiple servers and sending transactions less than 100ms apart, be sure to track your sequence numbers manually; otherwise, autofilling may use the same sequence number for different transactions, resulting in one of those transactions never being confirmed by consensus. (When this happens, you may get the result code tefPAST_SEQ from submitting transactions, but even transactions with an initial result of tesSUCCESS may fail to reach a consensus due to a conflict of sequence numbers.)

Good. Perhaps reiterate that this is only a potential problem if people use multiple rippled servers for the same sending account and rapidly send transactions. Up to you. Otherwise, thanks @mDuo13

  • Added a new sync_mode parameter to the submit command to control batch processing of submitted transactions. The default value, sync, is the same as current behavior, to process each transaction individually. Other values allow transactions to be processed in batches every 100ms. Use async to return immediately with a terSUBMITTED result code, or wait to return only after the transaction batch has been processed.

ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
ximinez added a commit to ximinez/rippled that referenced this pull request Dec 14, 2023
ximinez added a commit to ximinez/rippled that referenced this pull request Dec 14, 2023
scottschurr added a commit to scottschurr/rippled that referenced this pull request Dec 18, 2023
This reverts commit 002893f.

Therefore two files with conflicts in the automated revert:
- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp
Those files were manually resolved.

There is currently no evidence that any problems were introduced
by XRPLF#4504.  However something is misbehaving on the current state
of develop, and pull request XRPLF#4504 was identified as a possible
suspect.
intelliot pushed a commit that referenced this pull request Dec 20, 2023
This reverts commit 002893f.

There were two files with conflicts in the automated revert:

- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp

Those files were manually resolved.
@intelliot intelliot mentioned this pull request Dec 20, 2023
2 tasks
@intelliot intelliot added the Reverted Changes which should still be considered for re-merging. See "Closed" PRs with this label label Jan 10, 2024
@intelliot intelliot modified the milestones: TPS 2023-09, TPS Jan 10, 2024
@intelliot
Copy link
Collaborator

Open question: How would this perform if the transaction rebroadcast interval is reduced?

  • Retransmit more often, so txs are stuck on one server for less time?

@intelliot intelliot added Perf Attn Needed Attention needed from RippleX Performance Team and removed Performance/Resource Improvement labels Jan 11, 2024
@intelliot
Copy link
Collaborator

  • Requires additional testing (and possibly debugging)

legleux pushed a commit to legleux/rippled that referenced this pull request Jan 27, 2024
…XRPLF#4852)

This reverts commit 002893f.

There were two files with conflicts in the automated revert:

- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp

Those files were manually resolved.
legleux pushed a commit to legleux/rippled that referenced this pull request Jan 27, 2024
…XRPLF#4852)

This reverts commit 002893f.

There were two files with conflicts in the automated revert:

- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp

Those files were manually resolved.
@ximinez ximinez mentioned this pull request Apr 10, 2024
3 tasks
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
…XRPLF#4852)

This reverts commit 002893f.

There were two files with conflicts in the automated revert:

- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp

Those files were manually resolved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Clio Reviewed Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. Perf Attn Needed Attention needed from RippleX Performance Team Reverted Changes which should still be considered for re-merging. See "Closed" PRs with this label Will Need Documentation
Projects
Status: 📋 Backlog
Development

Successfully merging this pull request may close these issues.

10 participants