Make transaction status service multi-threaded. #4032
Problem
As part of an investigation into Agave OOM issues in internal private cluster tests, I found that the TSS receiver channel gets severely backed up (80k+ pending messages) when the cluster runs at a sustained 40k TPS (bench-TPS workload; 80-20 FD-to-Agave node ratio). This slows down the whole system and builds up memory usage until the node OOMs (in my tests, it crashed a 256 GB Agave node and the Agave tile on a 512 GB FD node). The issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage'.
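To illustrate the general idea of draining a backed-up channel with several workers instead of one, here is a minimal, standard-library-only sketch. It is not the code in this PR: `StatusBatch`, `drain_with_workers`, and the worker count are hypothetical names chosen for illustration, and the real service may use a different channel type and thread pool.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a transaction status message.
pub struct StatusBatch {
    pub slot: u64,
    pub num_transactions: usize,
}

/// Drains `receiver` with `num_workers` threads and returns the total
/// number of transactions processed. Returns once every sender has been
/// dropped and the channel is empty.
pub fn drain_with_workers(receiver: Receiver<StatusBatch>, num_workers: usize) -> usize {
    // std's Receiver cannot be cloned, so the workers share it behind a mutex.
    let receiver = Arc::new(Mutex::new(receiver));
    let processed = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..num_workers)
        .map(|_| {
            let receiver = Arc::clone(&receiver);
            let processed = Arc::clone(&processed);
            thread::spawn(move || loop {
                // The lock guard is a temporary, so it is released at the end
                // of this statement, before the (expensive) processing step.
                let msg = receiver.lock().unwrap().recv();
                match msg {
                    Ok(batch) => {
                        // Placeholder for the real work, e.g. writing
                        // transaction metadata for `batch.slot` to storage.
                        let _slot = batch.slot;
                        processed.fetch_add(batch.num_transactions, Ordering::Relaxed);
                    }
                    // All senders dropped and the queue is empty: stop.
                    Err(_) => break,
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    processed.load(Ordering::Relaxed)
}
```

With several workers, messages are no longer processed in strict arrival order, so any real implementation has to account for work that depends on ordering. The sketch deliberately ignores that.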
Summary of Changes
Original issue:
The FD node failures are the Agave tile OOMing.
Improved state:
tiv1 and tiv2 are Agave nodes running the fix. The other nodes are running the same FD code as before.
original code (without tx history flags):