Reduce ProgramCache write lock contention#1037
```rust
        execution_time.stop();
        /*
```
firstly, let's see which test should fail. :)
hehe, turned out there's no test to test this code specifically...
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #1037   +/- ##
=======================================
  Coverage    82.0%    82.0%
=======================================
  Files         860      860
  Lines      232898   232911    +13
=======================================
+ Hits       191071   191104    +33
+ Misses      41827    41807    -20
```
```diff
    } = execution_result
    {
-       if details.status.is_ok() {
+       if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
```
hope this one is fairly uncontroversial. haha
```rust
        // ProgramCache entries. Note that this flag is deliberately defined, so that there's still
        // at least one other batch, which will evict the program cache, even after the occurrences
        // of cooperative loading.
        if programs_loaded_for_tx_batch.borrow().loaded_missing {
```
i guess this one isn't so straightforward. better ideas are very welcome.
The global cache can also grow via `cache.merge(programs_modified_by_tx);` above, not just by loading missing entries.
Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.
Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.
> Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.

I intentionally chose a boolean, thinking that the number of insertions and evictions doesn't need to be balanced. That's because `evict_using_2s_random_selection()` keeps evicting entries until they're under 90% of `MAX_LOADED_ENTRY_COUNT` (= 256) with just a single invocation. So, we just need to ensure these are called with sufficient frequency/timing to avoid a cache-bomb DoS attack.
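That eviction behavior can be sketched roughly as follows. This is a toy illustration of the "evict until under 90% of capacity in one call" loop described above, not agave's real implementation: the usage-count vector, the inlined RNG, and the exact two-random-choices policy here are all stand-ins.

```rust
// Hypothetical sketch of the eviction described above: a single call keeps
// evicting (two random candidates, drop the less-used one) until the cache
// is back under 90% of MAX_LOADED_ENTRY_COUNT, so eviction calls need not
// be balanced 1:1 with insertions.
const MAX_LOADED_ENTRY_COUNT: usize = 256;

fn evict_using_2s_random_selection(usage_counts: &mut Vec<u64>) {
    let target = MAX_LOADED_ENTRY_COUNT * 90 / 100; // evict down to 90%
    // Tiny deterministic LCG so the sketch stays dependency-free.
    let mut seed: u64 = 0x9e37_79b9_7f4a_7c15;
    let mut rand = move |n: usize| {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (seed >> 33) as usize % n
    };
    while usage_counts.len() > target {
        // Sample two random entries and evict the less-used one.
        let (a, b) = (rand(usage_counts.len()), rand(usage_counts.len()));
        let victim = if usage_counts[a] <= usage_counts[b] { a } else { b };
        usage_counts.swap_remove(victim);
    }
}
```

A single invocation after an over-full batch is therefore enough to restore headroom, which is why only a boolean trigger is needed.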
> Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

This is possible, and it does look appealing; however, it isn't trivial. Firstly, `load_and_execute_sanitized_transactions()` can be entered via 3 code paths: replaying, banking, and rpc tx simulation. I guess that's why this eviction was placed here to begin with, as the most shared code path for all transaction executions.

The place where we merge in new deployments is `commit_transactions()`, which isn't touched by rpc tx simulation for obvious reasons. So, moving this eviction there would expose an unbounded program-cache entry-growth DoS (theoretically; it assumes no new blocks for an extended duration). Also, replaying and banking take the commit code path under slightly different semantics, so moving this eviction would need a bit of care nevertheless, even if we ignore the rpc concern...

All that said, I think the current code change should be good enough and safe enough.
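For reference, the per-batch locking pattern this change aims for might be sketched like this. The types and the `write_locks_per_batch` helper are purely illustrative (not agave's code); the point is that only the replenish lock stays unconditional, while eviction and commit-merge lock only when actually needed.

```rust
use std::sync::RwLock;

// Toy stand-in for the real ProgramCache.
struct ProgramCache;

// Counts how many write locks a single 1-tx batch would take under the
// conditional scheme discussed in this thread.
fn write_locks_per_batch(
    cache: &RwLock<ProgramCache>,
    loaded_missing: bool,
    programs_modified: bool,
) -> usize {
    let mut locks = 0;
    {
        let _guard = cache.write().unwrap(); // replenish_program_cache(): always
        locks += 1;
    }
    if loaded_missing {
        let _guard = cache.write().unwrap(); // evict only after cooperative loading
        locks += 1;
    }
    if programs_modified {
        let _guard = cache.write().unwrap(); // merge new deployments at commit
        locks += 1;
    }
    locks
}
```

So the usual case takes 1 write lock and only the worst case (missing programs loaded and new deployments committed) still takes 3, matching the PR description below.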
```diff
    {
-       if details.status.is_ok() {
+       if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
            let mut cache = self.transaction_processor.program_cache.write().unwrap();
```
actually, i noticed while writing #1037 (comment) that this write lock is taken once per transaction, not once per batch, if the batch contains 2 or more transactions
You are right. How about:

```rust
if execution_results.iter().any(|execution_result| {
    matches!(
        execution_result,
        TransactionExecutionResult::Executed { details, programs_modified_by_tx }
            if details.status.is_ok() && !programs_modified_by_tx.is_empty()
    )
}) {
    let mut cache = self.transaction_processor.program_cache.write().unwrap();
    for execution_result in &execution_results {
        if let TransactionExecutionResult::Executed { programs_modified_by_tx, .. } =
            execution_result
        {
            cache.merge(programs_modified_by_tx);
        }
    }
}
```
hmm, that incurs two-pass looping in the worst case (totaling O(2*N)), and also rather heavy code duplication.

Considering that `!programs_modified_by_tx.is_empty()` should be rare (unless there's malice), I think a quick and dirty memoization like this will be enough (the worst case's overall cost is O(Cm*N), where Cm << 2 and Cm is the memoization overhead): cce3075
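The memoized single-pass approach might look roughly like this sketch. The types, the `Vec<u32>` result representation, and `merge_modified_programs` are simplified stand-ins, not agave's actual signatures: the point is lazily acquiring the write lock only when the first non-empty modification set appears, then reusing the held guard.

```rust
use std::sync::RwLock;

// Simplified stand-in for agave's ProgramCache.
struct ProgramCache {
    merged_entries: usize,
}

impl ProgramCache {
    fn merge(&mut self, programs_modified_by_tx: &[u32]) {
        self.merged_entries += programs_modified_by_tx.len();
    }
}

// Single pass over the batch's execution results: the write-lock guard is
// "memoized" in an Option, so the lock is taken at most once per batch and
// never taken at all in the common case where nothing was modified.
fn merge_modified_programs(cache: &RwLock<ProgramCache>, results: &[Vec<u32>]) {
    let mut guard = None; // memoized write-lock guard
    for programs_modified_by_tx in results {
        if !programs_modified_by_tx.is_empty() {
            guard
                .get_or_insert_with(|| cache.write().unwrap())
                .merge(programs_modified_by_tx);
        }
    }
    // The single write lock (if ever taken) is released when `guard` drops.
}
```

This avoids both the second pass and the duplicated match arms of the two-pass version quoted above.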
Changes seem fine to me, but I'm less familiar with program-cache stuff. In terms of lock contention, I just ran banking-bench as a sanity check: no programs should be compiled, but if locks are grabbed less often we might expect a small boost in throughput. Results seem to show around the same, if not slightly better:
this result isn't surprising, considering:

So, the banking stage is much like the blockstore processor in this regard...
Problem
`ProgramCache`'s current locking behavior is needlessly hurting the unified scheduler's performance.

The unified scheduler is a new transaction scheduler for block verification (i.e. the replaying stage). As a design choice, it doesn't employ batching at all. Technically, the unified scheduler still uses `TransactionBatch`es, but only with 1 transaction each. That means it bears all the extra overhead that batching would otherwise have amortized.

Sidenote: I believe these overheads are all solvable (but not soon, except this one...). Also note that it's already 1.8x faster than the to-be-replaced `blockstore_processor` after this pr.

Among all those overhead sources, one of the most visible is `ProgramCache`'s write-lock contention. Currently, `ProgramCache` is write-locked 3 times unconditionally per 1-tx batch (for loading by `replenish_program_cache()`, for evictions by `load_and_execute_sanitized_transactions()`, and for committing updated programs by `commit_transactions()`). So, it acutely hampers the unified scheduler's concurrency.

Summary of Changes
Reduce write-locking of `ProgramCache` to the bare minimum. Now the usual case is 1 write lock per batch (for loading by `replenish_program_cache()`), while the worst case remains at 3.

This gives a roughly ~5% consistent improvement. Also note that the `blockstore_processor` isn't affected, even though both take the same code path.

perf improvements
before
after
after-the-pr's speedup-factor is now 1.8x:
(for the record) the merged commit
before(3f7b352):
after(90bea33):
context: extracted from #593
cc: @apfitzge