
Retry on busy and locktimeout failures for rocksdb transactions#5976

Closed
garyschulte wants to merge 13 commits into hyperledger:main from garyschulte:feature/rocks-tx-retry

Conversation

@garyschulte
Contributor

@garyschulte garyschulte commented Oct 2, 2023

PR description

Simple default retry policy for rocksdb busy and locktimeout exceptions.

This may or may not remedy the busy exceptions encountered in the listed issues, but at a minimum it will provide better signal as to whether these are transient issues due to compaction or more systemic problems.

Fixed Issue(s)

related to #5911
related to #5549
related to #5807

@github-actions

github-actions bot commented Oct 2, 2023

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.
  • I thought about the changelog and included a changelog update if required.
  • If my PR includes database changes (e.g. KeyValueSegmentIdentifier) I have thought about compatibility and performed forwards and backwards compatibility tests

```java
LOG.warn(
    "RocksDB Busy exception caught on attempt {} of {}, retrying",
    attemptNumber,
    retryLimit);
commit(attemptNumber + 1, retryLimit);
```
Contributor


do you think a direct retry without any delay will fix the problem ? maybe setting the level to info like that we can debug and see if this fix can unblock this kind of situation ?

Contributor Author


I haven't yet been able to reproduce a busy exception, but I bumped the logging to warn. It seems reasonable to leave it at that level since it likely will provide good signal and shouldn't spam the logs due to the transaction commit timeout.

@garyschulte garyschulte force-pushed the feature/rocks-tx-retry branch from c3e7d7c to 5d6e4f5 on October 4, 2023 19:06
@jframe
Contributor

jframe commented Oct 5, 2023

I remember that @gfukushima tried a similar approach in https://github.com/hyperledger/besu/pull/5287/files but from what I recall the retry didn't fix the busy error.

@garyschulte
Contributor Author

I remember that @gfukushima tried a similar approach in https://github.com/hyperledger/besu/pull/5287/files but from what I recall the retry didn't fix the busy error.

Thanks I didn't know he already tried this.

Was he able to correlate this to rocksdb activity? The rocks code indicates that 'Status::Busy' should be transient and the remedy is to retry. I have some instances trying to reproduce this now, hoping to get some activity metrics and rocksdb logs to see what might be causing the contention.

@garyschulte garyschulte closed this Oct 5, 2023
@jframe
Contributor

jframe commented Oct 5, 2023

I remember that @gfukushima tried a similar approach in https://github.com/hyperledger/besu/pull/5287/files but from what I recall the retry didn't fix the busy error.

Thanks I didn't know he already tried this.

Was he able to correlate this to rocksdb activity? The rocks code indicates that 'Status::Busy' should be transient and the remedy is to retry. I have some instances trying to reproduce this now, hoping to get some activity metrics and rocksdb logs to see what might be causing the contention.

Not sure about correlating it to rocksdb activity. Too long ago. He is on holiday right now. Have asked around, hopefully someone remembers more than I do.

Agree that doing a retry definitely seems like the right thing to do.

@siladu
Contributor

siladu commented Oct 5, 2023

hopefully someone remembers more than I do.
Maybe @matkt ?

…tions

Signed-off-by: garyschulte <garyschulte@gmail.com>
…age parsing

Signed-off-by: garyschulte <garyschulte@gmail.com>
@garyschulte
Contributor Author

I have been able to reproduce this somewhat readily by snap syncing a forest node with slow storage.

I think Forest is particularly vulnerable to these transient locking/busy issues, because it is using TransactionDB mode, which locks keys during transaction write rather than during commit.

So far, I have seen it exclusively on .put() rather than .commit(), so I tweaked the implementation to retry all RocksDBTransaction actions that encounter a non-fatal exception, and am re-testing.
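A rough sketch of that shape (the exception type, retry limit, and `retriable` helper below are illustrative stand-ins, not Besu's actual `RocksDBTransaction` code):

```java
import java.util.function.Supplier;

// Illustrative stand-in for a transient RocksDB failure (Busy / TimedOut(LockTimeout)).
class TransientDbException extends RuntimeException {
  TransientDbException(final String status) {
    super(status);
  }
}

final class RetryingTransaction {
  private static final int RETRY_LIMIT = 3; // assumed limit, for illustration

  // Every transaction action (put, remove, commit, ...) funnels through this guard,
  // so any of them can be retried on a non-fatal exception, not just commit().
  static <T> T retriable(final Supplier<T> op) {
    for (int attempt = 0; ; attempt++) {
      try {
        return op.get();
      } catch (TransientDbException e) {
        if (attempt >= RETRY_LIMIT) {
          throw e; // retries exhausted: surface the failure to the caller
        }
        // the real code would log here: "caught on attempt {} of {}, retrying"
      }
    }
  }
}
```

The point of the shared guard is that individual operations stay oblivious to the retry policy; they only declare which failures are transient.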

@garyschulte garyschulte reopened this Oct 5, 2023
@garyschulte garyschulte force-pushed the feature/rocks-tx-retry branch from c2c0cde to faa5dd6 on October 5, 2023 21:24
@garyschulte
Contributor Author

garyschulte commented Oct 6, 2023

Interestingly, this did fix one of the TimedOut(LockTimeout) exceptions on one thread on scheduler-8. But 60 seconds later the same thing happened to both scheduler-8 and scheduler-13. It looks like they were both trying to obtain locks and ended up in a deadlock kind of situation:

{"@timestamp":"2023-10-05T23:50:50,129","level":"WARN","thread":"EthScheduler-Services-8 (batchPersistStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 0 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:45,556","level":"WARN","thread":"EthScheduler-Services-8 (batchPersistStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 0 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:45,627","level":"WARN","thread":"EthScheduler-Services-13 (batchPersistLargeStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 0 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:46,557","level":"WARN","thread":"EthScheduler-Services-8 (batchPersistStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 1 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:46,630","level":"WARN","thread":"EthScheduler-Services-13 (batchPersistLargeStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 1 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:47,558","level":"WARN","thread":"EthScheduler-Services-8 (batchPersistStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 2 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:47,630","level":"WARN","thread":"EthScheduler-Services-13 (batchPersistLargeStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 2 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:48,558","level":"WARN","thread":"EthScheduler-Services-8 (batchPersistStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 3 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}
{"@timestamp":"2023-10-05T23:51:48,631","level":"WARN","thread":"EthScheduler-Services-13 (batchPersistLargeStorageData)","class":"RocksDBTransaction","message":"RocksDB Transient exception caught on attempt 3 of 3, status: TimedOut(LockTimeout), retrying.","throwable":""}

It is interesting that the retry worked for the first lock timeout, but 8 and 13 together seemed to be deadlocked in the same time frame, for the same timeout duration. Perhaps what we might need to do is introduce a static timeout supplier that forces an incremental backoff wait. That way at least one of the threads might be able to get the lock and progress while the other one waits.

I will try that on this node since it seems to readily reproduce it within an hour or so of starting a mainnet sync.

edit: also, I would swear we have a write timeout that is longer than 1 second??

edit-edit: Looking at the RocksDB LOG file, there is a suspicious GAP in the otherwise incredibly chatty rocksdb log when the LockTimeouts occurred:

2023/10/05-23:51:43.520518 139977706960448 EVENT_LOG_v1 {"time_micros": 1696549903520516, "job": 789, "event": "table_file_deletion", "file_number": 2509}
2023/10/05-23:51:43.521454 139977706960448 [le/delete_scheduler.cc:77] Deleted file /data/besu/database/002508.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
2023/10/05-23:51:43.521469 139977706960448 EVENT_LOG_v1 {"time_micros": 1696549903521467, "job": 789, "event": "table_file_deletion", "file_number": 2508}
2023/10/05-23:51:48.571692 139977323705920 [/db_impl/db_impl_write.cc:2155] [] New memtable created with log file: #2852. Immutable memtables: 0.
2023/10/05-23:51:48.571768 139977673389632 [/db_impl/db_impl_compaction_flush.cc:105] [JOB 793] Syncing log #2848
2023/10/05-23:51:48.873910 139977673389632 (Original Log Time 2023/10/05-23:51:48.571755) [/db_impl/db_impl_compaction_flush.cc:2960] Calling FlushMemTableToOutputFile with column family [], flush slots ava

rocksdb is usually writing tens of log lines per second, so it is quite conspicuous for there to be a 5 second gap. This might be mere correlation, but it seems odd to have no activity logged at all, regardless of 2 (of many) transactions getting lock timeouts 🤔
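The incremental backoff idea above can be sketched as a static delay supplier. The base delay and exponential growth are assumptions for illustration; the jitter term is what gives two simultaneously timed-out threads a chance to desynchronize:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative incremental backoff: the delay grows with each attempt, and random
// jitter is added so threads that timed out together do not retry in lockstep.
final class Backoff {
  static long delayMillis(final int attempt, final long baseMillis) {
    final long delay = baseMillis << attempt; // 100ms, 200ms, 400ms, ... for base = 100
    final long jitter = ThreadLocalRandom.current().nextLong(baseMillis);
    return delay + jitter;
  }
}
```

A retry loop would then call `Thread.sleep(Backoff.delayMillis(attempt, base))` before the next attempt, giving one of the contending threads room to acquire its locks and progress.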

…onDB

Signed-off-by: garyschulte <garyschulte@gmail.com>
@matkt
Contributor

matkt commented Oct 6, 2023

If the bug happens largely during sync, I can try to add retry logic at the pipeline level in addition to this PR: instead of failing the pipeline on a lock, I would restart the step later. Your PR tries to manage the commit in general, but I think we can also have better handling of RocksDB errors on the pipeline side.

Signed-off-by: garyschulte <garyschulte@gmail.com>
@garyschulte
Contributor Author

garyschulte commented Oct 6, 2023

If the bug happens largely during sync, I can try to add retry logic at the pipeline level in addition to this PR: instead of failing the pipeline on a lock, I would restart the step later. Your PR tries to manage the commit in general, but I think we can also have better handling of RocksDB errors on the pipeline side.

There seems to be a variety of places this can occur. I encountered LockTimeout most often during sync in state trie commit visitors, when they are just doing a transaction put(). I haven't been able to reproduce the Busy exception during a bonsai sync, but Optimistic transaction db seems to have a different behavior/pattern since it resolves locks at transaction .commit() time.

I think it is better to address it systemically rather than having individual processes duplicate the retry logic. If we find that some code paths are particularly prone to this or need special care, we can always catch the storage exception that is eventually thrown and apply process-specific logic there.

@matkt
Contributor

matkt commented Oct 6, 2023

If the bug happens largely during sync, I can try to add retry logic at the pipeline level in addition to this PR: instead of failing the pipeline on a lock, I would restart the step later. Your PR tries to manage the commit in general, but I think we can also have better handling of RocksDB errors on the pipeline side.

There seems to be a variety of places this can occur. I encountered LockTimeout most often during sync in state trie commit visitors, when they are just doing a transaction put(). I haven't been able to reproduce the Busy exception during a bonsai sync, but Optimistic transaction db seems to have a different behavior/pattern since it resolves locks at transaction .commit() time.

I think it is better to address it systemically rather than having individual processes duplicate the retry logic. If we find that some code paths are particularly prone to this or need special care, we can always catch the storage exception that is eventually thrown and apply process-specific logic there.

Ok, I will also try to start some syncs next week, hoping to reproduce it with bonsai.

@garyschulte
Contributor Author

garyschulte commented Oct 6, 2023

@matkt - perhaps I spoke too soon about not trying to handle this in the sync process. With an incremental backoff retry, I am still seeing LockTimeouts. Sometimes a single thread will get locktimeouts and will resolve, but inevitably two separate threads are trying to write to overlapping key ranges and a deadlock occurs. It doesn't matter how long we wait since each process/transaction is trying to get a lock on the keys of the other. This doesn't happen with OptimisticTransactionDB because conflicts are resolved at commit time rather than transaction write.

We will have to do some instrumenting/logging to test this theory, but I think we have 2 separate worldstate download threads that are trying to write to some of the same trie keys during the commit phase.

However... I suspect the better option is going to be to ditch TransactionDB for forest and go back to OptimisticTransactionDB with transaction retries. IIRC, we reverted forest to TransactionDB to alleviate "Busy" exceptions. The current retry strategy in this PR might make TransactionDB unnecessary for forest, and we can avoid the concurrency/deadlock issue.
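The difference in locking behavior can be illustrated with a plain compare-and-swap loop, which has the same shape as optimistic commit-time conflict resolution (this is a conceptual analogy in plain Java, not the RocksDB API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual analogy for OptimisticTransactionDB: writes take no locks; a conflict
// is only detected at "commit" time, and the losing transaction simply retries.
final class OptimisticCounter {
  private final AtomicInteger value = new AtomicInteger(0);

  int increment() {
    while (true) {
      final int snapshot = value.get();   // read without locking anything
      final int updated = snapshot + 1;   // do the work against the snapshot
      if (value.compareAndSet(snapshot, updated)) {
        return updated;                   // "commit" succeeded: no conflict
      }
      // conflict detected at commit: retry the whole operation; no deadlock is
      // possible because no thread ever holds a lock while waiting on another
    }
  }
}
```

By contrast, the pessimistic `TransactionDB` mode acquires key locks at write time, so two transactions writing overlapping key ranges in opposite orders can block each other until one times out.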

@matkt
Contributor

matkt commented Oct 6, 2023

@matkt - perhaps I spoke too soon about not trying to handle this in the sync process. With an incremental backoff retry, I am still seeing LockTimeouts. Sometimes a single thread will get locktimeouts and will resolve, but inevitably two separate threads are trying to write to overlapping key ranges and a deadlock occurs. It doesn't matter how long we wait since each process/transaction is trying to get a lock on the keys of the other. This doesn't happen with OptimisticTransactionDB because conflicts are resolved at commit time rather than write.

We will have to do some instrumenting/logging to test this theory, but I think we have 2 separate worldstate download threads that are trying to write to some of the same trie keys during the commit phase.

However... I suspect the better option is going to be to ditch TransactionDB for forest and go back to OptimisticTransactionDB with transaction retries. IIRC, we reverted forest to TransactionDB to alleviate "Busy" exceptions. The current retry strategy in this PR might make TransactionDB unnecessary for forest, and we can avoid the concurrency/deadlock issue.

For the sync part the fix can be very simple, which is why I propose this: just catch the error during a persist and mark the task as failed; the pipeline will then process this task again later. Apart from that, I think the probability of hitting this error outside of sync should be limited, because we do not have many other cases where several threads persist to the worldstate at the same time.
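A minimal sketch of that pipeline-side handling (the `Task` interface and queue here are hypothetical; Besu's sync pipeline has its own task abstraction):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative pipeline step: a persist failure marks the task as failed and
// re-enqueues it, so the pipeline processes it again later instead of aborting.
final class PersistPipeline {
  interface Task {
    void persist() throws Exception;
  }

  private final Deque<Task> queue = new ArrayDeque<>();

  void submit(final Task task) {
    queue.add(task);
  }

  // Drains the queue, re-enqueueing failed tasks; returns how many completed.
  int drain() {
    int completed = 0;
    while (!queue.isEmpty()) {
      final Task task = queue.poll();
      try {
        task.persist();
        completed++;
      } catch (Exception e) {
        queue.add(task); // failed (e.g. TimedOut(LockTimeout)): retry on a later pass
      }
    }
    return completed;
  }
}
```

A production version would cap retries per task and apply backoff, but the essential move is the same: a lock timeout fails one task, not the whole pipeline.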

@matkt
Contributor

matkt commented Oct 6, 2023

@matkt - perhaps I spoke too soon about not trying to handle this in the sync process. With an incremental backoff retry, I am still seeing LockTimeouts. Sometimes a single thread will get locktimeouts and will resolve, but inevitably two separate threads are trying to write to overlapping key ranges and a deadlock occurs. It doesn't matter how long we wait since each process/transaction is trying to get a lock on the keys of the other. This doesn't happen with OptimisticTransactionDB because conflicts are resolved at commit time rather than write.
We will have to do some instrumenting/logging to test this theory, but I think we have 2 separate worldstate download threads that are trying to write to some of the same trie keys during the commit phase.
However... I suspect the better option is going to be to ditch TransactionDB for forest and go back to OptimisticTransactionDB with transaction retries. IIRC, we reverted forest to TransactionDB to alleviate "Busy" exceptions. The current retry strategy in this PR might make TransactionDB unnecessary for forest, and we can avoid the concurrency/deadlock issue.

For the sync part the fix can be very simple, which is why I propose this: just catch the error during a persist and mark the task as failed; the pipeline will then process this task again later.

I think the lock timeout is the non-optimistic version of Busy.

@garyschulte garyschulte marked this pull request as draft October 6, 2023 18:51
@garyschulte
Contributor Author

garyschulte commented Oct 6, 2023

For the sync part the fix can be very simple, which is why I propose this: just catch the error during a persist and mark the task as failed; the pipeline will then process this task again later. Apart from that, I think the probability of hitting this error outside of sync should be limited, because we do not have many other cases where several threads persist to the worldstate at the same time.

Seems reasonable. It isn't obvious that allowing both threads to write is going to result in a correct trie either, so allowing one or both to fail and retry the task later is probably the better strategy.

Looking back over all of the "Busy" issues, it seems all of them have been related to worldstate sync. If we handle the sync failure gracefully in the pipeline, there really isn't any need for TransactionDB anymore afaict. We know that Optimistic is more performant, so if TransactionDB isn't 'fixing' anything, we should remove it as well.

Signed-off-by: garyschulte <garyschulte@gmail.com>
@garyschulte garyschulte force-pushed the feature/rocks-tx-retry branch from 9a7f850 to 5e7c92e on October 6, 2023 19:07
@garyschulte
Contributor Author

closing this in favor of #6004
