[DB] Data error checks and frequency changes #872
Merged
Problem
PindexWalk->pprev assert failures occur randomly on startup
Root Cause
Note that the root cause is not yet completely understood. This is not a fix, but rather a mitigation that reduces the risk of the problem occurring.
Occasionally a corruption occurs when writing the block database: an index entry no longer points to the correct block, and a different block is read off disk instead. Usually the block that is read happens to be an orphan block. The corruption is definitely connected to the writing of side chains and orphans, but the root cause is still unknown. As a result, a valid block is not read from disk, so when the node begins to connect the blocks together (for the staking algorithm's calculations), it finds a block that has no valid previous block; in effect, it walks off the chain because a block is missing.
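To illustrate what "walking off the chain" means, here is a minimal sketch of a backward walk over `pprev` links that reports the first entry whose predecessor is missing or out of sequence. The `BlockIndexSketch` type and `FindBrokenLink` function are hypothetical names made up for this illustration; they are not Veil's actual block index or startup code.

```cpp
// Hypothetical stand-in for a block index entry (illustration only).
struct BlockIndexSketch {
    int height;                     // 0 is assumed to be the genesis block
    const BlockIndexSketch* pprev;  // previous block, or nullptr if missing
};

// Walk backwards from the tip. Returns the first entry above genesis whose
// previous link is missing or does not sit exactly one height below it,
// or nullptr if the walk reaches genesis without finding a problem.
inline const BlockIndexSketch* FindBrokenLink(const BlockIndexSketch* tip) {
    for (const BlockIndexSketch* walk = tip; walk != nullptr; walk = walk->pprev) {
        if (walk->height == 0) {
            return nullptr;  // reached genesis: chain is intact
        }
        if (walk->pprev == nullptr) {
            return walk;     // an assert on walk->pprev would fail here
        }
        if (walk->pprev->height != walk->height - 1) {
            return walk;     // predecessor is the wrong block
        }
    }
    return nullptr;          // a null tip has nothing to check
}
```

A check of this shape returns the problem entry instead of asserting, which is the kind of information that could then be written to a log.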
Mitigation
Note that this PR will not correct an already corrupt database; it will only minimize the occurrences. Anyone who currently has a corrupted database will have to recover their chain before being able to run this PR.
Several changes were made. When a problem block is detected on startup, information identifying it is added to the debug log file. When the header information is disconnected from the block information (which generally occurs with orphan blocks), the block is no longer written to disk. If a block is read when this occurs, that is also reported to the log file.
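As a rough sketch of the kind of check described above, the following compares the hash an index entry expects with the hash of the block actually read from disk, and builds a log message on mismatch. Comparing hashes is an assumption made for illustration; the PR text does not say exactly how the disconnect is detected. All type and function names here are hypothetical, and the message format is invented.

```cpp
#include <string>

// Hypothetical, simplified records (illustration only).
struct IndexEntrySketch {
    std::string hash;  // hash the index expects at this position
    int height;
};

struct DiskBlockSketch {
    std::string hash;  // hash of the block actually read from disk
};

// Returns an empty string when the block read from disk matches the index
// entry, otherwise a message suitable for writing to a debug log.
inline std::string DescribeMismatch(const IndexEntrySketch& entry,
                                    const DiskBlockSketch& block) {
    if (entry.hash == block.hash) {
        return std::string();
    }
    return "block index mismatch at height " + std::to_string(entry.height) +
           ": expected " + entry.hash + ", read " + block.hash;
}
```

A caller could skip writing the block and log the returned message whenever it is non-empty.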
The frequency at which blocks are written to disk has been changed so that a similar number of blocks is written at a time as in Bitcoin. Since Veil's block creation is 10 times faster, the write timers have been changed to 10% of their Bitcoin values. This significantly reduced the occurrences of the corruption. It is important to reiterate that this is not a fix; there is still a lingering issue. However, in a heavy test with two nodes, one running with the new write frequency and one without it, the node without the change found 63 instances of corrupted blocks in 1923 blocks (about 3.3%), while the node with the change saw zero corruptions.
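The scaling argument can be checked with a little arithmetic. The sketch below assumes Bitcoin's database write and flush intervals are one hour and 24 hours, and that block spacing is 10 minutes for Bitcoin and 1 minute for Veil (consistent with the "10 times faster" figure above). The constant and function names are illustrative and are not taken from this PR's diff.

```cpp
#include <cstdint>

// Assumed reference values, in seconds (not copied from the PR diff).
constexpr std::int64_t kBitcoinWriteInterval = 60 * 60;       // 1 hour
constexpr std::int64_t kBitcoinFlushInterval = 24 * 60 * 60;  // 24 hours
constexpr std::int64_t kBitcoinBlockSpacing  = 10 * 60;       // 10 minutes
constexpr std::int64_t kVeilBlockSpacing     = 60;            // assumed 1 minute
constexpr std::int64_t kSpeedupFactor        = 10;            // "10 times faster"

// Shrink a timer by the block-rate speedup, i.e. to 10% for a factor of 10.
constexpr std::int64_t ScaledInterval(std::int64_t interval, std::int64_t speedup) {
    return interval / speedup;
}

// Expected number of blocks produced during one timer interval.
constexpr std::int64_t ExpectedBlocks(std::int64_t interval, std::int64_t spacing) {
    return interval / spacing;
}

constexpr std::int64_t kVeilWriteInterval =
    ScaledInterval(kBitcoinWriteInterval, kSpeedupFactor);  // 360 seconds
constexpr std::int64_t kVeilFlushInterval =
    ScaledInterval(kBitcoinFlushInterval, kSpeedupFactor);  // 8640 seconds

// With the scaled timers, each write covers the same number of blocks.
static_assert(ExpectedBlocks(kVeilWriteInterval, kVeilBlockSpacing) ==
              ExpectedBlocks(kBitcoinWriteInterval, kBitcoinBlockSpacing),
              "scaled write interval should cover the same number of blocks");
```

Under these assumptions, an unscaled one-hour timer on a one-minute chain would batch about 60 blocks per write, versus about 6 with the scaled timer.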
For that reason, this PR is being pushed to greatly reduce the occurrences, which have become prevalent again as more people are in-wallet mining and staking at the same time. Issue #692 will remain open while work continues on finding the root cause and investigating ways to correct the corruption. This PR does, however, reduce the urgency of that research.