cmd, core, eth: init tx lookup in background #20270
Conversation
holiman
left a comment
Some comments, will continue reviewing later..
```go
	}
	TxLookupLimitFlag = cli.Int64Flag{
		Name:  "txlookuplimit",
		Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
```
The current phrasing sounds like a limit on how much it will index, but in fact it also determines how much it will delete. So in essence, it's how large an index to maintain.
```diff
- Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
+ Usage: "Number of recent blocks to maintain transactions index for (default = index all blocks)",
```
```go
			return
		}
	}
	block := ReadBlock(db, ReadCanonicalHash(db, uint64(n)), uint64(n))
```
The way this is done is simple, and that's good, I guess. It just irks me a bit that, given the format of ancients, it has a pretty large overhead. This is what happens, assuming that things aren't cached:
- Read the hash from the freezer (a tiny 32-byte read)
- Read the header RLP from the freezer and decode it into a header
- Read the body RLP from the freezer and decode it into a types.Body
- Construct a new Block from the header and body (copying the header, transactions and uncles)
- And then call prepare on it

Whereas, for everything that's in ancients, it would basically suffice to:
- Read the body RLP from the freezer and decode it into a types.Body
- Call prepare on body.Transactions

That would save us 2 disk accesses and a lot of memcpy. Maybe we could have a method ReadCanonicalTransactions(blocknumber), so we know that we can pull straight from ancients? I don't know what the best solution is, thinking aloud here...
```go
			return errors.New("broken database")
		}
		// Push the block into the import queue and process contiguous ranges
		priority := -int64(block.NumberU64())
```
In our particular case with tx indexing, we don't actually require the blocks to come in a strict sequence, so we could just operate on them as they come in. Then we wouldn't have to mess about with the prque. But the queue is fairly small, so maybe that's nothing to worry about.
Actually, now that I think about it, does it really matter in InitDatabaseFromFreezer that we deal with the blocks strictly in order?
If batches are atomic, then it doesn't matter.
Ah no wait, it matters because we commit batches after some data limit is reached. If we don't push blocks into batches in an ordered way and terminate/crash geth in between, we might end up with gaps.
That said, we're not setting the progress marker anywhere, so it might all just be "lost" and recomputed on restart.
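To make the gap argument concrete, here is a minimal, self-contained sketch (not geth code; the `db`, `next` marker, and `indexInOrder` names are invented) of the ordered variant. Because blocks enter the batch strictly in sequence, every flush can safely advance a progress marker, so a crash between flushes loses at most the unflushed tail instead of leaving holes:

```go
package main

import "fmt"

// db is a toy stand-in for a key/value database: an indexed set plus a
// progress marker recording the next block still to be indexed.
type db struct {
	indexed map[uint64]bool
	next    uint64 // progress marker: all blocks below this are indexed
}

// indexInOrder indexes blocks [from, to] strictly in sequence, flushing the
// batch whenever it reaches batchLimit entries. Because each batch is a
// contiguous, ordered run, every flush may advance the marker; a crash
// between flushes can only lose the unflushed tail, never create a gap.
func indexInOrder(d *db, from, to uint64, batchLimit int) {
	var batch []uint64
	flush := func() {
		if len(batch) == 0 {
			return
		}
		for _, n := range batch {
			d.indexed[n] = true
		}
		d.next = batch[len(batch)-1] + 1 // safe: contiguous ordered run
		batch = batch[:0]
	}
	for n := from; n <= to; n++ {
		batch = append(batch, n)
		if len(batch) >= batchLimit {
			flush() // mimics committing once a size limit is exceeded
		}
	}
	flush()
}

func main() {
	d := &db{indexed: make(map[uint64]bool)}
	indexInOrder(d, 0, 9, 4)
	fmt.Println(d.next, len(d.indexed)) // 10 10
}
```

With out-of-order pushes, the marker could not be advanced on flush at all, which is exactly the recompute-on-restart situation described above.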
Yes, strict order is required for many cases. Like the tx indexing, we always assume that the block above the TAIL has indexes available.
```go
	}
	// hashBlock calculates block hash in advance using the multi-routine's concurrent
	// computing power.
	hashBlock := func(block *types.Block) { block.Hash() }
```
This somehow seems odd to me... I mean, later on, during iteration, we will look up the hash from ancients, and then load the header and body. So it feels weird that we need to hash it again?
All we theoretically need to do is:
- Traverse the ancient hashes and call WriteHeaderNumber(batch, hash, number) for each hash. There's no intrinsic need to actually go and dig up the (potentially large) body.
True. I guess it was just a generalization of the iteration mechanism to use blocks and not just headers. There's indeed no reason to dig up the bodies (lol) if we only need the heads (lol).
Here's how I would do it:

```diff
diff --git a/core/rawdb/chain_iterator.go b/core/rawdb/chain_iterator.go
index 7dfe69d70f..668ecc2dfc 100644
--- a/core/rawdb/chain_iterator.go
+++ b/core/rawdb/chain_iterator.go
@@ -166,19 +166,41 @@ func InitDatabaseFromFreezer(db ethdb.Database) {
 	if err != nil || frozen == 0 {
 		return
 	}
-	// hashBlock calculates block hash in advance using the multi-routine's concurrent
-	// computing power.
-	hashBlock := func(block *types.Block) { block.Hash() }
-
-	// writeIndex injects hash <-> number mapping into the database.
-	writeIndex := func(batch ethdb.Batch, block *types.Block) { WriteHeaderNumber(batch, block.Hash(), block.NumberU64()) }
+	var (
+		batch  = db.NewBatch()
+		start  = time.Now()
+		logged = start.Add(-7 * time.Second) // Unindex during import is fast, don't double log
+		hash   common.Hash
+	)
+	for i := uint64(0); i < frozen; i++ {
+		if h, err := db.Ancient(freezerHashTable, i); err != nil {
+			log.Crit("Failed to init database from freezer", "err", err)
+		} else {
+			hash = common.BytesToHash(h)
+		}
-	if err := iterateCanonicalChain(db, 0, frozen, hashBlock, writeIndex, false, "Initializing database from freezer", "Initialized database from freezer"); err != nil {
-		log.Crit("Failed to init database from freezer", "err", err)
+		WriteHeaderNumber(batch, hash, i)
+		// If enough data was accumulated in memory or we're at the last block, dump to disk
+		if batch.ValueSize() > ethdb.IdealBatchSize {
+			if err := batch.Write(); err != nil {
+				log.Crit("Failed to write data to db", "err", err)
+			}
+			batch.Reset()
+		}
+		// If we've spent too much time already, notify the user of what we're doing
+		if time.Since(logged) > 8*time.Second {
+			log.Info("Initializing database from freezer", "blocks", i, "total", frozen, "tail", i, "hash", hash, "elapsed", common.PrettyDuration(time.Since(start)))
+			logged = time.Now()
+		}
+	}
+	if err := batch.Write(); err != nil {
+		log.Crit("Failed to write data to db", "err", err)
 	}
-	hash := ReadCanonicalHash(db, frozen-1)
+	batch.Reset()
+
 	WriteHeadHeaderHash(db, hash)
 	WriteHeadFastBlockHash(db, hash)
+	log.Info("Initialized database from freezer", "blocks", frozen, "tail", frozen, "elapsed", common.PrettyDuration(time.Since(start)))
 }

 // IndexTransactions creates txlookup indices of the specified block range.
```
And skip the generalized version, at least for InitDatabaseFromFreezer. And then I'd do optimized versions of tx indexing, and only then would I look to see if they can be generalized, but I think the current generalization is too cumbersome
Here's the PR for that: https://github.com/karalabe/go-ethereum/pull/20
An alternative approach is #20302
This is a squashed and rebased version of #19853 from @rjl493456442. Opened a new PR because I didn't want to nuke the commit history of the old one, but with 13 conflicting commits, it was getting too annoying to fix commit by commit.
Quoting Gary from the original PR:
In our schema, for each transaction, we maintain an index. The index is basically a pair like txHash -> blockNumber, which indicates the canonical block number this transaction is included in. In other words, the tx index is necessary for users to query a transaction by its hash. However, there are 580M transactions on mainnet, so the indexes can take more than 10GB of space in the database. It's a lot.
So the idea is: most users only care about the latest transactions they send, so we can drop the historical indexes to reduce disk usage.
So in this PR, we offer a new command line flag called txlookuplimit. Users can specify how many indexes they want to maintain in the db; anything older will be GCed. What's more, if users want to query some historical transactions whose indexes have been GCed, they can change txlookuplimit, and Geth will re-generate the indexes in the background. If txlookuplimit is 0, all indexes are generated.
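The window semantics described above can be sketched in a few lines. This is an illustration, not geth's actual implementation: `indexRange` and its exact boundary handling are invented names, showing only that with limit L and head N the kept indexes cover the N-L+1..N range, and that limit 0 means everything:

```go
package main

import "fmt"

// indexRange returns the inclusive range of blocks whose tx indexes should
// be kept, given the current chain head and the configured txlookuplimit.
// A limit of 0 means index everything (hypothetical helper, illustrative
// boundary handling).
func indexRange(head, limit uint64) (from, to uint64) {
	if limit == 0 || head+1 <= limit {
		// No limit, or the chain is shorter than the window: keep it all.
		return 0, head
	}
	// Keep only the most recent `limit` blocks.
	return head + 1 - limit, head
}

func main() {
	fmt.Println(indexRange(8000000, 0))       // 0 8000000
	fmt.Println(indexRange(8000000, 2350000)) // 5650001 8000000
}
```

Raising the limit later simply widens this range, which is what triggers the background re-generation of the older indexes; lowering it shrinks the range and marks the rest for GC.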