cmd, core, eth: init tx lookup in background #20270

Closed
karalabe wants to merge 3 commits into ethereum:master from karalabe:gary-bg-reindex

Conversation

@karalabe (Member) commented Nov 12, 2019

This is a squashed and rebased version of #19853 from @rjl493456442. Opened a new PR because I didn't want to nuke the commit history of the old one, but with 13 conflicting commits, it was getting too annoying to fix commit by commit.


Quoting Gary from the original PR:

In our schema, we maintain an index for each transaction. The index is basically a pair, txHash -> blockNumber, indicating the canonical block in which the transaction is included. In other words, the tx index is what lets users query a transaction by its hash.

However, there are ~580M transactions on mainnet, so the indexes can take more than 10GB of space in the database. That's a lot.

The idea is that most users only care about the latest transactions they send, so we can drop historical indexes to reduce disk usage.

So in this PR we offer a new command-line flag called txlookuplimit. Users can specify how many indexes they want to maintain in the db; older ones are garbage-collected.

What's more, if a user wants to query historical transactions whose indexes have been GCed, they can raise txlookuplimit and Geth will regenerate the indexes in the background. If txlookuplimit is 0, all indexes are generated.
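For illustration, here is a minimal Go sketch of the lookup scheme described above (the "l" key prefix, the helper names and the in-memory map are assumptions for the sketch, not geth's actual core/rawdb code):

package main

import (
	"encoding/binary"
	"fmt"
)

// db is a toy stand-in for the key-value store; geth writes these entries
// through ethdb batches in core/rawdb.
var db = map[string][]byte{}

// lookupKey builds the txHash -> blockNumber index key ("l" prefix assumed).
func lookupKey(txHash [32]byte) string {
	return "l" + string(txHash[:])
}

// writeTxLookup indexes every transaction hash of a block to its number.
func writeTxLookup(txHashes [][32]byte, number uint64) {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	for _, h := range txHashes {
		db[lookupKey(h)] = enc
	}
}

// pruneTxLookup garbage-collects the index entries of a historical block,
// which is what a non-zero txlookuplimit triggers for blocks that fall
// out of the maintained window.
func pruneTxLookup(txHashes [][32]byte) {
	for _, h := range txHashes {
		delete(db, lookupKey(h))
	}
}

func main() {
	tx := [32]byte{0x01}
	writeTxLookup([][32]byte{tx}, 1000000)
	fmt.Println("indexed at block:", binary.BigEndian.Uint64(db[lookupKey(tx)]))
	pruneTxLookup([][32]byte{tx})
	_, ok := db[lookupKey(tx)]
	fmt.Println("present after GC:", ok)
}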

@holiman (Contributor) left a comment

Some comments, will continue reviewing later.

Comment thread: cmd/utils/flags.go
	}
	TxLookupLimitFlag = cli.Int64Flag{
		Name:  "txlookuplimit",
		Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
Contributor

The current phrasing sounds like a limit on how much it will index, but in fact it also determines how much it will delete. So in essence, it's how large an index to maintain.

Suggested change
Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
Usage: "Number of recent blocks to maintain transactions index by-hash (default = index all blocks)",

			return
		}
	}
	block := ReadBlock(db, ReadCanonicalHash(db, uint64(n)), uint64(n))
Contributor

The way this is done is simple, and that's good, I guess. It just irks me a bit that, given the format of ancients, it has a pretty large overhead. This is what happens, assuming that things aren't cached:

  1. Read hash from freezer (tiny 32-byte read)
  2. Read header RLP from freezer. Decode header RLP into a header.
  3. Read body RLP from freezer. Decode body RLP into types.Body.
  4. Construct a new Block from the header and body (copying header, transactions and uncles)
  5. And then call prepare on it

Whereas, for everything that's in ancients, it would basically suffice to:

  1. Read body RLP from freezer. Decode into types.Body.
  2. Call prepare on the body.Transactions

That would save us two disk accesses and a lot of memcpy. Maybe we could have a method ReadCanonicalTransactions(blocknumber), so we know that we can pull straight from ancients? I don't know what the best solution is, thinking aloud here...
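As a rough sketch, the suggested helper could look something like this (the "bodies" freezer table name and the placement are assumptions; the real table constants are unexported in core/rawdb):

package rawdb // hypothetical placement

import (
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethdb"
	"github.com/ethereum/go-ethereum/rlp"
)

// ReadCanonicalTransactions reads only the body RLP of a frozen block and
// decodes it, skipping the hash read, the header read/decode and the block
// assembly described in the five steps above.
func ReadCanonicalTransactions(db ethdb.AncientReader, number uint64) (types.Transactions, error) {
	data, err := db.Ancient("bodies", number) // freezer table name, assumed
	if err != nil {
		return nil, err
	}
	body := new(types.Body)
	if err := rlp.DecodeBytes(data, body); err != nil {
		return nil, err
	}
	return body.Transactions, nil
}

The caller would then call prepare on the returned transactions, as in step 2 of the short path above.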

		return errors.New("broken database")
	}
	// Push the block into the import queue and process contiguous ranges
	priority := -int64(block.NumberU64())
Contributor

In our particular case with tx indexing, we don't actually require the blocks to come in a strict sequence, so we could just operate on them as they come in and wouldn't have to mess about with the prque. But the queue is fairly small, so maybe that's nothing to worry about.

Actually, now that I think about it, does it really matter in InitDatabaseFromFreezer that we deal with the blocks strictly in order?

Member Author

If batches are atomic, then it doesn't matter.

Member Author

Ah no wait, it matters, because we commit batches after some data limit is reached. If we don't push blocks into batches in an ordered way and Geth terminates/crashes in between, we might end up with gaps.

That said, we're not setting the progress marker anywhere, so it might all just be "lost" and recomputed on restart.
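To make the gap/marker concern concrete, here is a toy sketch of the ordered-flush pattern under discussion (the txIndexTail marker key and the batch type are assumptions for the sketch, not geth's actual API):

package main

import (
	"encoding/binary"
	"fmt"
)

// batch is a toy stand-in for ethdb.Batch: puts buffer in memory and only
// reach the committed map atomically on write.
type batch struct {
	pending   map[string][]byte
	committed map[string][]byte
}

func (b *batch) put(k string, v []byte) { b.pending[k] = v }

func (b *batch) write() {
	for k, v := range b.pending {
		b.committed[k] = v
	}
	b.pending = map[string][]byte{}
}

func main() {
	db := map[string][]byte{}
	b := &batch{pending: map[string][]byte{}, committed: db}

	const idealBatchSize = 4 // flush threshold, tiny for the demo

	// Because blocks are pushed in strictly ascending order, everything below
	// the last committed marker is guaranteed present after a crash.
	for number := uint64(0); number < 10; number++ {
		b.put(fmt.Sprintf("index-%d", number), []byte{1})
		if len(b.pending) >= idealBatchSize {
			// Commit a progress marker atomically with the data so a restart
			// can resume from number+1 instead of recomputing everything.
			marker := make([]byte, 8)
			binary.BigEndian.PutUint64(marker, number+1)
			b.put("txIndexTail", marker) // hypothetical marker key
			b.write()
		}
	}
	b.write()
	fmt.Println("committed entries:", len(db))
}

With out-of-order pushes, a crash between two batch writes could leave a hole below the highest committed number; with ascending order plus an atomic marker, it cannot.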

Member

Yes, strict order is required in many cases. Take tx indexing: we always assume that the blocks above the TAIL have indexes available.

	}
	// hashBlock calculates block hash in advance using the multi-routine's concurrent
	// computing power.
	hashBlock := func(block *types.Block) { block.Hash() }
Contributor

This somehow seems odd to me... I mean, later on, during iteration, we will look up the hash from ancients, and then load the header and body. So it feels weird that we need to hash it again?

All we theoretically need to do is:

  • Traverse the ancient hashes and call WriteHeaderNumber(batch, hash, number) for each hash. There's no intrinsic need to actually go and dig up the (potentially large) body.

Member Author

True. I guess it was just a generalization of the iteration mechanism to use blocks and not just headers. There's indeed no reason to dig up the bodies (lol) if we only need the heads (lol).

Contributor

Here's how I would do it:

diff --git a/core/rawdb/chain_iterator.go b/core/rawdb/chain_iterator.go
index 7dfe69d70f..668ecc2dfc 100644
--- a/core/rawdb/chain_iterator.go
+++ b/core/rawdb/chain_iterator.go
@@ -166,19 +166,41 @@ func InitDatabaseFromFreezer(db ethdb.Database) {
        if err != nil || frozen == 0 {
                return
        }
-       // hashBlock calculates block hash in advance using the multi-routine's concurrent
-       // computing power.
-       hashBlock := func(block *types.Block) { block.Hash() }
-
-       // writeIndex injects hash <-> number mapping into the database.
-       writeIndex := func(batch ethdb.Batch, block *types.Block) { WriteHeaderNumber(batch, block.Hash(), block.NumberU64()) }
+       var (
+               batch  = db.NewBatch()
+               start  = time.Now()
+               logged = start.Add(-7 * time.Second) // Unindex during import is fast, don't double log
+               hash   common.Hash
+       )
+       for i := uint64(0); i < frozen; i++ {
+               if h, err := db.Ancient(freezerHashTable, i); err != nil {
+                       log.Crit("Failed to init database from freezer", "err", err)
+               } else {
+                       hash = common.BytesToHash(h)
+               }
 
-       if err := iterateCanonicalChain(db, 0, frozen, hashBlock, writeIndex, false, "Initializing database from freezer", "Initialized database from freezer"); err != nil {
-               log.Crit("Failed to init database from freezer", "err", err)
+               WriteHeaderNumber(batch, hash, i)
+               // If enough data was accumulated in memory or we're at the last block, dump to disk
+               if batch.ValueSize() > ethdb.IdealBatchSize {
+                       if err := batch.Write(); err != nil {
+                               log.Crit("Failed to write data to db", "err", err)
+                       }
+                       batch.Reset()
+               }
+               // If we've spent too much time already, notify the user of what we're doing
+               if time.Since(logged) > 8*time.Second {
+                       log.Info("Initializing database from freezer", "blocks", i, "total", frozen, "tail", i, "hash", hash, "elapsed", common.PrettyDuration(time.Since(start)))
+                       logged = time.Now()
+               }
+       }
+       if err := batch.Write(); err != nil {
+               log.Crit("Failed to write data to db", "err", err)
        }
-       hash := ReadCanonicalHash(db, frozen-1)
+       batch.Reset()
+
        WriteHeadHeaderHash(db, hash)
        WriteHeadFastBlockHash(db, hash)
+       log.Info("Initialized database from freezer", "blocks", frozen, "tail", frozen, "elapsed", common.PrettyDuration(time.Since(start)))
 }
 
 // IndexTransactions creates txlookup indices of the specified block range.

Contributor

And skip the generalized version, at least for InitDatabaseFromFreezer. Then I'd do optimized versions of tx indexing, and only then would I look to see if they can be generalized; I think the current generalization is too cumbersome.

@adamschmideg (Contributor)

An alternative approach is #20302
