cmd, core, eth: init tx lookup in background #20270

Closed
karalabe wants to merge 3 commits into ethereum:master from karalabe:gary-bg-reindex

Conversation

@karalabe (Member) commented Nov 12, 2019

This is a squashed and rebased version of #19853 from @rjl493456442. Opened a new PR because I didn't want to nuke the commit history of the old one, but with 13 conflicting commits, it was getting too annoying to fix commit by commit.


Quoting Gary from the original PR:

In our schema, we maintain an index for each transaction. The index is basically a pair, txHash -> blockNumber, indicating the canonical block in which the transaction is included. In other words, the tx index is what lets users query a transaction by its hash.

However, there are ~580M transactions on mainnet, so the indexes can take more than 10GB of space in the database. That's a lot.

The idea is that most users only care about the latest transactions they send, so we can drop historical indexes to reduce disk usage.

So in this PR we offer a new command-line flag called txlookuplimit. Users can specify how many indexes they want to maintain in the db; older ones are garbage-collected.

What's more, if a user wants to query historical transactions whose indexes have been GCed, they can raise txlookuplimit and Geth will regenerate the indexes in the background. If txlookuplimit is 0, all indexes are generated.
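For illustration, here is a minimal Go sketch of the lookup scheme described above (the "l" key prefix, the helper names and the in-memory map are assumptions for the sketch, not geth's actual core/rawdb code):

package main

import (
	"encoding/binary"
	"fmt"
)

// db is a toy stand-in for the key-value store; geth writes these entries
// through ethdb batches in core/rawdb.
var db = map[string][]byte{}

// lookupKey builds the txHash -> blockNumber index key ("l" prefix assumed).
func lookupKey(txHash [32]byte) string {
	return "l" + string(txHash[:])
}

// writeTxLookup indexes every transaction hash of a block to its number.
func writeTxLookup(txHashes [][32]byte, number uint64) {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	for _, h := range txHashes {
		db[lookupKey(h)] = enc
	}
}

// pruneTxLookup garbage-collects the index entries of a historical block,
// which is what a non-zero txlookuplimit triggers for blocks that fall
// out of the maintained window.
func pruneTxLookup(txHashes [][32]byte) {
	for _, h := range txHashes {
		delete(db, lookupKey(h))
	}
}

func main() {
	tx := [32]byte{0x01}
	writeTxLookup([][32]byte{tx}, 1000000)
	fmt.Println("indexed at block:", binary.BigEndian.Uint64(db[lookupKey(tx)]))
	pruneTxLookup([][32]byte{tx})
	_, ok := db[lookupKey(tx)]
	fmt.Println("present after GC:", ok)
}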

@holiman (Contributor) left a comment

Some comments, will continue reviewing later.

Comment thread: cmd/utils/flags.go
	}
	TxLookupLimitFlag = cli.Int64Flag{
		Name:  "txlookuplimit",
		Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
Contributor

The current phrasing sounds like a limit on how much it will index, but in fact it also determines how much it will delete. So in essence, it's how large an index to maintain.

Suggested change
Usage: "Number of recent blocks to index transactions-by-hash in (default = index all blocks)",
Usage: "Number of recent blocks to maintain transactions index by-hash (default = index all blocks)",

			return
		}
	}
	block := ReadBlock(db, ReadCanonicalHash(db, uint64(n)), uint64(n))
Contributor

The way this is done is simple, and that's good, I guess. It just irks me a bit that, given the format of ancients, it has a pretty large overhead. This is what happens, assuming that things aren't cached:

  1. Read hash from freezer (tiny 32-byte read)
  2. Read header RLP from freezer. Decode header RLP into a header.
  3. Read body RLP from freezer. Decode body RLP into types.Body.
  4. Construct a new Block from the header and body (copying header, transactions and uncles)
  5. And then call prepare on it

Whereas, for everything that's in ancients, it would basically suffice to:

  1. Read body RLP from freezer. Decode into types.Body.
  2. Call prepare on the body.Transactions

That would save us two disk accesses and a lot of memcpy. Maybe we could have a method ReadCanonicalTransactions(blocknumber), so we know that we can pull straight from ancients? I don't know what the best solution is, thinking aloud here...
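As a rough sketch, the suggested helper could look something like this (the "bodies" freezer table name and the placement are assumptions; the real table constants are unexported in core/rawdb):

package rawdb // hypothetical placement

import (
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethdb"
	"github.com/ethereum/go-ethereum/rlp"
)

// ReadCanonicalTransactions reads only the body RLP of a frozen block and
// decodes it, skipping the hash read, the header read/decode and the block
// assembly described in the five steps above.
func ReadCanonicalTransactions(db ethdb.AncientReader, number uint64) (types.Transactions, error) {
	data, err := db.Ancient("bodies", number) // freezer table name, assumed
	if err != nil {
		return nil, err
	}
	body := new(types.Body)
	if err := rlp.DecodeBytes(data, body); err != nil {
		return nil, err
	}
	return body.Transactions, nil
}

The caller would then call prepare on the returned transactions, as in step 2 of the short path above.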

		return errors.New("broken database")
	}
	// Push the block into the import queue and process contiguous ranges
	priority := -int64(block.NumberU64())
Contributor

In our particular case with tx indexing, we don't actually require the blocks to come in a strict sequence, so we could just operate on them as they come in and wouldn't have to mess about with the prque. But the queue is fairly small, so maybe that's nothing to worry about.

Actually, now that I think about it, does it really matter in InitDatabaseFromFreezer that we deal with the blocks strictly in order?

Member Author

If batches are atomic, then it doesn't matter.

Member Author

Ah no wait, it matters, because we commit batches after some data limit is reached. If we don't push blocks into batches in an ordered way and Geth terminates/crashes in between, we might end up with gaps.

That said, we're not setting the progress marker anywhere, so it might all just be "lost" and recomputed on restart.
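To make the gap/marker concern concrete, here is a toy sketch of the ordered-flush pattern under discussion (the txIndexTail marker key and the batch type are assumptions for the sketch, not geth's actual API):

package main

import (
	"encoding/binary"
	"fmt"
)

// batch is a toy stand-in for ethdb.Batch: puts buffer in memory and only
// reach the committed map atomically on write.
type batch struct {
	pending   map[string][]byte
	committed map[string][]byte
}

func (b *batch) put(k string, v []byte) { b.pending[k] = v }

func (b *batch) write() {
	for k, v := range b.pending {
		b.committed[k] = v
	}
	b.pending = map[string][]byte{}
}

func main() {
	db := map[string][]byte{}
	b := &batch{pending: map[string][]byte{}, committed: db}

	const idealBatchSize = 4 // flush threshold, tiny for the demo

	// Because blocks are pushed in strictly ascending order, everything below
	// the last committed marker is guaranteed present after a crash.
	for number := uint64(0); number < 10; number++ {
		b.put(fmt.Sprintf("index-%d", number), []byte{1})
		if len(b.pending) >= idealBatchSize {
			// Commit a progress marker atomically with the data so a restart
			// can resume from number+1 instead of recomputing everything.
			marker := make([]byte, 8)
			binary.BigEndian.PutUint64(marker, number+1)
			b.put("txIndexTail", marker) // hypothetical marker key
			b.write()
		}
	}
	b.write()
	fmt.Println("committed entries:", len(db))
}

With out-of-order pushes, a crash between two batch writes could leave a hole below the highest committed number; with ascending order plus an atomic marker, it cannot.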

Member

Yes, strict order is required in many cases. Take tx indexing: we always assume that the blocks above the TAIL have indexes available.

	}
	// hashBlock calculates block hash in advance using the multi-routine's concurrent
	// computing power.
	hashBlock := func(block *types.Block) { block.Hash() }
Contributor

This somehow seems odd to me... I mean, later on, during iteration, we will look up the hash from ancients, and then load the header and body. So it feels weird that we need to hash it again?

All we theoretically need to do is:

  • Traverse the ancient hashes and call WriteHeaderNumber(batch, hash, number) for each hash. There's no intrinsic need to actually go and dig up the (potentially large) body.

Member Author

True. I guess it was just a generalization of the iteration mechanism to use blocks and not just headers. There's indeed no reason to dig up the bodies (lol) if we only need the heads (lol).

Contributor

Here's how I would do it:

diff --git a/core/rawdb/chain_iterator.go b/core/rawdb/chain_iterator.go
index 7dfe69d70f..668ecc2dfc 100644
--- a/core/rawdb/chain_iterator.go
+++ b/core/rawdb/chain_iterator.go
@@ -166,19 +166,41 @@ func InitDatabaseFromFreezer(db ethdb.Database) {
        if err != nil || frozen == 0 {
                return
        }
-       // hashBlock calculates block hash in advance using the multi-routine's concurrent
-       // computing power.
-       hashBlock := func(block *types.Block) { block.Hash() }
-
-       // writeIndex injects hash <-> number mapping into the database.
-       writeIndex := func(batch ethdb.Batch, block *types.Block) { WriteHeaderNumber(batch, block.Hash(), block.NumberU64()) }
+       var (
+               batch  = db.NewBatch()
+               start  = time.Now()
+               logged = start.Add(-7 * time.Second) // Unindex during import is fast, don't double log
+               hash   common.Hash
+       )
+       for i := uint64(0); i < frozen; i++ {
+               if h, err := db.Ancient(freezerHashTable, i); err != nil {
+                       log.Crit("Failed to init database from freezer", "err", err)
+               } else {
+                       hash = common.BytesToHash(h)
+               }
 
-       if err := iterateCanonicalChain(db, 0, frozen, hashBlock, writeIndex, false, "Initializing database from freezer", "Initialized database from freezer"); err != nil {
-               log.Crit("Failed to init database from freezer", "err", err)
+               WriteHeaderNumber(batch, hash, i)
+               // If enough data was accumulated in memory or we're at the last block, dump to disk
+               if batch.ValueSize() > ethdb.IdealBatchSize {
+                       if err := batch.Write(); err != nil {
+                               log.Crit("Failed to write data to db", "err", err)
+                       }
+                       batch.Reset()
+               }
+               // If we've spent too much time already, notify the user of what we're doing
+               if time.Since(logged) > 8*time.Second {
+                       log.Info("Initializing database from freezer", "blocks", i, "total", frozen, "tail", i, "hash", hash, "elapsed", common.PrettyDuration(time.Since(start)))
+                       logged = time.Now()
+               }
+       }
+       if err := batch.Write(); err != nil {
+               log.Crit("Failed to write data to db", "err", err)
        }
-       hash := ReadCanonicalHash(db, frozen-1)
+       batch.Reset()
+
        WriteHeadHeaderHash(db, hash)
        WriteHeadFastBlockHash(db, hash)
+       log.Info("Initialized database from freezer", "blocks", frozen, "tail", frozen, "elapsed", common.PrettyDuration(time.Since(start)))
 }
 
 // IndexTransactions creates txlookup indices of the specified block range.

Contributor

And skip the generalized version, at least for InitDatabaseFromFreezer. Then I'd do optimized versions of tx indexing, and only then would I look to see if they can be generalized; I think the current generalization is too cumbersome.

@adamschmideg (Contributor)

An alternative approach is #20302
