Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support disk based reindexing #4440

Merged
merged 24 commits into from
Jan 9, 2020
Merged

Support disk based reindexing #4440

merged 24 commits into from
Jan 9, 2020

Conversation

mangalaman93
Copy link
Member

@mangalaman93 mangalaman93 commented Dec 18, 2019

While re-indexing a predicate, we now write the temp data on disk instead of keeping it in memory. This allows us to recreate index for large datasets.

On 21 million dataset, we get following results:

Disk Based Memory Based
Adding Exact Index 0m45.964s 0m47.661s
Adding Hash Index 0m50.706s 0m48.044s
Adding FullText Index 1m6.489s 2m14.316s
Adding Term Index 1m16.898s 3m5.429s

On 18 GB p directory, we get following results:

Disk Based Memory Based
Adding Exact Index 23m45.145s
Adding Hash Index 26m51.485s
Adding FullText Index 22m56.642s
Adding Term Index 22m49.796s
Adding Count Index 7m34.345s

Todo

  • Add test when same index key is written multiple times to tmp badger
  • Add logs printing progress report
  • Calculation of reduction in memory usage
  • Calculation of increase in re-indexing time
  • Performance optimization
    * Use Batch Writer for writing data to temp badger
    * Use batching to write to temp badger
  • Avoid using tmpfs for temp badger
  • Systest using 1million dataset
  • More Optimizations
    * Use managed badger and incremental version
  • Benchmarks
    * With and without compressions (without compression is faster)
    * With and without Flatten (without flatten is faster)
    * On 15 GB p directory
    * On 21 million dataset

This change is Reviewable

posting/index.go Outdated Show resolved Hide resolved
posting/index.go Outdated Show resolved Hide resolved
@mangalaman93 mangalaman93 force-pushed the aman/reindex branch 2 times, most recently from f6e7481 to d408c44 Compare December 19, 2019 11:03
@mangalaman93 mangalaman93 requested a review from a team as a code owner December 26, 2019 11:49
posting/index.go Outdated Show resolved Hide resolved
@mangalaman93 mangalaman93 changed the title [WIP] Support disk based reindexing upport disk based reindexing Dec 29, 2019
@mangalaman93 mangalaman93 changed the title upport disk based reindexing Support disk based reindexing Dec 29, 2019
Copy link
Contributor

@pawanrawal pawanrawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert of the posting package but have added some questions for you. How come adding hash index takes 4x time when temp data is written to disk whereas term index is faster now?

Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @mangalaman93, @manishrjain, and @martinmr)


posting/index.go, line 512 at r4 (raw file):

	indexDir, err := ioutil.TempDir("", "")
	if err != nil {
		return err

Should we wrap the error so as to have more context as to where it is coming from when it is logged?


posting/index.go, line 523 at r4 (raw file):

	indexDB, err := badger.Open(dbOpts)
	if err != nil {
		return err

Wrap error?


posting/index.go, line 566 at r4 (raw file):

		// Write deltas into tmp index badger
		indexTxn := indexDB.NewTransaction(true)

Should we call defer indexTxn.Discard() here?


posting/index.go, line 575 at r4 (raw file):

			}
			if err := indexTxn.SetEntry(entry); err != nil {
				return nil, err

Does any cleanup need to happen for the txn in case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.


posting/index.go, line 603 at r4 (raw file):

	// flatten the LSM tree to optimize reads
	if err := indexDB.Flatten(5); err != nil {

Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.


posting/index.go, line 641 at r4 (raw file):

			// which occurred before this schema mutation. Typically, we use
			// kv.Version as the timestamp.
			err := writer.SetAt([]byte(kv.Key), kv.Value, BitCompletePosting, r.startTs)

Isn't the Key already a byte slice?

Copy link
Member Author

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)


posting/index.go, line 569 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of tmpTxn.SetEntry is not checked (from errcheck)

Done.


posting/index.go, line 580 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of tmpTxn.Commit is not checked (from errcheck)

Done.


posting/index.go, line 601 at r2 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of indexDB.Flatten is not checked (from errcheck)

Done.


posting/index.go, line 512 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Should we wrap the error so as to have more context as to where it is coming from when it is logged?

Done.


posting/index.go, line 523 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Wrap error?

Done.


posting/index.go, line 566 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Should we call defer indexTxn.Discard() here?

Done.


posting/index.go, line 575 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Does any cleanup need to happen for the txn in case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.

Done.


posting/index.go, line 603 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.

Done.


posting/index.go, line 641 at r4 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Isn't the Key already a byte slice?

Done.

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r2, 1 of 1 files at r5.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot, @mangalaman93, @manishrjain, and @pawanrawal)


dgraph/cmd/alpha/reindex_test.go, line 408 at r5 (raw file):

Quoted 6 lines of code…

	q1 := `{
    q(func: eq(name@en, "Runtime")) {
      name@en
    }
  }`

minor: Maybe retrieve the UID as well.


dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):

	res, _, err := queryWithTs(q1, "application/graphql+-", "", 0)
	require.NoError(t, err)
	require.JSONEq(t, `{"data":{"q":[{"name@en":"Runtime"},{"name@en":"Runtime"}]}}`, res)

I would also add another triplet, query again, and check that the changes are reflected.


posting/index.go, line 509 at r5 (raw file):

	// We write the index in a temporary badger first and then,
	// merge entries before writing them to p directry.

minor: this probably fits in a single line.


posting/index.go, line 561 at r5 (raw file):

"error in reading posting list from disk")

here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").


posting/index.go, line 589 at r5 (raw file):

Quoted 4 lines of code…
		// Reset deltas
		for k := range txn.cache.deltas {
			delete(txn.cache.deltas, k)
		}

If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.


posting/index.go, line 611 at r5 (raw file):

randomly.

minor: change this to "arbitrarily". "randomly" sounds a bit off.


posting/index.go, line 623 at r5 (raw file):

it

maybe rename this to indexIt or something similar so it's clear this iterator is traversing the index DB.

@mangalaman93 mangalaman93 requested a review from martinmr January 1, 2020 20:27
Copy link
Member Author

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)


dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

I would also add another triplet, query again, and check that the changes are reflected.

Done.


posting/index.go, line 561 at r5 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
"error in reading posting list from disk")

here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").

Done.


posting/index.go, line 589 at r5 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
		// Reset deltas
		for k := range txn.cache.deltas {
			delete(txn.cache.deltas, k)
		}

If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.

This is specifically optimized by go compiler to run faster than creating a new map. https://go-review.googlesource.com/c/go/+/110055/


posting/index.go, line 623 at r5 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
it

maybe rename this to indexIt or something similar so it's clear this iterator is traversing the index DB.

Done.

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, and @pawanrawal)

Copy link
Contributor

@pawanrawal pawanrawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @golangcibot and @manishrjain)

@mangalaman93 mangalaman93 force-pushed the aman/reindex branch 2 times, most recently from 1a4ba01 to 4757141 Compare January 3, 2020 15:52
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some work. Once done, let's review again.

Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @golangcibot and @mangalaman93)


dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):

package alpha

Add license


dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):

	m1 := `{
    set {
      <10000> <name>	"Vidar Flataukan"@en	.

Might want to use 1million dataset. Looks like a systest.


posting/index.go, line 512 at r7 (raw file):

	// We write the index in a temporary badger first and then,
	// merge entries before writing them to p directory.
	indexDir, err := ioutil.TempDir("", "")

This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.

You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.


posting/index.go, line 522 at r7 (raw file):

		WithSyncWrites(false).
		WithNumVersionsToKeep(math.MaxInt64).
		WithCompression(options.None).

Might want to consider some compression.


posting/index.go, line 538 at r7 (raw file):

	// localized posting list cache, to avoid stressing or mixing up with the
	// global lcache (the LRU cache).
	txn := NewTxn(r.startTs)

Might not need to create this global txn.


posting/index.go, line 563 at r7 (raw file):

		}

		if err := r.fn(pk.Uid, l, txn); err != nil {

Above this line, txn := NewTxn(r.startTs)?


posting/index.go, line 588 at r7 (raw file):

	}
	stream.Send = func(kvList *bpb.KVList) error {
		for _, kv := range kvList.Kv {

Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.

Then you can use TxnWriter here. Then, you don't need to worry about overwrites.


posting/index.go, line 617 at r7 (raw file):

	// Flatten the LSM tree to optimize reads, we chose to create 5 workers.
	// Flatten should not take too long compared to time taken in building the index.
	if err := indexDB.Flatten(5); err != nil {

This would be lot more expensive than just doing reads. Avoid this.


posting/index.go, line 623 at r7 (raw file):

	// Now we write all the created posting lists to disk.
	glog.V(1).Infof("Rebuilding index for predicate %s: writing index to badger", r.attr)
	indexTxn := indexDB.NewTransaction(false)

Use stream framework.


posting/index.go, line 636 at r7 (raw file):

	}()
	writer := NewTxnWriter(pstore)
	for indexIt.Rewind(); indexIt.Valid(); {

itr.


posting/index.go, line 658 at r7 (raw file):

			// which occurred before this schema mutation. Typically, we use
			// kv.Version as the timestamp.
			err := writer.SetAt(kv.Key, kv.Value, BitCompletePosting, r.startTs)

if err := ; err != nil If you can.

@mangalaman93 mangalaman93 force-pushed the aman/reindex branch 2 times, most recently from 032247a to 2a193cd Compare January 6, 2020 01:20
Copy link
Member Author

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 14 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)


dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add license

Done.


dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Might want to use 1million dataset. Looks like a systest.

Done.


posting/index.go, line 512 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.

You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.

Done.


posting/index.go, line 522 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Might want to consider some compression.

Done.


posting/index.go, line 538 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Might not need to create this global txn.

Done.


posting/index.go, line 563 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Above this line, txn := NewTxn(r.startTs)?

Done.


posting/index.go, line 588 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.

Then you can use TxnWriter here. Then, you don't need to worry about overwrites.

Done.


posting/index.go, line 617 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This would be lot more expensive than just doing reads. Avoid this.

Done.


posting/index.go, line 623 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Use stream framework.

Done.


posting/index.go, line 636 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

itr.

Done.


posting/index.go, line 658 at r7 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

if err := ; err != nil If you can.

Done.

Copy link
Member Author

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @golangcibot, @mangalaman93, and @manishrjain)


posting/index.go, line 534 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add a TODO here.

Done.


posting/index.go, line 620 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

TODO here to replace this with Batch Writer. Also, check if batch writer has the throttle mechanism as we have in txn writer.

Also, check where else we can use batch writer.

Done.

posting/index.go Outdated Show resolved Hide resolved
Co-Authored-By: Bot from GolangCI <[email protected]>
@mangalaman93 mangalaman93 merged commit ca73489 into master Jan 9, 2020
@mangalaman93 mangalaman93 deleted the aman/reindex branch January 9, 2020 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants