-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support disk based reindexing #4440
Conversation
f6e7481
to
d408c44
Compare
ff7da76
to
06c39b1
Compare
06c39b1
to
6d9b28b
Compare
6d9b28b
to
e969700
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an expert of the posting package but have added some questions for you. How come adding hash index takes 4x time when temp data is written to disk whereas term index is faster now?
Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @mangalaman93, @manishrjain, and @martinmr)
posting/index.go, line 512 at r4 (raw file):
indexDir, err := ioutil.TempDir("", "") if err != nil { return err
Should we wrap the error so as to have more context as to where it is coming from when it is logged?
posting/index.go, line 523 at r4 (raw file):
indexDB, err := badger.Open(dbOpts) if err != nil { return err
Wrap error?
posting/index.go, line 566 at r4 (raw file):
// Write deltas into tmp index badger indexTxn := indexDB.NewTransaction(true)
Should we call defer indexTxn.Discard()
here?
posting/index.go, line 575 at r4 (raw file):
} if err := indexTxn.SetEntry(entry); err != nil { return nil, err
Does any cleanup need to happen for the txn
in case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.
posting/index.go, line 603 at r4 (raw file):
// flatten the LSM tree to optimize reads if err := indexDB.Flatten(5); err != nil {
Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.
posting/index.go, line 641 at r4 (raw file):
// which occurred before this schema mutation. Typically, we use // kv.Version as the timestamp. err := writer.SetAt([]byte(kv.Key), kv.Value, BitCompletePosting, r.startTs)
Isn't the Key already a byte slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
posting/index.go, line 569 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
tmpTxn.SetEntry
is not checked (fromerrcheck
)
Done.
posting/index.go, line 580 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
tmpTxn.Commit
is not checked (fromerrcheck
)
Done.
posting/index.go, line 601 at r2 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
indexDB.Flatten
is not checked (fromerrcheck
)
Done.
posting/index.go, line 512 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Should we wrap the error so as to have more context as to where it is coming from when it is logged?
Done.
posting/index.go, line 523 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Wrap error?
Done.
posting/index.go, line 566 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Should we call
defer indexTxn.Discard()
here?
Done.
posting/index.go, line 575 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Does any cleanup need to happen for the
txn
in case we return with an error here? Also, docs for badger transaction says that we must called Discard unless we are calling Commit.
Done.
posting/index.go, line 603 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Please add a comment about why was the value 5 chosen here and how much of a difference in total re-indexing time does Flattening makes here.
Done.
posting/index.go, line 641 at r4 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Isn't the Key already a byte slice?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r2, 1 of 1 files at r5.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot, @mangalaman93, @manishrjain, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 408 at r5 (raw file):
Quoted 6 lines of code…
q1 := `{ q(func: eq(name@en, "Runtime")) { name@en } }`
minor: Maybe retrieve the UID as well.
dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):
res, _, err := queryWithTs(q1, "application/graphql+-", "", 0) require.NoError(t, err) require.JSONEq(t, `{"data":{"q":[{"name@en":"Runtime"},{"name@en":"Runtime"}]}}`, res)
I would also add another triplet, query again, and check that the changes are reflected.
posting/index.go, line 509 at r5 (raw file):
// We write the index in a temporary badger first and then, // merge entries before writing them to p directry.
minor: this probably fits in a single line.
posting/index.go, line 561 at r5 (raw file):
"error in reading posting list from disk")
here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").
posting/index.go, line 589 at r5 (raw file):
Quoted 4 lines of code…
// Reset deltas for k := range txn.cache.deltas { delete(txn.cache.deltas, k) }
If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.
posting/index.go, line 611 at r5 (raw file):
randomly.
minor: change this to "arbitrarily". "randomly" sounds a bit off.
posting/index.go, line 623 at r5 (raw file):
it
maybe rename this to indexIt
or something similar so it's clear this iterator is traversing the index DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 413 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
I would also add another triplet, query again, and check that the changes are reflected.
Done.
posting/index.go, line 561 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
"error in reading posting list from disk")
here and some places below the error message say "error in ...". The "in" can be removed (e.g. "error reading posting list from disk").
Done.
posting/index.go, line 589 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
// Reset deltas for k := range txn.cache.deltas { delete(txn.cache.deltas, k) }
If this is a map and you want to delete every entry, wouldn't it be faster if you recreated the map entirely instead of deleting every entry one by one.
This is specifically optimized by go compiler to run faster than creating a new map. https://go-review.googlesource.com/c/go/+/110055/
posting/index.go, line 623 at r5 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
it
maybe rename this to
indexIt
or something similar so it's clear this iterator is traversing the index DB.
Done.
d9c411b
to
46a317f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @golangcibot, @manishrjain, and @pawanrawal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r6.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @golangcibot and @manishrjain)
1a4ba01
to
4757141
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs some work. Once done, let's review again.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @golangcibot and @mangalaman93)
dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):
package alpha
Add license
dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):
m1 := `{ set { <10000> <name> "Vidar Flataukan"@en .
Might want to use 1million dataset. Looks like a systest.
posting/index.go, line 512 at r7 (raw file):
// We write the index in a temporary badger first and then, // merge entries before writing them to p directory. indexDir, err := ioutil.TempDir("", "")
This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.
You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.
posting/index.go, line 522 at r7 (raw file):
WithSyncWrites(false). WithNumVersionsToKeep(math.MaxInt64). WithCompression(options.None).
Might want to consider some compression.
posting/index.go, line 538 at r7 (raw file):
// localized posting list cache, to avoid stressing or mixing up with the // global lcache (the LRU cache). txn := NewTxn(r.startTs)
Might not need to create this global txn.
posting/index.go, line 563 at r7 (raw file):
} if err := r.fn(pk.Uid, l, txn); err != nil {
Above this line, txn := NewTxn(r.startTs)?
posting/index.go, line 588 at r7 (raw file):
} stream.Send = func(kvList *bpb.KVList) error { for _, kv := range kvList.Kv {
Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.
Then you can use TxnWriter here. Then, you don't need to worry about overwrites.
posting/index.go, line 617 at r7 (raw file):
// Flatten the LSM tree to optimize reads, we chose to create 5 workers. // Flatten should not take too long compared to time taken in building the index. if err := indexDB.Flatten(5); err != nil {
This would be lot more expensive than just doing reads. Avoid this.
posting/index.go, line 623 at r7 (raw file):
// Now we write all the created posting lists to disk. glog.V(1).Infof("Rebuilding index for predicate %s: writing index to badger", r.attr) indexTxn := indexDB.NewTransaction(false)
Use stream framework.
posting/index.go, line 636 at r7 (raw file):
}() writer := NewTxnWriter(pstore) for indexIt.Rewind(); indexIt.Valid(); {
itr.
posting/index.go, line 658 at r7 (raw file):
// which occurred before this schema mutation. Typically, we use // kv.Version as the timestamp. err := writer.SetAt(kv.Key, kv.Value, BitCompletePosting, r.startTs)
if err := ; err != nil
If you can.
032247a
to
2a193cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 14 unresolved discussions (waiting on @golangcibot, @manishrjain, @martinmr, and @pawanrawal)
dgraph/cmd/alpha/reindex_test.go, line 1 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add license
Done.
dgraph/cmd/alpha/reindex_test.go, line 43 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might want to use 1million dataset. Looks like a systest.
Done.
posting/index.go, line 512 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This can go into /tmp, which is typically in RAM, which can cause issues. Do check for that. Might also want to consider using the current dir.
You can prefix the name here, tempDir will take a prefix in the second argument. Then, you can do a glob or something to delete all with that prefix.
Done.
posting/index.go, line 522 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might want to consider some compression.
Done.
posting/index.go, line 538 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Might not need to create this global txn.
Done.
posting/index.go, line 563 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Above this line, txn := NewTxn(r.startTs)?
Done.
posting/index.go, line 588 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Use a managed badger DB, and use a version counter, which increments for every key that gets written to the DB.
Then you can use TxnWriter here. Then, you don't need to worry about overwrites.
Done.
posting/index.go, line 617 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This would be lot more expensive than just doing reads. Avoid this.
Done.
posting/index.go, line 623 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Use stream framework.
Done.
posting/index.go, line 636 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
itr.
Done.
posting/index.go, line 658 at r7 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if err := ; err != nil
If you can.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @golangcibot, @mangalaman93, and @manishrjain)
posting/index.go, line 534 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add a TODO here.
Done.
posting/index.go, line 620 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
TODO here to replace this with Batch Writer. Also, check if batch writer has the throttle mechanism as we have in txn writer.
Also, check where else we can use batch writer.
Done.
3130741
to
200f77c
Compare
Co-Authored-By: Bot from GolangCI <[email protected]>
While re-indexing a predicate, we now write the temp data on disk instead of keeping it in memory. This allows us to recreate index for large datasets.
On 21 million dataset, we get following results:
On 18 GB p directory, we get following results:
Todo
* Use Batch Writer for writing data to temp badger
* Use batching to write to temp badger
* Use managed badger and incremental version
* With and without compressions (without compression is faster)
* With and without Flatten (without flatten is faster)
* On 15 GB p directory
* On 21 million dataset
This change is