-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace TxnWriter with WriteBatch #5007
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r2, 1 of 2 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @anurags92 and @manishrjain)
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 6 unresolved discussions (waiting on @anurags92 and @manishrjain)
posting/writer_test.go, line 2 at r6 (raw file):
/* * Copyright 2019 Dgraph Labs, Inc. and Contributors
2020
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r13, 2 of 2 files at r14, 2 of 3 files at r15.
Reviewable status: 5 of 6 files reviewed, 14 unresolved discussions (waiting on @anurags92, @manishrjain, and @martinmr)
go.mod, line 72 at r15 (raw file):
) //replace github.com/dgraph-io/badger/v2 => /home/alvis/go/src/github.com/dgraph-io/badger
remove comment
posting/index.go, line 576 at r15 (raw file):
var counter uint64 = 1 // TODO(Aman): Replace TxnWriter with WriteBatch. While we do that we should ensure that
remove comments
posting/index.go, line 629 at r15 (raw file):
} stream.Send = func(kvList *bpb.KVList) error { for _, kv := range kvList.Kv {
could we please benchmark creating tmpWriter within Send() then flushing at the end of Send(), v/s creating one tmpWriter and calling SetEntryAt within the transaction. (since SetEntryAt requires locks, is it better to lock when flushing instead)
posting/index.go, line 631 at r15 (raw file):
for _, kv := range kvList.Kv { e := &badger.Entry{Key: kv.Key, Value: kv.Value} if len(kv.UserMeta) > 0 {
add a comment on what this is doing
posting/index.go, line 634 at r15 (raw file):
e.UserMeta = kv.UserMeta[0] } if err := tmpWriter.SetEntryAt(e, kv.Version); err != nil {
double check that SetEntryAt is thread safe
posting/index.go, line 638 at r15 (raw file):
} } // if err := tmpWriter.Write(kvList); err != nil {
remove this comment
posting/index.go, line 691 at r15 (raw file):
// which occurred before this schema mutation. //err := writer.SetAt(kv.Key, kv.Value, BitCompletePosting, r.startTs)
same logic as above. Is it faster to create the writer withing the send()
posting/writer_test.go, line 1 at r15 (raw file):
/*
Can we move this file somewhere else? Or can we create a separate directory for experimental benchmarks or something? This isn't testing production code
posting/writer_test.go, line 90 at r15 (raw file):
b.ResetTimer() for i := 0; i < b.N; i++ {
for i from 1 to 5
txnWriter = newWriter
for entry from 1 to 10000
...
commit
end
worker/draft.go, line 714 at r15 (raw file):
} // TODO(Anurag): Are we using pkey? Remove if unused.
also on a fixme, please add date.
worker/snapshot.go, line 41 at r15 (raw file):
Flush() error } type newwriteBatch struct {
writeBatchWriter
worker/snapshot.go, line 49 at r15 (raw file):
} func (nwb *newwriteBatch) Write(kvs *bpb.KVList) error {
i see this twice, can we move this to
bulkAddKvsToBatch(nwb.wb, kvs.Kv)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the benchmark numbers to the PR description. Mostly it looks good to me.
Reviewable status: 1 of 7 files reviewed, 20 unresolved discussions (waiting on @anurags92, @gja, @manishrjain, and @martinmr)
posting/writer_test.go, line 46 at r16 (raw file):
} writeInBagder := func(db *badger.DB, KVList []kv, wg *sync.WaitGroup) {
Typo: writeInBagder => writeInBadger
posting/writer_test.go, line 51 at r16 (raw file):
for _, typ := range KVList { e := &badger.Entry{Key: typ.key, Value: typ.value} wb.SetEntryAt(e, 1)
Check error, please
posting/writer_test.go, line 57 at r16 (raw file):
} writeInBagder2 := func(wb *badger.WriteBatch, KVList []kv, wg *sync.WaitGroup) {
Typo here as well
posting/writer_test.go, line 83 at r16 (raw file):
dbOpts.Dir = tmpIndexDir dbOpts.ValueDir = tmpIndexDir var db, err2 = badger.OpenManaged(dbOpts)
Ni: err2 => err
posting/writer_test.go, line 93 at r16 (raw file):
k := typ.key v := typ.value w.SetAt(k, v, BitSchemaPosting, 1)
I think SetAt
returns an error. We should check the error here.
posting/writer_test.go, line 118 at r16 (raw file):
for _, typ := range KVList { e := &badger.Entry{Key: typ.key, Value: typ.value} wb.SetEntryAt(e, 1)
SetEntryAt
also returns error. We should check the error.
x/x.go, line 854 at r16 (raw file):
//WriteBatchWriter exposes Write API batch writer. func WriteBatchWriter(writer *badger.WriteBatch, kvList *bpb.KVList) error {
We already have a WriteBatchWriter defined in snapshot.go. Please call it something else. You could use the name suggested by @gja in the comments above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 20 unresolved discussions (waiting on @anurags92, @gja, @jarifibrahim, @manishrjain, and @martinmr)
go.mod, line 72 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
remove comment
Done.
posting/index.go, line 576 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
remove comments
Done.
posting/index.go, line 629 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
could we please benchmark creating tmpWriter within Send() then flushing at the end of Send(), v/s creating one tmpWriter and calling SetEntryAt within the transaction. (since SetEntryAt requires locks, is it better to lock when flushing instead)
Updated writer_test.go to run in multi-thread way.
posting/index.go, line 631 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
add a comment on what this is doing
Would have to figure out what is meta used for. Have wrapped this in one function in x package. Would update that once I have more information.
posting/index.go, line 634 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
double check that SetEntryAt is thread safe
It internally calls SetEntry which is thread safe. (badger/batch.go Line#103)
posting/index.go, line 638 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
remove this comment
Done.
posting/index.go, line 691 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
same logic as above. Is it faster to create the writer withing the send()
Replied above. Now that I think of I feel we can actually benchmark re-indexing and compare the two approaches on 21mil dataset. Working on it.
posting/writer_test.go, line 1 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
Can we move this file somewhere else? Or can we create a separate directory for experimental benchmarks or something? This isn't testing production code
Sure, let me find out what's the best place. Although I think @manishrjain prefers to keep the benchmark files in the test suite itself. Please note that this is does not run any unit tests.
posting/writer_test.go, line 90 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
for i from 1 to 5
txnWriter = newWriter
for entry from 1 to 10000
...
commit
end
Done.
posting/writer_test.go, line 46 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo: writeInBagder => writeInBadger
Done.
posting/writer_test.go, line 51 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Check error, please
Done.
posting/writer_test.go, line 57 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo here as well
Done.
posting/writer_test.go, line 93 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
I think
SetAt
returns an error. We should check the error here.
Done.
posting/writer_test.go, line 118 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
SetEntryAt
also returns error. We should check the error.
Done.
worker/draft.go, line 714 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
also on a fixme, please add date.
Done.
worker/snapshot.go, line 41 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
writeBatchWriter
Kept the name of the struct as the same. I have changed the name of the function below. I feel that makes more sense. Open to suggestions.
worker/snapshot.go, line 49 at r15 (raw file):
Previously, gja (Tejas Dinkar) wrote…
i see this twice, can we move this to
bulkAddKvsToBatch(nwb.wb, kvs.Kv)
Done.
x/x.go, line 854 at r16 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
We already have a WriteBatchWriter defined in snapshot.go. Please call it something else. You could use the name suggested by @gja in the comments above.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Just a few things.
Reviewed 2 of 6 files at r16, 5 of 5 files at r17.
Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @anurags92, @gja, @jarifibrahim, and @manishrjain)
posting/index.go, line 677 at r17 (raw file):
// which occurred before this schema mutation. if err := writer.SetEntryAt( (&badger.Entry{Key: kv.Key,
Create the entry first, and then set it here.
worker/snapshot.go, line 38 at r17 (raw file):
) type badgerWriter interface {
This can be removed now.
x/x.go, line 854 at r16 (raw file):
Previously, anurags92 (Anurag) wrote…
Done.
This should be moved to Badger, and made a func in WriteBatch.
Write(...)
, or WriteList
.
x/x.go, line 853 at r17 (raw file):
} //BulkWriteKVsBatchWriter exposes Write API batch writer.
Space after comment.
x/x.go, line 859 at r17 (raw file):
// if len(kv.UserMeta) > 0 { e.UserMeta = kv.UserMeta[0]
This is how Badger does it. We have to use a byte slice because protobuf does not have a single byte data type. It only has a repeated bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 7 files reviewed, 24 unresolved discussions (waiting on @anurags92, @gja, @jarifibrahim, and @manishrjain)
posting/index.go, line 677 at r17 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Create the entry first, and then set it here.
Done.
worker/snapshot.go, line 41 at r15 (raw file):
Previously, anurags92 (Anurag) wrote…
Kept the name of the struct as the same. I have changed the name of the function below. I feel that makes more sense. Open to suggestions.
Not applicable with new Write API in Badger
worker/snapshot.go, line 49 at r15 (raw file):
Previously, anurags92 (Anurag) wrote…
Done.
Not applicable with new Write API in Badger
worker/snapshot.go, line 38 at r17 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This can be removed now.
Done.
x/x.go, line 854 at r16 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This should be moved to Badger, and made a func in WriteBatch.
Write(...)
, orWriteList
.
Done. Badger PR: https://github.com/dgraph-io/badger/pull/1321/files
x/x.go, line 853 at r17 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Space after comment.
Don't need this function now.
x/x.go, line 859 at r17 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This is how Badger does it. We have to use a byte slice because protobuf does not have a single byte data type. It only has a repeated bytes.
Ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r18.
Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @anurags92, @gja, @jarifibrahim, and @manishrjain)
posting/writer_test.go, line 1 at r15 (raw file):
Previously, anurags92 (Anurag) wrote…
Sure, let me find out what's the best place. Although I think @manishrjain prefers to keep the benchmark files in the test suite itself. Please note that this is does not run any unit tests.
In Go, you'd put the tests and benchmarks in the same package. As Anurag mentioned, they don't get run unless you specifically run them.
worker/draft.go, line 714 at r15 (raw file):
Previously, anurags92 (Anurag) wrote…
Done.
Git Blame can show you the date. I don't think there's a need to add date explicitly.
x/x.go, line 44 at r18 (raw file):
"github.com/dgraph-io/dgo/v200" "github.com/dgraph-io/dgo/v200/protos/api" "github.com/golang/glog"
Avoid any unintentional change.
This comment has been minimized.
This comment has been minimized.
* Replace txnWriter with WriteBatch * Add benchmark file writer_test.go Co-authored-by: Ibrahim Jarif <[email protected]> Co-authored-by: Ibrahim Jarif <[email protected]>
Replace txnWrite with batchWrite to improve performance. txnWrite commits one transaction at a time whereas batchWrite can commit multiple transactions in a batch which reduces I/O and hence results in performance improvements.
Fixes DGRAPH-945 https://dgraph.atlassian.net/browse/DGRAPH-945
Present below are the bench-marking done for different workloads as part of the review process.
Testing production usage.
Re-indexing is triggered on
full-text
index onname
predicate in 21 million movie data set and corresponding time is noted for three different implementation of writer.Without any reasonable performance benefit, I go ahead with 2. Writebatch in this PR.
Testing ad-hoc data loads and compare two Write APIs in badger
Use
writer_test.go
to reproduce the benchmark numbers. The two Write APIs I compare are one in which lock is acquired for every SetEntry and the second in which one lock is acquired for one write-load.Although for the data-load I didn't see any significant performance difference between two Write APIs, I propose to go ahead with Single lock for one Write after discussing with @jarifibrahim and @ashish-goswami
To test hypothesis that using same Writer and Flushing it at the end should perform better than initializing multiple Writers and Flushing them all independently, I tested with heavier workloads.
KV List size 500000
KV List size 5000000
As expected for heavier workloads one Writer works better than multiple writers because WriteBatch can parallelize the writes in the background and doesn't wait for individual flushes.
This change is
Docs Preview: