feat: parallel read #4
Conversation
This looks good in general, but I am worried that we don't test splitting the work between workers enough.
I'll do the full review after you rebase this.
Reviewable status: 0 of 16 files reviewed, all discussions resolved
remove obsolete comments RowsetIntersectRange and BigtableSampleRowSets tweaks
Force-pushed from d5f8011 to 5b8110a
Good luck! :D
It's rebased!
Reviewable status: 0 of 17 files reviewed, all discussions resolved
Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):
tensorflow::error::Code GoogleCloudErrorCodeToTfErrorCode( ::google::cloud::StatusCode code) {
Why do we need these whitespace changes?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):
// cbt::RowRange::InfiniteRange(),
Please remove this comment
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):
private:
  StatusOr<BigtableRowSetResource*> CreateResource() override {
All these whitespace changes seem unrelated.
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):
BigtableRowSetIntersectOp);

class BigtableRowSetIntersectTensorOp : public OpKernel {
As I mentioned somewhere below - I don't understand why this operator is necessary. The BigtableSampleRowSets already takes the user-provided row-set into account.
tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):
auto page = absl::make_unique<std::vector<char>>();
page->assign(data->begin(), data->end());
cache_buffer_map_.emplace(memc_key, page.release());
That seems accidental.
tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):
REGISTER_OP("BigtableRowSetIntersectTensor")
I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?
tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):
BigtableSampleRowSets
I think this name can be misleading. SampleRowKeys was accurate - this was a (hopefully evenly) spread out sample of the row keys. This is a (hopefully even) row range split (i.e. covering all the given row set - not a sample).
How about BigtableSplitRangeEvenly or something along those lines?
tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):
self, columns: List[str], num_parallel_calls=1,
I think this default is not a fortunate one.
multiprocessing.cpu_count() is a guess like any other but at least it's likely to be significantly different from read_rows.
Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why do we need these whitespace changes?
I forgot to run the linter on that part before; that's why it's making these changes. The good news is that it won't happen again.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
// cbt::RowRange::InfiniteRange(),
Please remove this comment
Done.
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
All these whitespace changes seem unrelated.
As I said, I ran the linter.
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
As I mentioned somewhere below - I don't understand why this operator is necessary. The BigtableSampleRowSets already takes the user-provided row-set into account.
Done.
tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
REGISTER_OP("BigtableRowSetIntersectTensor")
I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?
Done.
tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
BigtableSampleRowSets
I think this name can be misleading. SampleRowKeys was accurate - this was a (hopefully evenly) spread out sample of the row keys. This is a (hopefully even) row range split (i.e. covering all the given row set - not a sample). How about BigtableSplitRangeEvenly or something along those lines?
How about BigtableSplitRowSetEvenly, since the thing we are splitting is a user-specified row set?
tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
That seems accidental.
Whoops, good catch!
tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think this default is not a fortunate one. multiprocessing.cpu_count() is a guess like any other but at least it's likely to be significantly different from read_rows.
Alrighty. 🐿️
It turns out that TF already has a const value tf.data.AUTOTUNE which is set to the suggested number of threads, so I used that.
Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):
mutex mu_;
const std::vector<std::pair<std::string, std::string>> columns_;
cbt::Table table_;
Why is table_ not GUARDED_BY(mu_)?
Also, why do you even need it? I haven't noticed any uses?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):
private:
  BigtableClientResource& client_resource_;
  const core::ScopedUnref client_resource_unref_;
Why don't we need to unref the client resource anymore?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):
private:
  BigtableClientResource& client_resource_;
  io::BigtableRowSetResource& row_set_resource_;
Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):
OP_REQUIRES_OK(ctx, GetResourceFromContext(ctx, "row_set", &row_set_resource));
core::ScopedUnref row_set_resource_unref_(row_set_resource);
Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource.
Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).
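The ref-counting contract behind these comments can be sketched with a minimal RAII guard. The classes below are a hypothetical, stripped-down re-implementation for illustration only; the real types in TensorFlow are `core::RefCounted` and `core::ScopedUnref`, and resources looked up via `GetResourceFromContext` come back with a reference the caller must release.

```cpp
#include <cassert>

// Illustrative stand-in for core::RefCounted (not the real TensorFlow class).
class RefCounted {
 public:
  void Ref() { ++count_; }
  // Decrements the count and deletes the object when it reaches zero.
  bool Unref() {
    if (--count_ == 0) {
      delete this;
      return true;
    }
    return false;
  }
  int count() const { return count_; }

 protected:
  virtual ~RefCounted() = default;

 private:
  int count_ = 1;  // a resource lookup hands the caller one reference
};

// RAII guard mirroring core::ScopedUnref: drops the reference at scope exit,
// so early returns and OP_REQUIRES_OK failures cannot leak it.
class ScopedUnref {
 public:
  explicit ScopedUnref(RefCounted* obj) : obj_(obj) {}
  ~ScopedUnref() {
    if (obj_) obj_->Unref();
  }
  ScopedUnref(const ScopedUnref&) = delete;
  ScopedUnref& operator=(const ScopedUnref&) = delete;

 private:
  RefCounted* obj_;
};

// Demonstrates the pattern: the count stays at 1 while the guard is alive;
// the guard's destructor releases the reference and the object is deleted.
inline int DemoScopedUnref() {
  RefCounted* res = new RefCounted();  // as if returned by a resource lookup
  int observed;
  {
    ScopedUnref unref(res);
    observed = res->count();
  }  // reference dropped here
  return observed;
}
```

This is also why keeping a plain reference to the resource in a long-lived dataset (as in the quoted snippet) requires an explicit `Ref`/`Unref` pair, whereas copying out the resource's contents sidesteps the lifetime question entirely.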
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):
num_parallel_calls
That is not a self-explanatory name in this context. How about num_shards or num_splits?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):
(size_t)
Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).
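The suggested fix can be sketched as follows; `ClampSplits` and its signature are hypothetical names for illustration, not code from the PR. Explicitly instantiating `std::min<std::size_t>` makes both arguments convert through a visible, checked path instead of a silent C-style cast.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: clamp a user-requested split count (possibly signed)
// to the number of available tablets, without C-style casts.
inline std::size_t ClampSplits(long requested, std::size_t available) {
  if (requested < 0) return 0;  // guard before converting signed -> unsigned
  // Instead of `std::min((size_t)requested, available)`, instantiate the
  // template explicitly so the conversion is deliberate and visible:
  return std::min<std::size_t>(static_cast<std::size_t>(requested), available);
}
```

The same applies to the `(long)` cast flagged below: `static_cast<long>(...)` states the intent and lets the compiler reject conversions that a C-style cast would silently allow (e.g. casting away const).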
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):
(long)
static_cast, please.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):
private:
  mutable mutex mu_;
  std::string table_id_;
Please add the concurrency annotations.
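A minimal sketch of what the requested annotation looks like, assuming clang's thread-safety attributes (in TensorFlow the `GUARDED_BY` macro comes from the platform thread-annotations header). The class body and method names here are illustrative, not the PR's code.

```cpp
#include <mutex>
#include <string>

// clang understands the guarded_by attribute; on other compilers the macro
// compiles away. This mirrors the usual thread-annotations fallback.
#if defined(__clang__)
#define GUARDED_BY(x) __attribute__((guarded_by(x)))
#else
#define GUARDED_BY(x)
#endif

// Sketch: table_id_ is annotated as protected by mu_, so (with
// -Wthread-safety) clang warns on any access that does not hold the lock.
class DatasetState {
 public:
  void set_table_id(const std::string& id) {
    std::lock_guard<std::mutex> lock(mu_);
    table_id_ = id;
  }
  std::string table_id() const {
    std::lock_guard<std::mutex> lock(mu_);
    return table_id_;
  }

 private:
  mutable std::mutex mu_;
  std::string table_id_ GUARDED_BY(mu_);
};
```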
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
It's still here.
tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
It's still here.
tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
How about
BigtableSplitRowSetEvenly? since the thing we are splitting is a user-specified row set.
Even better.
tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Alrighty. 🐿️
It turns out that TF already has a const value tf.data.AUTOTUNE which is set to the suggested number of threads, so I used that.
Even better.
Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why is table_ not GUARDED_BY(mu_)? Also, why do you even need it? I haven't noticed any uses?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why don't we need to unref the client resource anymore?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource. Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
num_parallel_calls
That is not a self-explanatory name in this context. How about num_shards or num_splits?
Done. I think that num_splits might be best.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
(size_t)
Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
(long)
static_cast, please.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please add the concurrency annotations.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
It's still here.
No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.
tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
It's still here.
It's not 😄. BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at the full diff?
tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Even better.
Done.
tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Even better.
Done.
Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
I'm storing the data_client and row_set by value now, as you suggested.
Almost there.
Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
But you only commented it out rather than removing.
tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.
Oops, you're right
tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
It's not 😄. BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at the full diff?
Weird, but you're right - it looks good now.
Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
But you only commented it out rather than removing.
My bad, missed that one.
Reviewable status: 0 of 18 files reviewed, all discussions resolved (waiting on @dopiera)
In this PR we make the read methods accept a row_set, reading only the rows specified by the user. We also add a parallel read that leverages the sample_row_keys method to split work among workers.
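The splitting idea can be sketched as follows. The boundary keys returned by sample_row_keys mark tablet ends, so grouping consecutive tablets into contiguous ranges gives each worker a roughly even slice. `SplitWork`, its signature, and the distribution formula are illustrative assumptions, not the PR's actual implementation; the empty string stands for an open (infinite) endpoint, following Bigtable's convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: turn sampled boundary keys into num_workers contiguous
// [start, end) row ranges covering the whole key space.
inline std::vector<std::pair<std::string, std::string>> SplitWork(
    const std::vector<std::string>& sample_keys, std::size_t num_workers) {
  // Each sample key ends a tablet; add open endpoints on both sides.
  std::vector<std::string> bounds;
  bounds.push_back("");  // open start
  for (const auto& key : sample_keys) bounds.push_back(key);
  bounds.push_back("");  // open end

  const std::size_t num_tablets = bounds.size() - 1;
  const std::size_t splits = std::min<std::size_t>(num_workers, num_tablets);

  std::vector<std::pair<std::string, std::string>> ranges;
  for (std::size_t i = 0; i < splits; ++i) {
    // Distribute tablets as evenly as possible across the splits.
    const std::size_t begin = i * num_tablets / splits;
    const std::size_t end = (i + 1) * num_tablets / splits;
    ranges.emplace_back(bounds[begin], bounds[end]);
  }
  return ranges;
}
```

For example, with sampled keys {"d", "m"} and two workers this yields the ranges ("", "d") and ("d", ""), so the two workers together cover the entire row set without overlap.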