
Conversation

@kboroszko (Collaborator) commented Nov 3, 2021

In this PR we make the read methods accept a row_set, reading only the rows specified by the user.
We also add a parallel read that leverages the sample_row_keys method to split work among workers.
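The splitting idea above can be sketched in plain Python. This is a hypothetical illustration, not the actual tensorflow-io API: `split_row_range` and the key names are made up. The tablet boundary keys returned by sample_row_keys become cut points for a user-supplied row range, and each resulting chunk can be handed to a separate worker.

```python
def split_row_range(sampled_keys, start, end):
    """Split the range [start, end) at each sampled boundary key inside it.

    sampled_keys plays the role of the tablet boundaries returned by
    sample_row_keys; only keys strictly inside the range become cut points.
    """
    boundaries = [k for k in sampled_keys if start < k < end]
    points = [start] + boundaries + [end]
    # Adjacent cut points form the per-worker chunks.
    return list(zip(points, points[1:]))

chunks = split_row_range(["row050", "row100", "row150"], "row000", "row120")
# chunks: [("row000", "row050"), ("row050", "row100"), ("row100", "row120")]
```

Each chunk covers a disjoint slice of the requested range, so workers can read them independently without coordination.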



@dopiera left a comment

This looks good in general, but I am worried that we don't test splitting the work between workers enough.

I'll do the full review after you rebase this.

Reviewable status: 0 of 16 files reviewed, all discussions resolved

@kboroszko (Collaborator, Author) left a comment

Good luck! :D
It's rebased!

Reviewable status: 0 of 17 files reviewed, all discussions resolved

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):

tensorflow::error::Code GoogleCloudErrorCodeToTfErrorCode(
    ::google::cloud::StatusCode code) {

Why do we need these whitespace changes?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):

// cbt::RowRange::InfiniteRange(),

Please remove this comment


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):

 private:
  StatusOr<BigtableRowSetResource*> CreateResource() override {

All these whitespace changes seem unrelated.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

                        BigtableRowSetIntersectOp);

class BigtableRowSetIntersectTensorOp : public OpKernel {

As I mentioned somewhere below, I don't understand why this operator is necessary. BigtableSampleRowSets already takes the user-provided row set into account.


tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):

    auto page = absl::make_unique<std::vector<char>>();
    page->assign(data->begin(), data->end());
    cache_buffer_map_.emplace(memc_key, page.release());

That seems accidental.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

REGISTER_OP("BigtableRowSetIntersectTensor")

I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

BigtableSampleRowSets

I think this name can be misleading. SampleRowKeys was accurate: that was a (hopefully evenly) spread-out sample of the row keys. This is a (hopefully even) row-range split (i.e. it covers the whole given row set; it is not a sample).

How about BigtableSplitRangeEvenly or something along those lines?
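To illustrate the distinction, an even split (as opposed to a sample) partitions all of the work into near-equal shares. A minimal sketch, assuming the row set has already been resolved into a list of work items; `split_evenly` is a made-up helper, not the real op:

```python
def split_evenly(items, num_splits):
    """Partition items into at most num_splits contiguous, near-equal groups.

    Every item lands in exactly one group, i.e. the result covers the
    whole input rather than sampling from it.
    """
    if not items:
        return []
    num_splits = min(num_splits, len(items))
    base, extra = divmod(len(items), num_splits)
    groups, start = [], 0
    for g in range(num_splits):
        # The first `extra` groups absorb one leftover item each.
        size = base + (1 if g < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups
```

Concatenating the groups reproduces the input exactly, which is the "covering, not sampling" property the name should convey.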


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

        self,
        columns: List[str],
        num_parallel_calls=1,

I think this default is not a fortunate one.

multiprocessing.cpu_count() is a guess like any other, but at least it's likely to behave significantly differently from a plain read_rows.
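A default along those lines could look like this. This is a hedged sketch; `default_parallelism` is an illustrative name, not part of the library:

```python
import multiprocessing

def default_parallelism(requested=None):
    """Use the caller's value when given; otherwise guess with the CPU count."""
    if requested is not None:
        return requested
    # cpu_count() is a guess like any other, but it at least scales with
    # the machine instead of pinning the parallel read to a single worker.
    return multiprocessing.cpu_count()
```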

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why do we need these whitespace changes?

I forgot to run the linter on that part before; that's why it's making this change. The good news is that it won't happen again.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
// cbt::RowRange::InfiniteRange(),

Please remove this comment

Done.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

All these whitespace changes seem unrelated.

As I said, I ran the linter.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

As I mentioned somewhere below, I don't understand why this operator is necessary. BigtableSampleRowSets already takes the user-provided row set into account.

Done.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
REGISTER_OP("BigtableRowSetIntersectTensor")

I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?

Done.


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
BigtableSampleRowSets

I think this name can be misleading. SampleRowKeys was accurate: that was a (hopefully evenly) spread-out sample of the row keys. This is a (hopefully even) row-range split (i.e. it covers the whole given row set; it is not a sample).

How about BigtableSplitRangeEvenly or something along those lines?

How about BigtableSplitRowSetEvenly, since the thing we're splitting is a user-specified row set?


tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

That seems accidental.

Whoops, good catch!


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think this default is not a fortunate one.

multiprocessing.cpu_count() is a guess like any other, but at least it's likely to behave significantly differently from a plain read_rows.

Alrighty. 🐿️
It turns out that TF already has a constant, tf.data.AUTOTUNE, which is set to the suggested number of threads, so I used that.
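For context, tf.data.AUTOTUNE is a sentinel constant (-1) that tells tf.data to pick the parallelism level dynamically. A hypothetical wrapper might resolve it like this; `resolve_parallelism` is made up, and the real tf.data runtime tunes the level continuously rather than fixing it to the CPU count:

```python
import multiprocessing

AUTOTUNE = -1  # mirrors the value of tf.data.AUTOTUNE

def resolve_parallelism(num_parallel_calls):
    """Map the AUTOTUNE sentinel to a concrete worker count."""
    if num_parallel_calls == AUTOTUNE:
        # A simple static stand-in for tf.data's dynamic tuning.
        return multiprocessing.cpu_count()
    return num_parallel_calls
```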

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

  mutex mu_;
  const std::vector<std::pair<std::string, std::string>> columns_;
  cbt::Table table_;

Why is table_ not GUARDED_BY(mu_)?

Also, why do you even need it? I haven't noticed any uses.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):

 private:
  BigtableClientResource& client_resource_;
  const core::ScopedUnref client_resource_unref_;

Why don't we need to unref the client resource anymore?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

 private:
  BigtableClientResource& client_resource_;
  io::BigtableRowSetResource& row_set_resource_;

Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):

    OP_REQUIRES_OK(ctx,
                   GetResourceFromContext(ctx, "row_set", &row_set_resource));
    core::ScopedUnref row_set_resource_unref_(row_set_resource);

Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource.

Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):

num_parallel_calls

That is not a self-explanatory name in this context. How about num_shards or num_splits?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):

(size_t)

Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):

(long)

static_cast, please.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):

 private:
  mutable mutex mu_;
  std::string table_id_;

Please add the concurrency annotations.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

It's still here.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

It's still here.


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

How about BigtableSplitRowSetEvenly? since the thing we are splitting is a user-specified row set.

Even better.


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Alrighty. 🐿️
It turns out that TF already has a const value tf.data.AUTOTUNE which is set to the suggested number of threads, so I used that.

Even better.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why is table_ not GUARDED_BY(mu_)?

Also, why do you even need it? I haven't noticed any uses?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why don't we need to unref the client resource anymore?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource.

Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
num_parallel_calls

That is not a self-explanatory name in this context. How about num_shards or num_splits?

Done. I think that num_splits might be best.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
(size_t)

Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
(long)

static_cast, please.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please add the concurrency annotations.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

It's still here.

No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

It's still here.

It's not 😄. BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at the full diff?


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Even better.

Done.


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Even better.

Done.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

I'm storing the data_client and row_set by value now, as you suggested.

@dopiera left a comment

Almost there.

Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

But you only commented it out rather than removing.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.

Oops, you're right.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

It's not 😄 . BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at full diff?

Weird, but you're right - it looks good now.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

But you only commented it out rather than removing.

My bad, missed that one.

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, all discussions resolved (waiting on @dopiera)

@kboroszko kboroszko merged commit 44fa86d into master Nov 22, 2021
@kboroszko kboroszko deleted the kb/parallel branch November 22, 2021 13:10
kboroszko added a commit that referenced this pull request Nov 23, 2021
dopiera pushed a commit that referenced this pull request Nov 24, 2021
dopiera pushed a commit that referenced this pull request Dec 3, 2021
dopiera pushed a commit that referenced this pull request Dec 3, 2021
kboroszko added a commit that referenced this pull request Dec 13, 2021
kboroszko added a commit that referenced this pull request Dec 20, 2021