
Conversation

@kboroszko (Collaborator) commented Nov 3, 2021

In this PR we make the read methods accept a row_set, reading only the rows specified by the user.
We also add a parallel read that leverages the sample_row_keys method to split work among workers.
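The splitting idea above can be sketched in plain Python. This is a hypothetical illustration, not the actual tensorflow-io API: `split_row_range` and the key names are made up. The tablet boundary keys returned by sample_row_keys become cut points for a user-supplied row range, and each resulting chunk can be handed to a separate worker.

```python
def split_row_range(sampled_keys, start, end):
    """Split the range [start, end) at each sampled boundary key inside it.

    sampled_keys plays the role of the tablet boundaries returned by
    sample_row_keys; only keys strictly inside the range become cut points.
    """
    boundaries = [k for k in sampled_keys if start < k < end]
    points = [start] + boundaries + [end]
    # Adjacent cut points form the per-worker chunks.
    return list(zip(points, points[1:]))

chunks = split_row_range(["row050", "row100", "row150"], "row000", "row120")
# chunks: [("row000", "row050"), ("row050", "row100"), ("row100", "row120")]
```

Each chunk covers a disjoint slice of the requested range, so workers can read them independently without coordination.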



@dopiera left a comment

This looks good in general, but I am worried that we don't test splitting the work between workers enough.

I'll do the full review after you rebase this.

Reviewable status: 0 of 16 files reviewed, all discussions resolved

@kboroszko (Collaborator, Author) left a comment

Good luck! :D
It's rebased!

Reviewable status: 0 of 17 files reviewed, all discussions resolved

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):

tensorflow::error::Code GoogleCloudErrorCodeToTfErrorCode(
    ::google::cloud::StatusCode code) {

Why do we need these whitespace changes?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):

// cbt::RowRange::InfiniteRange(),

Please remove this comment


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):

 private:
  StatusOr<BigtableRowSetResource*> CreateResource() override {

All these whitespace changes seem unrelated.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

                        BigtableRowSetIntersectOp);

class BigtableRowSetIntersectTensorOp : public OpKernel {

As I mentioned somewhere below, I don't understand why this operator is necessary. BigtableSampleRowSets already takes the user-provided row set into account.


tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):

    auto page = absl::make_unique<std::vector<char>>();
    page->assign(data->begin(), data->end());
    cache_buffer_map_.emplace(memc_key, page.release());

That seems accidental.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

REGISTER_OP("BigtableRowSetIntersectTensor")

I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

BigtableSampleRowSets

I think this name can be misleading. SampleRowKeys was accurate: that was a (hopefully evenly) spread-out sample of the row keys. This is a (hopefully even) row-range split (i.e. it covers the whole given row set; it is not a sample).

How about BigtableSplitRangeEvenly or something along those lines?
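To illustrate the distinction, an even split (as opposed to a sample) partitions all of the work into near-equal shares. A minimal sketch, assuming the row set has already been resolved into a list of work items; `split_evenly` is a made-up helper, not the real op:

```python
def split_evenly(items, num_splits):
    """Partition items into at most num_splits contiguous, near-equal groups.

    Every item lands in exactly one group, i.e. the result covers the
    whole input rather than sampling from it.
    """
    if not items:
        return []
    num_splits = min(num_splits, len(items))
    base, extra = divmod(len(items), num_splits)
    groups, start = [], 0
    for g in range(num_splits):
        # The first `extra` groups absorb one leftover item each.
        size = base + (1 if g < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups
```

Concatenating the groups reproduces the input exactly, which is the "covering, not sampling" property the name should convey.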


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

        self,
        columns: List[str],
        num_parallel_calls=1,

I think this default is not a fortunate one.

multiprocessing.cpu_count() is a guess like any other, but at least it's likely to behave significantly differently from a plain read_rows.
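A default along those lines could look like this. This is a hedged sketch; `default_parallelism` is an illustrative name, not part of the library:

```python
import multiprocessing

def default_parallelism(requested=None):
    """Use the caller's value when given; otherwise guess with the CPU count."""
    if requested is not None:
        return requested
    # cpu_count() is a guess like any other, but it at least scales with
    # the machine instead of pinning the parallel read to a single worker.
    return multiprocessing.cpu_count()
```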

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 8 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 34 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why do we need these whitespace changes?

I forgot to run the linter on that part before; that's why it's making this change. The good news is that it won't happen again.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 155 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
// cbt::RowRange::InfiniteRange(),

Please remove this comment

Done.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 31 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

All these whitespace changes seem unrelated.

As I said, I ran the linter.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

As I mentioned somewhere below, I don't understand why this operator is necessary. BigtableSampleRowSets already takes the user-provided row set into account.

Done.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
REGISTER_OP("BigtableRowSetIntersectTensor")

I'm sorry, but I don't understand why this is needed. BigtableSampleRowSets already intersects the tablet list with the row set passed by the user, so why do it again?

Done.


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
BigtableSampleRowSets

I think this name can be misleading. SampleRowKeys was accurate: that was a (hopefully evenly) spread-out sample of the row keys. This is a (hopefully even) row-range split (i.e. it covers the whole given row set; it is not a sample).

How about BigtableSplitRangeEvenly or something along those lines?

How about BigtableSplitRowSetEvenly, since the thing we're splitting is a user-specified row set?


tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_block_cache.cc, line 762 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

That seems accidental.

Whoops, good catch!


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think this default is not a fortunate one.

multiprocessing.cpu_count() is a guess like any other, but at least it's likely to behave significantly differently from a plain read_rows.

Alrighty. 🐿️
It turns out that TF already has a constant, tf.data.AUTOTUNE, which is set to the suggested number of threads, so I used that.
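For context, tf.data.AUTOTUNE is a sentinel constant (-1) that tells tf.data to pick the parallelism level dynamically. A hypothetical wrapper might resolve it like this; `resolve_parallelism` is made up, and the real tf.data runtime tunes the level continuously rather than fixing it to the CPU count:

```python
import multiprocessing

AUTOTUNE = -1  # mirrors the value of tf.data.AUTOTUNE

def resolve_parallelism(num_parallel_calls):
    """Map the AUTOTUNE sentinel to a concrete worker count."""
    if num_parallel_calls == AUTOTUNE:
        # A simple static stand-in for tf.data's dynamic tuning.
        return multiprocessing.cpu_count()
    return num_parallel_calls
```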

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

  mutex mu_;
  const std::vector<std::pair<std::string, std::string>> columns_;
  cbt::Table table_;

Why is table_ not GUARDED_BY(mu_)?

Also, why do you even need it? I haven't noticed any uses.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):

 private:
  BigtableClientResource& client_resource_;
  const core::ScopedUnref client_resource_unref_;

Why don't we need to unref the client resource anymore?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

 private:
  BigtableClientResource& client_resource_;
  io::BigtableRowSetResource& row_set_resource_;

Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):

    OP_REQUIRES_OK(ctx,
                   GetResourceFromContext(ctx, "row_set", &row_set_resource));
    core::ScopedUnref row_set_resource_unref_(row_set_resource);

Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource.

Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):

num_parallel_calls

That is not a self-explanatory name in this context. How about num_shards or num_splits?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):

(size_t)

Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):

(long)

static_cast, please.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):

 private:
  mutable mutex mu_;
  std::string table_id_;

Please add the concurrency annotations.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

It's still here.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

It's still here.


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

How about BigtableSplitRowSetEvenly? since the thing we are splitting is a user-specified row set.

Even better.


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Alrighty. 🐿️
It turns out that TF already has a const value tf.data.AUTOTUNE which is set to the suggested number of threads, so I used that.

Even better.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why is table_ not GUARDED_BY(mu_)?

Also, why do you even need it? I haven't noticed any uses?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 316 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why don't we need to unref the client resource anymore?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why do we need to hold the whole resource? If we do, I think we need to ref and unref it - why not keep its contents instead?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 356 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please remove the trailing underscore from the local variable. Also, as mentioned earlier, I think it's easier if you pass the RowSet here instead of the RowSetResource.

Slightly unrelated, but that might also be the case for the client_resource (i.e. you could simply pass the DataClient).

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 405 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
num_parallel_calls

That is not a self-explanatory name in this context. How about num_shards or num_splits?

Done. I think that num_splits might be best.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 461 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
(size_t)

Please don't use C-style casts. Use static_cast, dynamic_cast, const_cast or reinterpret_cast depending on the context. Here, it will probably be easiest to explicitly instantiate the std::min template, i.e. std::min<std::size_t>(something).

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 464 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
(long)

static_cast, please.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 495 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please add the concurrency annotations.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

It's still here.

No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

It's still here.

It's not 😄. BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at the full diff?


tensorflow_io/core/ops/bigtable_ops.cc, line 106 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Even better.

Done.


tensorflow_io/python/ops/bigtable/bigtable_dataset_ops.py, line 47 at r2 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Even better.

Done.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 13 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 332 at r3 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

I'm storing the data_client and row_set by value now, as you suggested.

@dopiera left a comment

Almost there.

Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

But you only commented it out rather than removing.


tensorflow_io/core/kernels/bigtable/bigtable_row_set.cc, line 152 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

No it's not. This one is just the good old RowSet::Intersect wrapper, so the user can create a RowSet and Intersect it with a RowRange or something. The other one was called RowSetIntersectTensor and got removed.

Oops, you're right.


tensorflow_io/core/ops/bigtable_ops.cc, line 97 at r2 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

It's not 😄 . BigtableRowSetIntersectTensor doesn't exist. Are you sure you're looking at full diff?

Weird, but you're right - it looks good now.

@kboroszko (Collaborator, Author) left a comment

Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 272 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

But you only commented it out rather than removing.

My bad, missed that one.

@dopiera left a comment

Reviewable status: 0 of 18 files reviewed, all discussions resolved (waiting on @dopiera)

@kboroszko kboroszko merged commit 44fa86d into master Nov 22, 2021
@kboroszko kboroszko deleted the kb/parallel branch November 22, 2021 13:10
kboroszko added a commit that referenced this pull request Nov 23, 2021
dopiera pushed a commit that referenced this pull request Nov 24, 2021
dopiera pushed a commit that referenced this pull request Dec 3, 2021
dopiera pushed a commit that referenced this pull request Dec 3, 2021
kboroszko added a commit that referenced this pull request Dec 13, 2021
kboroszko added a commit that referenced this pull request Dec 20, 2021