PARQUET-2188: [parquet-cpp] Add SkipRecords API to RecordReader #14142

fatemehp · 2022-09-15T17:53:43Z

The RecordReader is missing an API to skip records. There is a Skip method in the ColumnReader, but that skips based on the number of values/levels and not records. For repeated fields, this SkipRecords API will detect the record boundaries and correctly skip the right number of values for the requested number of records.

I plan to add a stress test separately.

fatemehp · 2022-09-15T17:54:03Z

@emkornfield could you take a look?

github-actions · 2022-09-15T19:55:08Z

https://issues.apache.org/jira/browse/PARQUET-2188

github-actions · 2022-09-15T19:55:11Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

emkornfield · 2022-09-19T17:24:34Z

@github-actions autotune

emkornfield · 2022-09-19T17:25:09Z

cpp/src/parquet/column_reader.cc

  // Read dictionary indices.
  *indices_read = ReadDictionaryIndices(indices_to_read, indices);
-  int64_t total_indices = std::max(num_def_levels, *indices_read);
+  int64_t total_indices = std::max<int>(num_def_levels, *indices_read);


should this be int64_t for the template parameter?
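
A small sketch of why the template argument matters here. With `std::max<int>`, both `int64_t` operands are converted to `int` before the comparison, so a large count can be narrowed; `std::max<int64_t>` keeps the full width. The constant below is an illustrative stand-in, not the value used in column_reader.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative stand-in; the real kMinLevelBatchSize may differ.
constexpr int64_t kMinLevelBatchSizeForDemo = 1024;

// Keeps the full 64-bit width of both operands.
int64_t BatchSizeWide(int64_t num_records) {
  assert(num_records >= 0);
  return std::max<int64_t>(kMinLevelBatchSizeForDemo, num_records);
}

// Converts both operands to int first; values that do not fit in an int are
// narrowed before the comparison happens.
int64_t BatchSizeNarrow(int64_t num_records) {
  assert(num_records >= 0);
  return std::max<int>(kMinLevelBatchSizeForDemo, num_records);
}
```

On platforms where `int` is 32 bits, `BatchSizeNarrow` cannot return a value such as `1LL << 40`, while `BatchSizeWide` can.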

emkornfield · 2022-09-19T17:26:24Z

cpp/src/parquet/column_reader.cc

    }

-    int64_t level_batch_size = std::max(kMinLevelBatchSize, num_records);
+    int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records);


same comment on type-template param for max

emkornfield · 2022-09-19T17:27:53Z

cpp/src/parquet/column_reader.cc

+  // non-repeated fields.
+  int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) {
+    ARROW_DCHECK(this->max_rep_level_ == 0);
+    ARROW_DCHECK(this->has_values_to_process());


The first DCHECK seems self-explanatory, but I'm not sure I understand the second one. Just to validate, do you think a DCHECK or throwing is more appropriate here?

I removed the check and added a return statement.

emkornfield · 2022-09-19T17:30:14Z

cpp/src/parquet/column_reader.cc

+  }
+
+  // Skip records for repeated fields. Returns number of skipped records.
+  // Skip records for repeated fields. Returns number of skipped records.


nit: repeated comment.

emkornfield · 2022-09-19T17:30:50Z

cpp/src/parquet/column_reader.cc

+  // Skip records that we have in our buffer. This function is only for
+  // non-repeated fields.
+  int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) {
+    ARROW_DCHECK(this->max_rep_level_ == 0);


is there ARROW_DCHECK_EQ?

emkornfield · 2022-09-19T17:34:45Z

cpp/src/parquet/column_reader.cc

+    // Keep filling the buffer and skipping until we reach the desired number
+    // of records or we run out of values in the column chunk.
+    int64_t skipped_records = 0;
+    int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records);


int64_t for max?

emkornfield · 2022-09-19T17:39:16Z

cpp/src/parquet/column_reader.cc

+
+      ReserveLevels(batch_size);
+
+      int16_t* def_levels = this->def_levels() + levels_written_;


why is levels_written_ important here if we are discarding data?

When reading repetition and definition levels, we append them to the buffer that we already have. When we figure out how many of those we need to skip, we shift the values to the left to skip them.

Your comment revealed a bug where I was not shifting the values. I fixed it and checked that in the tests.

Hmm, that doesn't seem to answer the question? Why bump levels_written_ in this loop if the levels are meant to be skipped/discarded anyway? It seems you'll reserve superfluous buffer space for the levels.

I am bumping it here for correctness. At any point in time levels_written_ shows the end of the levels that are in the buffer. So we update it right here after we read a batch of levels. Note that we may not throw away all the levels that we read here. We may only throw away some of them in DelimitAndSkipRecordsInBuffer. When we throw away levels, we will update levels_written_ accordingly.

You are bringing up a good point here. We actually can read the values that we want to skip into a separate buffer and throw them away, which will then reduce the amount of shifting that we have to do. It can make the code a bit more complicated though since I need to consume the values from this buffer first, then read into the scratch buffer, and if anything is left transfer it over. I will keep this in mind as an optimization on top of this pull request.
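
The bookkeeping described in this thread can be sketched roughly as follows. This is a simplified, hypothetical model (`LevelBuffer` is not a class in the PR): new batches are appended at `levels_written`, and discarding levels shifts the unconsumed tail left and moves both counters back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the level bookkeeping discussed above.
struct LevelBuffer {
  std::vector<int16_t> levels;  // backing storage, size == levels_written
  int64_t levels_written = 0;   // end of the valid levels in the buffer
  int64_t levels_position = 0;  // next level that has not been consumed

  // Append a freshly read batch after the levels that are already buffered.
  void Append(const std::vector<int16_t>& batch) {
    levels.insert(levels.end(), batch.begin(), batch.end());
    levels_written += static_cast<int64_t>(batch.size());
  }

  // Drop the levels in [start, levels_position) by shifting the unconsumed
  // tail to the left, then move both counters back by the size of the gap.
  void ThrowAwayLevels(int64_t start) {
    assert(start >= 0 && start <= levels_position);
    const int64_t gap = levels_position - start;
    if (gap == 0) return;
    std::copy(levels.begin() + levels_position, levels.begin() + levels_written,
              levels.begin() + start);
    levels_written -= gap;
    levels_position -= gap;
    levels.resize(levels_written);
  }
};
```

Because only some of the freshly read levels may end up being discarded, `levels_written` has to reflect the appended batch before the skip decision is made.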

emkornfield · 2022-09-19T17:40:30Z

cpp/src/parquet/column_reader.cc

+  // Read 'num_values' values and throw them away.
+  int64_t ReadAndThrowAway(int64_t num_values) {
+    int64_t values_left = num_values;
+    int64_t batch_size = 1024;  // ReadBatch with a smaller memory footprint


should this be initialized from a constant or config?

emkornfield · 2022-09-19T17:42:06Z

cpp/src/parquet/column_reader.cc

+    // This will be enough scratch space to accommodate 16-bit levels or any
+    // value type
+    int value_size = type_traits<DType::type_num>::value_byte_size;
+    std::shared_ptr<ResizableBuffer> scratch = AllocateBuffer(


I forget how this works for variable-length types. Is it still sufficient for those?

I think it should work since the value for the variable length types is technically a length and a pointer. I will add a separate test for them.

In parquet/types.h

template <>
struct type_traits<Type::BYTE_ARRAY> {
  using value_type = ByteArray;
  static constexpr int value_byte_size = sizeof(ByteArray);
  static constexpr const char* printf_code = "s";
};

I added the test for ByteArray.

emkornfield · 2022-09-19T17:42:38Z

cpp/src/parquet/column_reader.cc

+          batch_size, reinterpret_cast<T*>(scratch->mutable_data()));
+      values_left -= values_read;
+    } while (values_read > 0 && values_left > 0);
+    return num_values - values_left;


should there be a validation here on values_read and num_values?

I added a check to check the result once we return from this function. Is that what you meant?

Well, apparently one call site was updated to check the result. Why not check the result here instead?
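
A rough sketch of the alternative raised here, where the check lives inside the helper so that no call site can forget it. `ReadBatchFn` is a hypothetical stand-in for the reader's batch-read call, and the batch size is an assumed constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for ReadBatch: fills `out` with up to `batch_size`
// values and returns how many were actually produced.
using ReadBatchFn = std::function<int64_t(int64_t batch_size, int32_t* out)>;

// Reads `num_values` values in small batches into scratch space and discards
// them. Throws if the source runs dry first, so callers need not check.
void ReadAndThrowAwayChecked(const ReadBatchFn& read_batch, int64_t num_values) {
  assert(num_values >= 0);
  constexpr int64_t kBatchSize = 1024;  // assumed scratch size, in values
  std::vector<int32_t> scratch(kBatchSize);
  int64_t values_left = num_values;
  while (values_left > 0) {
    const int64_t to_read = std::min(kBatchSize, values_left);
    const int64_t values_read = read_batch(to_read, scratch.data());
    if (values_read <= 0) break;  // source exhausted
    values_left -= values_read;
  }
  if (values_left > 0) {
    throw std::runtime_error("Could not read and throw away " +
                             std::to_string(values_left) + " of " +
                             std::to_string(num_values) + " requested values");
  }
}
```

Reporting both the requested and the missing counts in the message also helps when debugging a short read.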

emkornfield · 2022-09-19T17:43:46Z

cpp/src/parquet/column_reader.cc

    // Conservative upper bound
    const int64_t possible_num_values =
-        std::max(num_records, levels_written_ - levels_position_);
+        std::max<int>(num_records, levels_written_ - levels_position_);


emkornfield · 2022-09-19T17:43:53Z

cpp/src/parquet/column_reader.cc

      // No repetition levels, skip delimiting logic. Each level represents a
      // null or not null entry
-      records_read = std::min(levels_written_ - levels_position_, num_records);
+      records_read = std::min<int>(levels_written_ - levels_position_, num_records);


emkornfield · 2022-09-19T17:44:20Z

cpp/src/parquet/column_reader.h

  /// \return number of records read
  virtual int64_t ReadRecords(int64_t num_records) = 0;
+
+  /// \brief Attempt to skip indicated number of records from column chunk


Let's add a detail here to clarify that this is the number of rows/records and not values.

emkornfield · 2022-09-19T17:45:38Z

cpp/src/parquet/column_reader.h

+  std::shared_ptr<::arrow::ResizableBuffer> values_;
+  // In the case of false (BYTE_ARRAY), don't allocate the values buffer
+  // (when we directly read into builder classes).
+  bool uses_values_;


why was this moved up?

I must have re-ordered it to improve readability.

emkornfield · 2022-09-19T17:52:02Z

cpp/src/parquet/column_reader.cc


 template <typename DType>
-class TypedRecordReader : public ColumnReaderImplBase<DType>,
+class TypedRecordReader : public TypedColumnReaderImpl<DType>,


why the change in the base class?

To have access to the ColumnReader's Skip method.

emkornfield · 2022-09-19T17:55:07Z

cpp/src/parquet/column_reader_test.cc

+                                 read_values + record_reader->values_written() -
+                                     record_reader->null_count());
+
+  ASSERT_TRUE(vector_equal(read_vals, {20, 20}));


GMock should be available in arrow, so it would be clearer to use EXPECT_THAT(read_values, ElementsAre(20, 20)); here.

emkornfield

Thanks for adding test coverage. Left a few comments as I'm not clear on the exact approach here, but at a high level this looks good.


wgtmac · 2022-09-22T09:04:21Z

cpp/src/parquet/column_reader.cc

      levels_capacity_ = levels_remaining;
    }

-    records_read_ = 0;


Should we rename records_read_ to records_processed_ and update it accordingly? It is useful when we want to check the current position of the reader.

This variable is currently not used anywhere, so I am inclined towards removing it since it will be another member variable to keep updated. We could always add it back later and actually use it for checking the current position of the reader.

wgtmac · 2022-09-22T09:21:44Z

cpp/src/parquet/column_reader.cc

+    return num_values - values_left;
+  }
+
+  int64_t SkipRecords(int64_t num_records) override {


It seems that all the skip operations called below actually read and then discard some records (unless the remaining values of the current page can be skipped). Why not simply call ReadRecords(num_records) internally and reset the buffers?

The idea is to enable Skipping entire pages at a time if the number of records to skip is sufficiently large.

Can you explain where the "Skipping entire pages at a time" part happens?

To give some context, suppose that we have M records in the column chunk and we want to read only record N from the column. In that case, we can do something like this. SkipRecords(N-1), ReadRecords(1), SkipRecords(M - N). We can do some optimizations here. For example for the skip to the end of the column chunk, we would not need to even look at page headers.

Ok, but are those optimizations done in this PR?

No, here is what we have now: Consider a non-repeated field, and that there are 10 pages with 100 values each. SkipRecords(900) will skip "decoding" the first 9 pages. It will still look at the page headers to find out how many values there are per page.

SkipRecords(900) will skip "decoding" the first 9 pages.

Can you point where this happens?

It happens in TypedColumnReader::Skip.

Ahah. I see, thanks.
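
The 10-page example from this thread can be sketched as below. This is only a model of the idea, not the logic in TypedColumnReader::Skip; `PageInfo`, `SkipStats` and `SkipNonRepeated` are hypothetical names, and the sketch assumes a non-repeated field where one level corresponds to one record.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of what a page header tells us.
struct PageInfo {
  int64_t num_values;
};

struct SkipStats {
  int64_t pages_skipped_without_decoding = 0;
  int64_t values_decoded_and_discarded = 0;
};

// For a non-repeated field, whole pages can be dropped based on the header
// alone; only a partially skipped page needs decoding.
SkipStats SkipNonRepeated(const std::vector<PageInfo>& pages, int64_t num_to_skip) {
  assert(num_to_skip >= 0);
  SkipStats stats;
  for (const PageInfo& page : pages) {
    if (num_to_skip == 0) break;
    if (num_to_skip >= page.num_values) {
      num_to_skip -= page.num_values;
      ++stats.pages_skipped_without_decoding;
    } else {
      stats.values_decoded_and_discarded += num_to_skip;
      num_to_skip = 0;
    }
  }
  return stats;
}
```

With 10 pages of 100 values each, skipping 900 drops nine pages using only their headers; skipping 950 additionally decodes and discards 50 values from the tenth page.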

emkornfield · 2022-10-18T04:58:22Z

cpp/src/parquet/column_reader.cc

+    int16_t* def_data = def_levels();
+    std::copy(def_data + levels_position_, def_data + levels_written_,
+              def_data + levels_position_ - gap);
+    PARQUET_THROW_NOT_OK(def_levels_->Resize(levels_remaining * sizeof(int16_t), false));


nit: comment the literal. This doesn't try to release the memory, correct?

Correct. Added parameter comments.

emkornfield · 2022-10-18T04:59:52Z

cpp/src/parquet/column_reader.cc

+    int64_t levels_remaining = levels_written_ - gap;
+
+    int16_t* def_data = def_levels();
+    std::copy(def_data + levels_position_, def_data + levels_written_,


can these two ranges overlap?

Yes, they can. However levels_position - gap is smaller than levels_position (we return for gap = 0 above). So we can safely use std::copy:
"Copies all elements in the range [first, last) starting from first and proceeding to last - 1. The behavior is undefined if d_first is within the range [first, last). In this case, std::copy_backward may be used instead."
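
To make the overlap rule concrete, here is a small sketch with hypothetical helper names. Shifting left writes to a destination that starts before `first`, which `std::copy` allows; shifting right writes into the source range, which is the case for `std::copy_backward`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Removes the `gap` elements that end at `position` by shifting the tail
// [position, written) to the left. The destination begins before `first`,
// so std::copy is well defined even though the ranges overlap.
void ShiftLeft(std::vector<int16_t>& data, int64_t position, int64_t written,
               int64_t gap) {
  assert(gap > 0 && gap <= position && position <= written);
  assert(written <= static_cast<int64_t>(data.size()));
  std::copy(data.begin() + position, data.begin() + written,
            data.begin() + (position - gap));
}

// Opens a gap of `gap` elements at `position` by moving the tail to the
// right. Here the destination starts inside the source range, so
// std::copy_backward is the appropriate call.
void ShiftRight(std::vector<int16_t>& data, int64_t position, int64_t written,
                int64_t gap) {
  assert(gap > 0 && position >= 0 && position <= written);
  assert(written + gap <= static_cast<int64_t>(data.size()));
  std::copy_backward(data.begin() + position, data.begin() + written,
                     data.begin() + written + gap);
}
```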

emkornfield · 2022-10-18T05:00:51Z

cpp/src/parquet/column_reader.cc

    if (this->max_rep_level_ > 0) {
-      std::copy(rep_levels() + levels_position_, rep_levels() + levels_written_,
-                rep_levels() + levels_position_ - gap);
+      int16_t* rep_data = rep_levels();


Does it pay to make this a helper method/lambda that can be applied to a buffer?

emkornfield · 2022-10-18T05:02:20Z

cpp/src/parquet/column_reader.cc

+    int64_t values_seen = 0;
+    int64_t skipped_records = DelimitRecords(num_records, &values_seen);
+    if (ReadAndThrowAwayValues(values_seen) != values_seen) {
+      throw ParquetException("Could not read and throw away requested values");


adding details on values read/values and position might be useful when debugging issues.

Done. We do not buffer values so we do not have a position to report here. I reported the number of values that were requested and we could not read.

fatemehp · 2022-10-25T20:56:21Z

Addressed comments. @emkornfield, @pitrou could you take a look? Thanks!


emkornfield · 2022-10-26T20:26:14Z

@pitrou did you want to rereview? Or are you OK if I merge this?

pitrou · 2022-10-26T21:01:41Z

@emkornfield I'll take a look, thank you!

pitrou

Thanks a lot for the update @fatemehp .

I like the extensive testing and the extensive comments you added, even though I still have some comments and suggestions below.

Nice work overall!

pitrou · 2022-10-27T13:10:49Z

cpp/src/parquet/column_reader.cc

+    if (gap == 0) return;
+
+    int64_t levels_remaining = levels_written_ - gap;
+    int64_t destination = levels_position_ - gap;


In other words, destination is equal to start_levels_position. Simplify?

pitrou · 2022-10-27T15:30:03Z

cpp/src/parquet/column_reader.cc

+    ThrowAwayLevels(start_levels_position);
+    // For values, we do not have them in buffer, so we will read them and
+    // throw them away.
+    ReadAndThrowAwayValues(values_to_read);


Looks like this is ignoring the ReadAndThrowAwayValues return value.

Done, by throwing the error in the function itself as suggested below.

pitrou · 2022-10-27T15:30:37Z

cpp/src/parquet/column_reader.cc

+          batch_size, reinterpret_cast<T*>(scratch->mutable_data()));
+      values_left -= values_read;
+    } while (values_read > 0 && values_left > 0);
+    return num_values - values_left;


Well, apparently one call site was updated to check the result. Why not check the result here instead?

pitrou · 2022-10-27T15:35:38Z

cpp/src/parquet/column_reader.cc

+        if (!at_record_start_) {
+          // We ended the row group while inside a record that we haven't seen
+          // the end of yet. So increment the record count for the last record
+          // in the row group


Row groups automatically terminate repeated values?

Yes, however, we are counting the records by seeing the next record (rep_level = 0). So for the very last one, we need to manually increment it.
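
A minimal sketch of the counting rule described in this reply, using a hypothetical `RecordCounter` rather than the reader's actual members. A record is only known to be complete once the next `rep_level == 0` is seen, so the last record has to be closed explicitly when the row group ends.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record counter that works across level batches.
class RecordCounter {
 public:
  // Consume one batch of repetition levels.
  void Consume(const std::vector<int16_t>& rep_levels) {
    for (int16_t rep_level : rep_levels) {
      assert(rep_level >= 0);
      if (rep_level == 0) {
        // A new record starts; the previous one (if any) is now complete.
        if (!at_record_start_) ++records_;
        at_record_start_ = false;
      }
    }
  }

  // Call when the row group has no more levels: the record in progress,
  // if any, ends here even though no following rep_level == 0 was seen.
  void FinishRowGroup() {
    if (!at_record_start_) {
      ++records_;
      at_record_start_ = true;
    }
  }

  int64_t records() const { return records_; }

 private:
  bool at_record_start_ = true;  // true when no record is in progress
  int64_t records_ = 0;
};
```

For records `[10], [20, 20, 20], [30]` split across two batches, the counter reports two complete records after the last batch and three once the row group is finished.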

pitrou · 2022-10-27T15:38:07Z

cpp/src/parquet/column_reader.cc

+    int64_t level_batch_size =
+        std::max<int64_t>(kMinLevelBatchSize, num_records - skipped_records);
+
+    // If 'at_record_start_' is false, but (skip_records == num_records), it


Suggested change

// If 'at_record_start_' is false, but (skip_records == num_records), it

// If 'at_record_start_' is false, but (skipped_records == num_records), it

pitrou · 2022-10-27T16:16:37Z

cpp/src/parquet/column_reader_test.cc

+    pages.push_back(std::move(page));
+  }
+
+  // Page 3: { ... 20], [30]} continues from previous page.


This is already a neat example, but can we add one or two nulls in the mix? :-)

I already tested for skipping nulls in SkipRepeated. I am not sure if adding null records to this test will increase coverage. Since a null record will not span multiple pages.

pitrou · 2022-10-27T16:17:46Z

cpp/src/parquet/column_reader_test.cc

+
+  std::vector<std::shared_ptr<Page>> pages;
+
+  // Page 1: {[10], [20, 20, 20 ... } continues to next page.


Same here: add some nulls?

Similarly, I already tested reading nulls in BasicReadRepeatedField. Additionally, I am adding a stress test in fatemehp#2 that should cover these scenarios.

pitrou · 2022-10-27T16:18:18Z

cpp/src/parquet/column_reader_test.cc

+  }
+}
+
+// Test that Skip works on ByteArrays. Specifically, this is testing


Suggested change

// Test that Skip works on ByteArrays. Specifically, this is testing

// Test that SkipRecords works on ByteArrays. Specifically, this is testing

pitrou · 2022-10-27T16:18:40Z

cpp/src/parquet/column_reader_test.cc

+  level_info.def_level = 1;
+  level_info.rep_level = 0;
+
+  // Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It


Suggested change

// Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It

// Must use REPEATED to exercise ReadAndThrowAwayValues for ByteArrays. It

pitrou · 2022-10-27T16:24:25Z

cpp/src/parquet/column_reader_test.cc

+    expected_values.emplace_back(reinterpret_cast<const char*>(values[values_index].ptr),
+                                 values[values_index].len);
+    ++values_index;
+  }


For the record, you could instead use arrow::BinaryBuilder to build an expected array and then call ArrayRangeEquals (or even AssertArraysEqual).

But the way you're doing it is ok as well.

fatemehp · 2022-10-27T20:29:07Z

Thanks for your comments @pitrou. I think I have addressed them all. Please take a look. Thanks!

pitrou

Thanks a lot @fatemehp ! This looks good to me now.

pitrou · 2022-10-31T15:21:36Z

Fixed lint and merged with latest master. I will wait for CI to run again.

fatemehp · 2022-10-31T20:28:23Z

@pitrou, @emkornfield I think one thing that we did not fully discuss about this PR is whether we want this API to be public or protected. Do we want to allow every client to call SkipRecords at any time? I will be using this API for calling a sequence of Reads and Skips to fully exhaust a column chunk. However, that could be masked behind another public API that allows filtering based on row numbers and does not reveal SkipRecords.

emkornfield · 2022-10-31T20:53:13Z

I think this class is already marked as internal which doesn't really provide guarantees, but I think we should probably answer first if we want to move it out of internal namespace?

ursabot · 2022-11-01T02:01:44Z

Benchmark runs are scheduled for baseline = f9ccca6 and contender = 1164785. 1164785 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.54% ⬆️0.21%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 11647857 ec2-t3-xlarge-us-east-2
[Failed] 11647857 test-mac-arm
[Finished] 11647857 ursa-i9-9960x
[Finished] 11647857 ursa-thinkcentre-m75q
[Finished] f9ccca62 ec2-t3-xlarge-us-east-2
[Failed] f9ccca62 test-mac-arm
[Finished] f9ccca62 ursa-i9-9960x
[Finished] f9ccca62 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

### Rationale for this change

#14142 implemented the logic to skip parquet records and added a lot of test cases. However, there is a minor issue in the test function `RecordReaderPrimitiveTypeTest::CheckReadValues` where `!` is missing.

```c++
if (descr_->schema_node()->is_required()) {
  std::vector<int16_t> read_defs(
      record_reader_->def_levels(),
      record_reader_->def_levels() + record_reader_->levels_position());
  ASSERT_TRUE(vector_equal(expected_defs, read_defs));
}
```

### What changes are included in this PR?

Add `!` to `if (descr_->schema_node()->is_required())` as mentioned above.

### Are these changes tested?

This is a fix to the test case.

### Are there any user-facing changes?

NO.

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Will Jones <[email protected]>


fatemehp and others added 2 commits September 15, 2022 17:45

Add SkipRecords to RecordReader and corresponding unit tests.

66b6531

I plan to add a stress test separately.

Merge branch 'apache:master' into master

327f42d

github-actions bot added Component: C++ Component: Parquet labels Sep 15, 2022

Add test for reading and skipping partial records for repeated fields.

c0fac59

emkornfield reviewed Sep 19, 2022


Addressed comments.

2945559

emkornfield reviewed Sep 19, 2022


emkornfield requested changes Sep 19, 2022


fatemehp added 2 commits September 20, 2022 20:08

Addressed most comments. Working on adding a test for variable length

938e494

type and using gmock.

Use gmock ElementsAre for comparing vectors.

ba87e4d

wgtmac reviewed Sep 22, 2022


emkornfield reviewed Oct 18, 2022


fatemehp added 2 commits October 25, 2022 19:47

Add parameter comment shrink_to_fit for buffer Resize calls.

7355db5

Address comments.

0223c39

Use std::make_unique and collapse declaration and definition to make …

3dd5fa3

…code more concise.

pitrou self-requested a review October 27, 2022 12:50

pitrou requested changes Oct 27, 2022


fatemehp added 2 commits October 27, 2022 18:16

Address comments.

9c4d66f

Address some more comments.

df873c7

pitrou approved these changes Oct 31, 2022


pitrou added 2 commits October 31, 2022 16:10

Merge branch 'master' into conflict

7ab5233

Fix lint

312abbc

pitrou merged commit 1164785 into apache:master Oct 31, 2022

zhixingheyi-tian mentioned this pull request Dec 12, 2022

ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type #14353

Closed

wgtmac mentioned this pull request Apr 19, 2023

MINOR: [C++][Parquet] Fix column_reader_test #35233

Merged

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Add SkipRecords API to RecordReader #42386

Closed

