-
Notifications
You must be signed in to change notification settings - Fork 4k
PARQUET-2188: [parquet-cpp] Add SkipRecords API to RecordReader #14142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I plan to add a stress test separately.
|
@emkornfield could you take a look? |
|
|
|
@github-actions autotune |
cpp/src/parquet/column_reader.cc
Outdated
| // Read dictionary indices. | ||
| *indices_read = ReadDictionaryIndices(indices_to_read, indices); | ||
| int64_t total_indices = std::max(num_def_levels, *indices_read); | ||
| int64_t total_indices = std::max<int>(num_def_levels, *indices_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be int64_t for the template parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/column_reader.cc
Outdated
| } | ||
|
|
||
| int64_t level_batch_size = std::max(kMinLevelBatchSize, num_records); | ||
| int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment on type-template param for max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/column_reader.cc
Outdated
| // non-repeated fields. | ||
| int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) { | ||
| ARROW_DCHECK(this->max_rep_level_ == 0); | ||
| ARROW_DCHECK(this->has_values_to_process()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first DCHECK seems self explanitory, I'm not sure I understand the second one though. Just to validate, do you think DCHECK or a throwing here is more appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the check and added a return statement.
cpp/src/parquet/column_reader.cc
Outdated
| } | ||
|
|
||
| // Skip records for repeated fields. Returns number of skipped records. | ||
| // Skip records for repeated fields. Returns number of skipped records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: repeated comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
| // Skip records that we have in our buffer. This function is only for | ||
| // non-repeated fields. | ||
| int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) { | ||
| ARROW_DCHECK(this->max_rep_level_ == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there ARROW_DCHECK_EQ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
| // Keep filling the buffer and skipping until we reach the desired number | ||
| // of records or we run out of values in the column chunk. | ||
| int64_t skipped_records = 0; | ||
| int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t for max?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| ReserveLevels(batch_size); | ||
|
|
||
| int16_t* def_levels = this->def_levels() + levels_written_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is levels_written_ important here if we are discarding data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reading repetition and definition levels, we append them to the buffer that we already have. When we figure out how many of those we need to skip, we shift the values to the left to skip them.
Your comment revealed a bug where I was not shifting the values. I fixed it and checked that in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that doesn't seem to answer the question? Why bump levers_written_ in this loop if the levels are meant to be skipped/discarded anyway? It seems you'll reserve superfluous buffer space for the levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am bumping it here for correctness. At any point in time levels_written_ shows the end of the levels that are in the buffer. So we update it right here after we read a batch of levels. Note that we may not throw away all the levels that we read here. We may only throw away some of them in DelimitAndSkipRecordsInBuffer. When we throw away levels, we will update levels_written_ accordingly.
You are bringing up a good point here. We actually can read the values that we want to skip into a separate buffer and throw them away, which will then reduce the amount of shifting that we have to do. It can make the code a bit more complicated though since I need to consume the values from this buffer first, then read into the scratch buffer, and if anything is left transfer it over. I will keep this in mind as an optimization on top of this pull request.
cpp/src/parquet/column_reader.cc
Outdated
| // Read 'num_values' values and throw them away. | ||
| int64_t ReadAndThrowAway(int64_t num_values) { | ||
| int64_t values_left = num_values; | ||
| int64_t batch_size = 1024; // ReadBatch with a smaller memory footprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be initialized from a constant or config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| // This will be enough scratch space to accommodate 16-bit levels or any | ||
| // value type | ||
| int value_size = type_traits<DType::type_num>::value_byte_size; | ||
| std::shared_ptr<ResizableBuffer> scratch = AllocateBuffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i forget how this works for variable length types, is it still sufficient for those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should work since the value for the variable length types is technically a length and a pointer. I will add a separate test for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In parquet/types.h
template <>
struct type_traitsType::BYTE_ARRAY {
using value_type = ByteArray;
static constexpr int value_byte_size = sizeof(ByteArray);
static constexpr const char* printf_code = "s";
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test for ByteArray.
cpp/src/parquet/column_reader.cc
Outdated
| batch_size, reinterpret_cast<T*>(scratch->mutable_data())); | ||
| values_left -= values_read; | ||
| } while (values_read > 0 && values_left > 0); | ||
| return num_values - values_left; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be a validation here on values_read and num_values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check to check the result once we return from this function. Is that what you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, apparently one call site was updated to check the result. Why not check the result here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
| // Conservative upper bound | ||
| const int64_t possible_num_values = | ||
| std::max(num_records, levels_written_ - levels_position_); | ||
| std::max<int>(num_records, levels_written_ - levels_position_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
| // No repetition levels, skip delimiting logic. Each level represents a | ||
| // null or not null entry | ||
| records_read = std::min(levels_written_ - levels_position_, num_records); | ||
| records_read = std::min<int>(levels_written_ - levels_position_, num_records); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.h
Outdated
| /// \return number of records read | ||
| virtual int64_t ReadRecords(int64_t num_records) = 0; | ||
|
|
||
| /// \brief Attempt to skip indicated number of records from column chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add a detail here to clarify this is the numer of rows records and not Values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| std::shared_ptr<::arrow::ResizableBuffer> values_; | ||
| // In the case of false (BYTE_ARRAY), don't allocate the values buffer | ||
| // (when we directly read into builder classes). | ||
| bool uses_values_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this moved up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have re-ordered it to improve readability.
|
|
||
| template <typename DType> | ||
| class TypedRecordReader : public ColumnReaderImplBase<DType>, | ||
| class TypedRecordReader : public TypedColumnReaderImpl<DType>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the change in the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To have access to the ColumnReader's Skip method.
| read_values + record_reader->values_written() - | ||
| record_reader->null_count()); | ||
|
|
||
| ASSERT_TRUE(vector_equal(read_vals, {20, 20})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GMock should be available in arrow, so it would be clearer to use EXPECT_THAT(read_values, ElementsAre(20, 20)); here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a test coverage. Left a few comments as I'm not clear on the exact approach here but at a high level this looks good.
| levels_capacity_ = levels_remaining; | ||
| } | ||
|
|
||
| records_read_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename records_read_ to records_processed_ and update it accordingly? It is useful when we want to check the current position of the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is currently not used anywhere, so I am inclined towards removing it since it will be another member variable to keep updated. We could always add it back later and actually use it for checking the current position of the reader.
| return num_values - values_left; | ||
| } | ||
|
|
||
| int64_t SkipRecords(int64_t num_records) override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that all SKIP operations called below actually read and then discard some records (unless the remaining values of current page can be skipped). Why not simply calling ReadRecords(num_records) internally and reset the buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to enable Skipping entire pages at a time if the number of records to skip is sufficiently large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain where the "Skipping entire pages at a time" part happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give some context, suppose that we have M records in the column chunk and we want to read only record N from the column. In that case, we can do something like this. SkipRecords(N-1), ReadRecords(1), SkipRecords(M - N). We can do some optimizations here. For example for the skip to the end of the column chunk, we would not need to even look at page headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but are those optimizations done in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, here is what we have now: Consider a non-repeated field, and that there are 10 pages with 100 values each. SkipRecords(900) will skip "decoding" the first 9 pages. It will still look at the page headers to find out how many values there are per page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SkipRecords(900) will skip "decoding" the first 9 pages.
Can you point where this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It happens in TypedColumnReader::Skip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahah. I see, thanks.
cpp/src/parquet/column_reader.cc
Outdated
| int16_t* def_data = def_levels(); | ||
| std::copy(def_data + levels_position_, def_data + levels_written_, | ||
| def_data + levels_position_ - gap); | ||
| PARQUET_THROW_NOT_OK(def_levels_->Resize(levels_remaining * sizeof(int16_t), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comment literal. This doesn't try to release the memory correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Added parameter comments.
cpp/src/parquet/column_reader.cc
Outdated
| int64_t levels_remaining = levels_written_ - gap; | ||
|
|
||
| int16_t* def_data = def_levels(); | ||
| std::copy(def_data + levels_position_, def_data + levels_written_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can these two ranges overlap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they can. However levels_position - gap is smaller than levels_position (we return for gap = 0 above). So we can safely use std::copy:
"Copies all elements in the range [first, last) starting from first and proceeding to last - 1. The behavior is undefined if d_first is within the range [first, last). In this case, std::copy_backward may be used instead."
cpp/src/parquet/column_reader.cc
Outdated
| if (this->max_rep_level_ > 0) { | ||
| std::copy(rep_levels() + levels_position_, rep_levels() + levels_written_, | ||
| rep_levels() + levels_position_ - gap); | ||
| int16_t* rep_data = rep_levels(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it pay to make this a helper method/lambda that can be applied to a buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
| int64_t values_seen = 0; | ||
| int64_t skipped_records = DelimitRecords(num_records, &values_seen); | ||
| if (ReadAndThrowAwayValues(values_seen) != values_seen) { | ||
| throw ParquetException("Could not read and throw away requested values"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding details on values read/values and position might be useful when debugging issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. We do not buffer values so we do not have a position to report here. I reported the number of values that were requested and we could not read.
|
Addressed comments. @emkornfield, @pitrou could you take a look? Thanks! |
…code more concise.
|
@pitrou did you want to rereview? Or are you OK if I merge this? |
|
@emkornfield I'll take a look, thank you! |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the update @fatemehp .
I like the extensive testing and the extensive comments you added, even though I still have some comments and suggestions below.
Nice work overall!
cpp/src/parquet/column_reader.cc
Outdated
| if (gap == 0) return; | ||
|
|
||
| int64_t levels_remaining = levels_written_ - gap; | ||
| int64_t destination = levels_position_ - gap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, destination is equal to start_levels_position. Simplify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| ThrowAwayLevels(start_levels_position); | ||
| // For values, we do not have them in buffer, so we will read them and | ||
| // throw them away. | ||
| ReadAndThrowAwayValues(values_to_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is ignoring the ReadAndThrowAwayValues return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, by throwing the error in the function itself as suggested below.
cpp/src/parquet/column_reader.cc
Outdated
| batch_size, reinterpret_cast<T*>(scratch->mutable_data())); | ||
| values_left -= values_read; | ||
| } while (values_read > 0 && values_left > 0); | ||
| return num_values - values_left; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, apparently one call site was updated to check the result. Why not check the result here instead?
| if (!at_record_start_) { | ||
| // We ended the row group while inside a record that we haven't seen | ||
| // the end of yet. So increment the record count for the last record | ||
| // in the row group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Row groups automatically terminate repeated values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, however, we are counting the records by seeing the next record (rep_level = 0). So for the very last one, we need to manually increment it.
cpp/src/parquet/column_reader.cc
Outdated
| int64_t level_batch_size = | ||
| std::max<int64_t>(kMinLevelBatchSize, num_records - skipped_records); | ||
|
|
||
| // If 'at_record_start_' is false, but (skip_records == num_records), it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // If 'at_record_start_' is false, but (skip_records == num_records), it | |
| // If 'at_record_start_' is false, but (skipped_records == num_records), it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| pages.push_back(std::move(page)); | ||
| } | ||
|
|
||
| // Page 3: { ... 20], [30]} continues from previous page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already a neat example, but can we add one or two nulls in the mix? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already tested for skipping nulls in SkipRepeated. I am not sure if adding null records to this test will increase coverage. Since a null record will not span multiple pages.
|
|
||
| std::vector<std::shared_ptr<Page>> pages; | ||
|
|
||
| // Page 1: {[10], [20, 20, 20 ... } continues to next page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: add some nulls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I already tested reading nulls in BasicReadRepeatedField. Additionally, I am adding a stress test in fatemehp#2 that should cover these scenarios.
| } | ||
| } | ||
|
|
||
| // Test that Skip works on ByteArrays. Specifically, this is testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Test that Skip works on ByteArrays. Specifically, this is testing | |
| // Test that SkipRecords works on ByteArrays. Specifically, this is testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| level_info.def_level = 1; | ||
| level_info.rep_level = 0; | ||
|
|
||
| // Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It | |
| // Must use REPEATED to exercise ReadAndThrowAwayValues for ByteArrays. It |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| expected_values.emplace_back(reinterpret_cast<const char*>(values[values_index].ptr), | ||
| values[values_index].len); | ||
| ++values_index; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, you could instead use arrow::BinaryBuilder to build an expected array and then call ArrayRangeEquals (or even AssertArraysEqual).
But the way you're doing it is ok as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acked.
|
Thanks for your comments @pitrou. I think I have addressed them all. Please take a look. Thanks! |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @fatemehp ! This looks good to me now.
|
Fixed lint and merged with latest master. I will wait for CI to run again. |
|
@pitrou, @emkornfield I think one thing that we did not discuss fully about this PR is that do we want this API to be public or protected? Do we want to allow every client to call SkipRecords at any time? I will be using this API for calling a sequence of Reads and Skips to fully exhaust a column chunk. However, that could be masked behind another public API that allows filtering based on row numbers and does not reveal SkipRecords. |
|
I think this class is already marked as internal which doesn't really provide guarantees, but I think we should probably answer first if we want to move it out of internal namespace? |
|
Benchmark runs are scheduled for baseline = f9ccca6 and contender = 1164785. 1164785 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
### Rationale for this change #14142 implemented the logic to skip parquet records and added a lot of test caese. However, there is a minor issue in the test function `RecordReaderPrimitiveTypeTest::CheckReadValues` where `!` is missing. ```c++ if (descr_->schema_node()->is_required()) { std::vector<int16_t> read_defs( record_reader_->def_levels(), record_reader_->def_levels() + record_reader_->levels_position()); ASSERT_TRUE(vector_equal(expected_defs, read_defs)); } ``` ### What changes are included in this PR? Add `!` to `if (descr_->schema_node()->is_required())` as mentioned above. ### Are these changes tested? This is a fix to the test case. ### Are there any user-facing changes? NO. Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
### Rationale for this change apache#14142 implemented the logic to skip parquet records and added a lot of test caese. However, there is a minor issue in the test function `RecordReaderPrimitiveTypeTest::CheckReadValues` where `!` is missing. ```c++ if (descr_->schema_node()->is_required()) { std::vector<int16_t> read_defs( record_reader_->def_levels(), record_reader_->def_levels() + record_reader_->levels_position()); ASSERT_TRUE(vector_equal(expected_defs, read_defs)); } ``` ### What changes are included in this PR? Add `!` to `if (descr_->schema_node()->is_required())` as mentioned above. ### Are these changes tested? This is a fix to the test case. ### Are there any user-facing changes? NO. Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
### Rationale for this change apache#14142 implemented the logic to skip parquet records and added a lot of test caese. However, there is a minor issue in the test function `RecordReaderPrimitiveTypeTest::CheckReadValues` where `!` is missing. ```c++ if (descr_->schema_node()->is_required()) { std::vector<int16_t> read_defs( record_reader_->def_levels(), record_reader_->def_levels() + record_reader_->levels_position()); ASSERT_TRUE(vector_equal(expected_defs, read_defs)); } ``` ### What changes are included in this PR? Add `!` to `if (descr_->schema_node()->is_required())` as mentioned above. ### Are these changes tested? This is a fix to the test case. ### Are there any user-facing changes? NO. Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
The RecordReader is missing an API to skip records. There is a Skip method in the ColumnReader, but that skips based on the number of values/levels and not records. For repeated fields, this SkipRecords API will detect the record boundaries and correctly skip the right number of values for the requested number of records.