Conversation

@fatemehp
Contributor

The RecordReader is missing an API to skip records. There is a Skip method in the ColumnReader, but that skips based on the number of values/levels and not records. For repeated fields, this SkipRecords API will detect the record boundaries and correctly skip the right number of values for the requested number of records.
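
For illustration, here is a minimal sketch (not parquet-cpp code; the function and example values are assumptions) of why a record-aware skip differs from a value-based skip on a repeated field: a record only ends where the next repetition level drops back to 0, so N records can span far more than N values.

```c++
#include <cstdint>
#include <vector>

// Count how many level/value slots must be consumed to skip `num_records`
// whole records of a repeated column. A new record starts at rep_level == 0.
int64_t ValuesToSkipForRecords(const std::vector<int16_t>& rep_levels,
                               int64_t num_records) {
  int64_t records_seen = 0;
  int64_t values_to_skip = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) {  // start of a new record
      if (records_seen == num_records) break;
      ++records_seen;
    }
    ++values_to_skip;
  }
  return values_to_skip;
}

// Example: rep_levels {0, 1, 1, 0, 0, 1} encode the records [a, a, a], [b], [c, c].
// ValuesToSkipForRecords(rep_levels, 2) == 4: skipping 2 records consumes 4 values,
// which is why the values-based Skip cannot be used directly.
```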

@fatemehp
Contributor Author

@emkornfield could you take a look?

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@emkornfield
Contributor

@github-actions autotune

// Read dictionary indices.
*indices_read = ReadDictionaryIndices(indices_to_read, indices);
int64_t total_indices = std::max(num_def_levels, *indices_read);
int64_t total_indices = std::max<int>(num_def_levels, *indices_read);
Contributor

should this be int64_t for the template parameter?

Contributor Author

done
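
For context on the comment above, a small illustration (the function name and values are made up) of why the reviewer asks for int64_t rather than int as the template argument of std::max:

```c++
#include <algorithm>
#include <cstdint>

// std::max<int> forces both 64-bit arguments through a narrowing conversion to
// int, which can silently truncate large level/record counts; std::max<int64_t>
// keeps the full width.
int64_t TotalIndices(int64_t num_def_levels, int64_t indices_read) {
  // return std::max<int>(num_def_levels, indices_read);  // narrows to 32 bits
  return std::max<int64_t>(num_def_levels, indices_read);
}
```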

}

int64_t level_batch_size = std::max(kMinLevelBatchSize, num_records);
int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records);
Contributor

same comment on type-template param for max

Contributor Author

done

// non-repeated fields.
int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) {
ARROW_DCHECK(this->max_rep_level_ == 0);
ARROW_DCHECK(this->has_values_to_process());
Contributor

The first DCHECK seems self-explanatory, but I'm not sure I understand the second one. Just to validate: do you think a DCHECK or throwing is more appropriate here?

Contributor Author

I removed the check and added a return statement.

}

// Skip records for repeated fields. Returns number of skipped records.
// Skip records for repeated fields. Returns number of skipped records.
Contributor

nit: repeated comment.

Contributor Author

Done.

// Skip records that we have in our buffer. This function is only for
// non-repeated fields.
int64_t SkipRecordsInBufferNonRepeated(int64_t num_records) {
ARROW_DCHECK(this->max_rep_level_ == 0);
Contributor

is there ARROW_DCHECK_EQ?

Contributor Author

Done.

// Keep filling the buffer and skipping until we reach the desired number
// of records or we run out of values in the column chunk.
int64_t skipped_records = 0;
int64_t level_batch_size = std::max<int>(kMinLevelBatchSize, num_records);
Contributor

int64_t for max?

Contributor Author

Done


ReserveLevels(batch_size);

int16_t* def_levels = this->def_levels() + levels_written_;
Contributor

why is levels_written_ important here if we are discarding data?

Contributor Author

When reading repetition and definition levels, we append them to the buffer that we already have. When we figure out how many of those we need to skip, we shift the values to the left to skip them.

Your comment revealed a bug where I was not shifting the values. I fixed it and checked that in the tests.

Member

Hmm, that doesn't seem to answer the question? Why bump levels_written_ in this loop if the levels are meant to be skipped/discarded anyway? It seems you'll reserve superfluous buffer space for the levels.

Contributor Author

I am bumping it here for correctness. At any point in time levels_written_ shows the end of the levels that are in the buffer. So we update it right here after we read a batch of levels. Note that we may not throw away all the levels that we read here. We may only throw away some of them in DelimitAndSkipRecordsInBuffer. When we throw away levels, we will update levels_written_ accordingly.

You are bringing up a good point here. We actually can read the values that we want to skip into a separate buffer and throw them away, which will then reduce the amount of shifting that we have to do. It can make the code a bit more complicated though since I need to consume the values from this buffer first, then read into the scratch buffer, and if anything is left transfer it over. I will keep this in mind as an optimization on top of this pull request.
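
As a rough sketch of the shift-left approach described earlier in this thread (assumed names and a plain std::vector, not the actual member code), discarding `gap` levels means shifting the unconsumed tail left and moving both cursors back:

```c++
#include <algorithm>
#include <cstdint>
#include <vector>

// Drop `gap` already-consumed levels that sit just before `levels_position` by
// shifting the unconsumed tail [levels_position, levels_written) to the left,
// then moving both cursors back by the same amount.
void ThrowAwayLevelsSketch(std::vector<int16_t>* levels, int64_t* levels_written,
                           int64_t* levels_position, int64_t gap) {
  if (gap == 0) return;
  std::copy(levels->begin() + *levels_position, levels->begin() + *levels_written,
            levels->begin() + *levels_position - gap);
  *levels_written -= gap;
  *levels_position -= gap;
}
```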

// Read 'num_values' values and throw them away.
int64_t ReadAndThrowAway(int64_t num_values) {
int64_t values_left = num_values;
int64_t batch_size = 1024; // ReadBatch with a smaller memory footprint
Contributor

should this be initialized from a constant or config?

Contributor Author

Done.

// This will be enough scratch space to accommodate 16-bit levels or any
// value type
int value_size = type_traits<DType::type_num>::value_byte_size;
std::shared_ptr<ResizableBuffer> scratch = AllocateBuffer(
Contributor

I forget how this works for variable-length types; is it still sufficient for those?

Contributor Author

I think it should work, since the value for variable-length types is technically a length and a pointer. I will add a separate test for them.

Contributor Author

In parquet/types.h

template <>
struct type_traits<Type::BYTE_ARRAY> {
  using value_type = ByteArray;
  static constexpr int value_byte_size = sizeof(ByteArray);
  static constexpr const char* printf_code = "s";
};

Contributor Author

I added the test for ByteArray.

batch_size, reinterpret_cast<T*>(scratch->mutable_data()));
values_left -= values_read;
} while (values_read > 0 && values_left > 0);
return num_values - values_left;
Contributor

should there be a validation here on values_read and num_values?

Contributor Author

I added a check on the result once we return from this function. Is that what you meant?

Member

Well, apparently one call site was updated to check the result. Why not check the result here instead?

Contributor Author

Done.
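
The resolution discussed in this thread (checking the result inside the function and throwing) could look roughly like the following sketch; `read_batch` stands in for the decoder call and all names are illustrative, not the real parquet-cpp API:

```c++
#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Read `num_values` values in fixed-size batches into reusable scratch space and
// discard them, throwing if fewer values could be read than requested.
int64_t ReadAndThrowAwaySketch(
    int64_t num_values, int64_t batch_size,
    const std::function<int64_t(int64_t, int32_t*)>& read_batch) {
  std::vector<int32_t> scratch(static_cast<size_t>(batch_size));
  int64_t values_left = num_values;
  int64_t values_read = 0;
  do {
    values_read = read_batch(std::min(batch_size, values_left), scratch.data());
    values_left -= values_read;
  } while (values_read > 0 && values_left > 0);
  if (values_left != 0) {
    throw std::runtime_error("Could not read and throw away " +
                             std::to_string(values_left) + " values");
  }
  return num_values - values_left;
}
```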

// Conservative upper bound
const int64_t possible_num_values =
std::max(num_records, levels_written_ - levels_position_);
std::max<int>(num_records, levels_written_ - levels_position_);
Contributor

int64_t?

Contributor Author

Done.

// No repetition levels, skip delimiting logic. Each level represents a
// null or not null entry
records_read = std::min(levels_written_ - levels_position_, num_records);
records_read = std::min<int>(levels_written_ - levels_position_, num_records);
Contributor

int64_t?

Contributor Author

Done

/// \return number of records read
virtual int64_t ReadRecords(int64_t num_records) = 0;

/// \brief Attempt to skip indicated number of records from column chunk
Contributor

Let's add a detail here to clarify that this is the number of records and not values.

Contributor Author

Done.

std::shared_ptr<::arrow::ResizableBuffer> values_;
// In the case of false (BYTE_ARRAY), don't allocate the values buffer
// (when we directly read into builder classes).
bool uses_values_;
Contributor

why was this moved up?

Contributor Author

I must have re-ordered it to improve readability.


template <typename DType>
class TypedRecordReader : public ColumnReaderImplBase<DType>,
class TypedRecordReader : public TypedColumnReaderImpl<DType>,
Contributor

why the change in the base class?

Contributor Author

To have access to the ColumnReader's Skip method.

read_values + record_reader->values_written() -
record_reader->null_count());

ASSERT_TRUE(vector_equal(read_vals, {20, 20}));
Contributor

GMock should be available in arrow, so it would be clearer to use EXPECT_THAT(read_values, ElementsAre(20, 20)); here.

Contributor Author

Done.
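
For reference, a self-contained example of the matcher-based assertion suggested above (the test and variable names are illustrative):

```c++
#include <cstdint>
#include <vector>

#include <gmock/gmock.h>
#include <gtest/gtest.h>

// ElementsAre gives a readable failure message listing both sequences, unlike a
// hand-rolled boolean vector comparison.
TEST(ReadValuesExample, UsesElementsAreMatcher) {
  std::vector<int32_t> read_vals = {20, 20};
  EXPECT_THAT(read_vals, ::testing::ElementsAre(20, 20));
}
```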

Contributor

@emkornfield emkornfield left a comment

Thanks for adding test coverage. Left a few comments, as I'm not clear on the exact approach here, but at a high level this looks good.

levels_capacity_ = levels_remaining;
}

records_read_ = 0;
Member

Should we rename records_read_ to records_processed_ and update it accordingly? It is useful when we want to check the current position of the reader.

Contributor Author

This variable is currently not used anywhere, so I am inclined towards removing it since it will be another member variable to keep updated. We could always add it back later and actually use it for checking the current position of the reader.

return num_values - values_left;
}

int64_t SkipRecords(int64_t num_records) override {
Member

It seems that all skip operations called below actually read and then discard some records (unless the remaining values of the current page can be skipped). Why not simply call ReadRecords(num_records) internally and reset the buffers?

Contributor Author

The idea is to enable skipping entire pages at a time if the number of records to skip is sufficiently large.

Member

Can you explain where the "Skipping entire pages at a time" part happens?

Contributor Author

To give some context, suppose that we have M records in the column chunk and we want to read only record N from the column. In that case, we can do something like this: SkipRecords(N-1), ReadRecords(1), SkipRecords(M-N). We can do some optimizations here; for example, for the skip to the end of the column chunk, we would not even need to look at page headers.
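
A sketch of that usage pattern, assuming a reader type with the ReadRecords/SkipRecords interface described in this PR (names illustrative):

```c++
#include <cstdint>

// Read only the N-th record (1-based) out of M records in a column chunk by
// skipping around it. Large skips can avoid decoding whole pages.
template <typename RecordReaderLike>
void ReadSingleRecord(RecordReaderLike* reader, int64_t N, int64_t M) {
  reader->SkipRecords(N - 1);  // everything before the record of interest
  reader->ReadRecords(1);      // materialize just that record
  reader->SkipRecords(M - N);  // the remainder of the column chunk
}
```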

Member

Ok, but are those optimizations done in this PR?

Contributor Author

No, here is what we have now: consider a non-repeated field with 10 pages of 100 values each. SkipRecords(900) will skip "decoding" the first 9 pages; it will still look at the page headers to find out how many values there are per page.
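
A sketch of the page-level part of that behaviour (a standalone illustration with an assumed PageHeaderInfo stand-in, not the parquet-cpp implementation): for a non-repeated column, whole pages can be skipped by consulting only their headers.

```c++
#include <cstdint>
#include <vector>

struct PageHeaderInfo {
  int64_t num_values;  // value count advertised by the page header
};

// Skip whole pages without decoding them as long as the remaining records to
// skip cover an entire page; the residue inside the next page must then be
// decoded and discarded (not shown).
int64_t SkipWholePagesSketch(const std::vector<PageHeaderInfo>& pages,
                             int64_t* records_to_skip) {
  int64_t pages_skipped = 0;
  for (const auto& page : pages) {
    if (*records_to_skip < page.num_values) break;
    *records_to_skip -= page.num_values;  // page skipped wholesale
    ++pages_skipped;
  }
  return pages_skipped;
}
```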

Member

SkipRecords(900) will skip "decoding" the first 9 pages.

Can you point where this happens?

Contributor Author

It happens in TypedColumnReader::Skip.

Member

Ahah. I see, thanks.

int16_t* def_data = def_levels();
std::copy(def_data + levels_position_, def_data + levels_written_,
def_data + levels_position_ - gap);
PARQUET_THROW_NOT_OK(def_levels_->Resize(levels_remaining * sizeof(int16_t), false));
Contributor

nit: comment the boolean literal. This doesn't try to release the memory, correct?

Contributor Author

Correct. Added parameter comments.
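
For reference, a parameter comment in the style being discussed might look like this (an annotation of the line quoted above, shown for illustration; the second Resize argument controls whether capacity is released):

```c++
// Resizing without shrink_to_fit keeps the existing allocation rather than
// releasing memory.
PARQUET_THROW_NOT_OK(
    def_levels_->Resize(levels_remaining * sizeof(int16_t), /*shrink_to_fit=*/false));
```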

int64_t levels_remaining = levels_written_ - gap;

int16_t* def_data = def_levels();
std::copy(def_data + levels_position_, def_data + levels_written_,
Contributor

can these two ranges overlap?

Contributor Author

Yes, they can. However, levels_position_ - gap is smaller than levels_position_ (we return for gap = 0 above), so we can safely use std::copy:
"Copies all elements in the range [first, last) starting from first and proceeding to last - 1. The behavior is undefined if d_first is within the range [first, last). In this case, std::copy_backward may be used instead."

if (this->max_rep_level_ > 0) {
std::copy(rep_levels() + levels_position_, rep_levels() + levels_written_,
rep_levels() + levels_position_ - gap);
int16_t* rep_data = rep_levels();
Contributor

Does it pay to make this a helper method/lambda that can be applied to a buffer?

Contributor Author

Done.

int64_t values_seen = 0;
int64_t skipped_records = DelimitRecords(num_records, &values_seen);
if (ReadAndThrowAwayValues(values_seen) != values_seen) {
throw ParquetException("Could not read and throw away requested values");
Contributor

Adding details on values read vs. values requested, and the position, might be useful when debugging issues.

Contributor Author

Done. We do not buffer values, so we do not have a position to report here. I reported the number of requested values that we could not read.

@fatemehp
Contributor Author

Addressed comments. @emkornfield, @pitrou could you take a look? Thanks!

@emkornfield
Contributor

@pitrou did you want to re-review? Or are you OK if I merge this?

@pitrou
Member

pitrou commented Oct 26, 2022

@emkornfield I'll take a look, thank you!

@pitrou pitrou self-requested a review October 27, 2022 12:50
Member

@pitrou pitrou left a comment

Thanks a lot for the update @fatemehp .

I like the extensive testing and the extensive comments you added, even though I still have some comments and suggestions below.

Nice work overall!

if (gap == 0) return;

int64_t levels_remaining = levels_written_ - gap;
int64_t destination = levels_position_ - gap;
Member

In other words, destination is equal to start_levels_position. Simplify?

Contributor Author

Done.

ThrowAwayLevels(start_levels_position);
// For values, we do not have them in buffer, so we will read them and
// throw them away.
ReadAndThrowAwayValues(values_to_read);
Member

Looks like this is ignoring the ReadAndThrowAwayValues return value.

Contributor Author

Done, by throwing the error in the function itself as suggested below.


if (!at_record_start_) {
// We ended the row group while inside a record that we haven't seen
// the end of yet. So increment the record count for the last record
// in the row group
Member

Row groups automatically terminate repeated values?

Contributor Author

Yes. However, we are counting records by seeing the start of the next record (rep_level = 0), so for the very last record in the row group we need to increment the count manually.

int64_t level_batch_size =
std::max<int64_t>(kMinLevelBatchSize, num_records - skipped_records);

// If 'at_record_start_' is false, but (skip_records == num_records), it
Member

Suggested change
// If 'at_record_start_' is false, but (skip_records == num_records), it
// If 'at_record_start_' is false, but (skipped_records == num_records), it

Contributor Author

Done.

pages.push_back(std::move(page));
}

// Page 3: { ... 20], [30]} continues from previous page.
Member

This is already a neat example, but can we add one or two nulls in the mix? :-)

Contributor Author

I already tested skipping nulls in SkipRepeated. I am not sure adding null records to this test would increase coverage, since a null record will not span multiple pages.


std::vector<std::shared_ptr<Page>> pages;

// Page 1: {[10], [20, 20, 20 ... } continues to next page.
Member

Same here: add some nulls?

Contributor Author

@fatemehp fatemehp Oct 27, 2022

Similarly, I already tested reading nulls in BasicReadRepeatedField. Additionally, I am adding a stress test in fatemehp#2 that should cover these scenarios.

}
}

// Test that Skip works on ByteArrays. Specifically, this is testing
Member

Suggested change
// Test that Skip works on ByteArrays. Specifically, this is testing
// Test that SkipRecords works on ByteArrays. Specifically, this is testing

Contributor Author

Done

level_info.def_level = 1;
level_info.rep_level = 0;

// Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It
Member

Suggested change
// Must use REPEATED to excercise ReadAndThrowAwayValues for ByteArrays. It
// Must use REPEATED to exercise ReadAndThrowAwayValues for ByteArrays. It

Contributor Author

Done.

expected_values.emplace_back(reinterpret_cast<const char*>(values[values_index].ptr),
values[values_index].len);
++values_index;
}
Member

For the record, you could instead use arrow::BinaryBuilder to build an expected array and then call ArrayRangeEquals (or even AssertArraysEqual).

But the way you're doing it is ok as well.

Contributor Author

Acked.
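
For the record, a rough sketch of the builder-based alternative (the helper and the expected strings are made up; ASSERT_OK and AssertArraysEqual come from arrow/testing/gtest_util.h):

```c++
#include <memory>

#include <gtest/gtest.h>

#include "arrow/array/builder_binary.h"
#include "arrow/testing/gtest_util.h"

// Build the expected binary values with arrow::BinaryBuilder and compare whole
// arrays instead of assembling the expected std::vector by hand.
void CheckExpectedBinary(const std::shared_ptr<arrow::Array>& actual) {
  arrow::BinaryBuilder builder;
  ASSERT_OK(builder.Append("foo"));
  ASSERT_OK(builder.Append("bar"));
  std::shared_ptr<arrow::Array> expected;
  ASSERT_OK(builder.Finish(&expected));
  arrow::AssertArraysEqual(*expected, *actual);
}
```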

@fatemehp
Contributor Author

Thanks for your comments @pitrou. I think I have addressed them all. Please take a look. Thanks!

Member

@pitrou pitrou left a comment

Thanks a lot @fatemehp ! This looks good to me now.

@pitrou
Member

pitrou commented Oct 31, 2022

Fixed lint and merged with latest master. I will wait for CI to run again.

@pitrou pitrou merged commit 1164785 into apache:master Oct 31, 2022
@fatemehp
Contributor Author

fatemehp commented Oct 31, 2022

@pitrou, @emkornfield I think one thing that we did not fully discuss about this PR is whether we want this API to be public or protected. Do we want to allow every client to call SkipRecords at any time? I will be using this API for calling a sequence of Reads and Skips to fully exhaust a column chunk. However, that could be masked behind another public API that allows filtering based on row numbers and does not reveal SkipRecords.

@emkornfield
Contributor

I think this class is already marked as internal, which doesn't really provide guarantees, but we should probably first answer whether we want to move it out of the internal namespace.

@ursabot
Copy link

ursabot commented Nov 1, 2022

Benchmark runs are scheduled for baseline = f9ccca6 and contender = 1164785. 1164785 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.54% ⬆️0.21%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 11647857 ec2-t3-xlarge-us-east-2
[Failed] 11647857 test-mac-arm
[Finished] 11647857 ursa-i9-9960x
[Finished] 11647857 ursa-thinkcentre-m75q
[Finished] f9ccca62 ec2-t3-xlarge-us-east-2
[Failed] f9ccca62 test-mac-arm
[Finished] f9ccca62 ursa-i9-9960x
[Finished] f9ccca62 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

wjones127 pushed a commit that referenced this pull request Apr 19, 2023
### Rationale for this change

#14142 implemented the logic to skip Parquet records and added a lot of test cases. However, there is a minor issue in the test function `RecordReaderPrimitiveTypeTest::CheckReadValues` where a `!` is missing.

```c++
    if (descr_->schema_node()->is_required()) {
      std::vector<int16_t> read_defs(
          record_reader_->def_levels(),
          record_reader_->def_levels() + record_reader_->levels_position());
      ASSERT_TRUE(vector_equal(expected_defs, read_defs));
    }

```

### What changes are included in this PR?

Add `!` to `if (descr_->schema_node()->is_required())` as mentioned above.

### Are these changes tested?

This is a fix to the test case.

### Are there any user-facing changes?

NO.

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Will Jones <[email protected]>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023