
Conversation

@zhixingheyi-tian (Contributor) commented Oct 9, 2022

Target

Improve Parquet reading performance for the String/Binary types by operating on buffers directly instead of going through BinaryArrayBuilder.
Fixed-width types already take full advantage of buffers here:

std::shared_ptr<Array> TransferZeroCopy(RecordReader* reader,

This PR applies the same approach to boost String/Binary reading performance.
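For illustration only (a minimal sketch, not the PR's actual code; the helper name is made up): the buffer-based approach amounts to wrapping the already-decoded validity, offsets, and value buffers into a BinaryArray in one step, instead of appending value by value through a builder.

#include <arrow/api.h>

// Wrap already-decoded buffers into a BinaryArray without copying them again.
// `validity` may be nullptr when there are no nulls.
std::shared_ptr<arrow::Array> MakeBinaryArrayFromBuffers(
    int64_t length, int64_t null_count,
    std::shared_ptr<arrow::Buffer> validity,
    std::shared_ptr<arrow::Buffer> offsets,  // (length + 1) int32 offsets
    std::shared_ptr<arrow::Buffer> data) {   // concatenated value bytes
  auto array_data = arrow::ArrayData::Make(
      arrow::binary(), length,
      {std::move(validity), std::move(offsets), std::move(data)}, null_count);
  return arrow::MakeArray(array_data);
}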

Performance evaluation

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
OS: CentOS 7.6
Data: single Parquet file, dictionary encoding, gzip compression, 100M rows, 10 columns
Run: single thread

Elapsed time:
Arrow-upstream   27.3 s
Arrow-optimized  12.2 s  (2.23x speedup)

Performance evaluation using binary benchmark #15100

upstream/master:

BM_ReadBinaryColumn/null_probability:0/unique_values:32   3417224521 ns   3409209551 ns            1 bytes_per_second=53.5429M/s items_per_second=3.07572M/s
BM_ReadBinaryColumn/null_probability:0/unique_values:-1   1423706251 ns   1420369542 ns            1 bytes_per_second=151.331M/s items_per_second=7.38242M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:32   3796615016 ns   3787710745 ns            1 bytes_per_second=47.8159M/s items_per_second=2.76836M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:32  3129534432 ns   3122194897 ns            1 bytes_per_second=35.6467M/s items_per_second=3.35846M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:32  1907305272 ns   1902829090 ns            1 bytes_per_second=21.7693M/s items_per_second=5.51062M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:-1   1569981873 ns   1566302194 ns            1 bytes_per_second=136.12M/s items_per_second=6.6946M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:-1  1576715826 ns   1573015486 ns            1 bytes_per_second=81.0747M/s items_per_second=6.66602M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:-1  1383394533 ns   1380152318 ns            1 bytes_per_second=30.2516M/s items_per_second=7.59754M/s

zhixingheyi-tian/arrow_parquet_string_opt:

BM_ReadBinaryColumn/null_probability:0/unique_values:32    720659967 ns    718964998 ns            1 bytes_per_second=253.891M/s items_per_second=14.5845M/s
BM_ReadBinaryColumn/null_probability:0/unique_values:-1    901647061 ns    899527968 ns            1 bytes_per_second=238.954M/s items_per_second=11.657M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:32    852270789 ns    850262221 ns            1 bytes_per_second=213.008M/s items_per_second=12.3324M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:32   898193520 ns    896083642 ns            1 bytes_per_second=124.203M/s items_per_second=11.7018M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:32   449696378 ns    448639932 ns            2 bytes_per_second=92.3308M/s items_per_second=23.3723M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:-1    980079842 ns    977777588 ns            1 bytes_per_second=218.05M/s items_per_second=10.7241M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:-1  1114096867 ns   1111474752 ns            1 bytes_per_second=114.741M/s items_per_second=9.4341M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:-1   574420621 ns    573069568 ns            1 bytes_per_second=72.8565M/s items_per_second=18.2975M/s

@cyb70289 @wjones127 @pitrou

The performance improvement from this optimization is significant.

github-actions bot commented Oct 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127 (Member)

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

@zhixingheyi-tian (Contributor Author)

@wjones127,
I will focus on this patch soon.
Reviews welcome!

Thanks!

@zhixingheyi-tian (Contributor Author)

Hi @pitrou,

I have recently fixed the remaining 7 failing UTs, and this patch is ready for review.

Thanks!

@zhixingheyi-tian (Contributor Author)

Hi @jorisvandenbossche @wjones127 @pitrou @iajoiner,

This PR is ready for review, please have a look.
Thanks!

@zhixingheyi-tian (Contributor Author)

The newly failing UT "RecordReaderByteArrayTest.SkipByteArray" comes from the recent commit #14142. That UT uses the GetBuilderChunks() interface: https://github.com/apache/arrow/blame/6cfe24633b9fe3c474137571940eca35d7a475dc/cpp/src/parquet/column_reader_test.cc#L1181-L1185

This performance PR avoids that interface, so the test fails.

Any suggestions on how to fix this UT, or should the UT itself be changed?

Thanks!

@zhixingheyi-tian (Contributor Author) commented Dec 12, 2022

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

Hi @wjones127,
Good idea to show the performance.

But how do I add a Binary/String benchmark in *benchmark.cc?
I noticed that the current Arrow benchmarks do not support generating random binary data.
Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

In my local server end-to-end performance testing:

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
OS: CentOS 7.6
Data: single Parquet file, dictionary encoding, gzip compression, 100M rows, 10 columns
Run: single thread

Elapsed time:
Arrow-upstream   27.3 s
Arrow-optimized  12.2 s  (2.23x speedup)

@pitrou (Member) commented Dec 12, 2022

This performance PR avoids that interface, so the test fails.

Any suggestions on how to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

@zhixingheyi-tian (Contributor Author) commented Dec 13, 2022

This performance PR avoids that interface, so the test fails.
Any suggestions on how to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Hi @pitrou,
I have fixed this UT (RecordReaderByteArrayTest.SkipByteArray) today.
Thanks!

@zhixingheyi-tian (Contributor Author) commented Dec 13, 2022

Hi @cyb70289 @jorisvandenbossche,
Would you please give this a review?
Thanks!

@cyb70289 (Contributor)

Would you fix the CI failures?

@zhixingheyi-tian (Contributor Author)

Would you fix the CI failures?
Thanks @cyb70289

I am fixing the errors in the GitHub CI.
For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

# ./parquet-arrow-test --gtest_filter=TestArrowReadDeltaEncoding.DeltaByteArray
Running main() from /home/shen/software/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TestArrowReadDeltaEncoding.DeltaByteArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray (0 ms)
[----------] 1 test from TestArrowReadDeltaEncoding (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray

@cyb70289 (Contributor)

I am fixing the errors in the GitHub CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

Don't know the details. But it should be easy to find in the source code.

@pitrou (Member) commented Dec 14, 2022

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

@zhixingheyi-tian (Contributor Author)

I am fixing the errors in the GitHub CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

Don't know the details. But it should be easy to find in the source code.

New commits should fix all UTs. Please give a review!
Thanks!

@zhixingheyi-tian (Contributor Author)

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

If we use a *Builder, it may add an extra data copy when accumulating each element.
Compare with the fixed-width usage:

inline int DecodePlain(const uint8_t* data, int64_t data_size, int num_values,
                       int type_length, T* out) {
  int64_t bytes_to_decode = num_values * static_cast<int64_t>(sizeof(T));
  if (bytes_to_decode > data_size || bytes_to_decode > INT_MAX) {
    ParquetException::EofException();
  }
  // If bytes_to_decode == 0, data could be null
  if (bytes_to_decode > 0) {
    memcpy(out, data, bytes_to_decode);

Thanks!
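
To illustrate the concern about extra copies (a minimal standalone sketch, not this PR's implementation; the names are made up): when the lengths are known, the offsets and value bytes can be written into pre-sized buffers with one memcpy per value, with no builder in between.

#include <cstdint>
#include <cstring>
#include <vector>

struct BinaryValue {
  const uint8_t* ptr;
  int32_t len;
};

// Assumes total_bytes was computed while decoding the lengths, so both
// outputs are sized once up front; each value then costs a single memcpy.
void DecodeIntoBuffers(const std::vector<BinaryValue>& values, int64_t total_bytes,
                       std::vector<int32_t>* offsets, std::vector<uint8_t>* data) {
  offsets->resize(values.size() + 1);
  data->resize(total_bytes);
  int32_t pos = 0;
  for (size_t i = 0; i < values.size(); ++i) {
    (*offsets)[i] = pos;
    std::memcpy(data->data() + pos, values[i].ptr, values[i].len);
    pos += values[i].len;
  }
  (*offsets)[values.size()] = pos;
}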

@cyb70289 (Contributor) left a comment

Thanks @zhixingheyi-tian. It looks like a useful improvement.
Please see some comments from me.

Besides, did you compare benchmarks (using the microbenchmarks in the Arrow source) for various cases (no nulls, with nulls, different data types, etc.) against the current code? If some microbenchmarks are missing, will you add them? They are necessary to evaluate this PR.

/// \brief Pre-allocate space for data. Results in better flat read performance
virtual void Reserve(int64_t num_values) = 0;

virtual void ReserveValues(int64_t capacity) {}
Contributor:

A new interface? Are these added data members and functions necessary for this base class? I suppose they are only for the new reader implementation.

Contributor Author:

Previously, this was a TypedRecordReader internal interface:

void ReserveValues(int64_t extra_values) {
  const int64_t new_values_capacity =
      UpdateCapacity(values_capacity_, values_written_, extra_values);
  if (new_values_capacity > values_capacity_) {
    // XXX(wesm): A hack to avoid memory allocation when reading directly
    // into builder classes
    if (uses_values_) {
      PARQUET_THROW_NOT_OK(values_->Resize(bytes_for_values(new_values_capacity),
                                           /*shrink_to_fit=*/false));
    }
    values_capacity_ = new_values_capacity;
  }
  if (leaf_info_.HasNullableValues()) {
    int64_t valid_bytes_new = bit_util::BytesForBits(values_capacity_);
    if (valid_bits_->size() < valid_bytes_new) {
      int64_t valid_bytes_old = bit_util::BytesForBits(values_written_);
      PARQUET_THROW_NOT_OK(
          valid_bits_->Resize(valid_bytes_new, /*shrink_to_fit=*/false));
      // Avoid valgrind warnings
      memset(valid_bits_->mutable_data() + valid_bytes_old, 0,
             valid_bytes_new - valid_bytes_old);
    }

ByteArrayChunkedOptRecordReader extends TypedRecordReader, so the method was extracted as a public interface.

Contributor:

Isn't it just a helper function specific to implementation?

Member:

Since these are coming from TypedRecordReader, which is private, could you mark its methods as virtual instead?

Suggested change
virtual void ReserveValues(int64_t capacity) {}

Member:

Is it better to make it pure virtual? In addition, it would help to add a comment for this public function.

current_encoding_ = encoding;
current_decoder_->SetData(static_cast<int>(num_buffered_values_), buffer,
static_cast<int>(data_size));
if (!hasSet_uses_opt_) {
Contributor:

Stick to snake_case for variables.

Contributor:

Don't mix camel and snake case.
has_set_uses_opt

const auto last_offset = offsetArr[values_written_];
int64_t binary_length = last_offset - first_offset;
binary_per_row_length_ = binary_length / values_written_ + 1;
// std::cout << "binary_per_row_length_:" << binary_per_row_length_ << std::endl;
Contributor:

Please remove debug code.

Contributor Author:

Done

Contributor:

It looks like there is other debug code that was not removed. Please remove all of it, for example:
// RETURN_NOT_OK(IndexInBounds(idx));

Comment on lines +867 to +868
bool hasSet_uses_opt_ = false;
bool uses_opt_ = false;
Contributor:

Are these two flags really necessary?
It looks to me like a trivial helper function would be better.

Contributor Author:

These two flags are for
if (current_encoding_ == Encoding::PLAIN_DICTIONARY || current_encoding_ == Encoding::PLAIN || current_encoding_ == Encoding::RLE_DICTIONARY)

They just avoid doing the comparison every time.

Contributor:

Try to keep it simple. I don't think there is any performance consideration here.
E.g., define a helper function:
bool UsesOpt() const { return current_encoding_ == xxx || ...; }

Member:

Could you create a separate function for that, as Yibo suggested? If you do measure a meaningful performance difference, could you share your results then?

In addition, could you add a comment explaining why the optimization is only applicable to those three encodings?

Member:

Hmm, also please find a more descriptive name than "uses optimization". (which optimization?)

Comment on lines +854 to +856
if (current_encoding_ == Encoding::PLAIN_DICTIONARY ||
current_encoding_ == Encoding::PLAIN ||
current_encoding_ == Encoding::RLE_DICTIONARY) {
Contributor:

Are all these cases covered by UT?

Contributor Author:

Yes.
This patch comes from one customer's case; they want to boost String scan performance.
So this patch targets the common Parquet encodings: PLAIN, RLE_DICTIONARY, and PLAIN_DICTIONARY.

Other encodings will skip this optimization.

The existing UTs cover all encoding cases.

Member:

Just curious, why are the other encodings not supported?

Comment on lines +1994 to +2002
} else {
::arrow::ArrayVector result = accumulator_.chunks;
if (result.size() == 0 || accumulator_.builder->length() > 0) {
std::shared_ptr<::arrow::Array> last_chunk;
PARQUET_THROW_NOT_OK(accumulator_.builder->Finish(&last_chunk));
result.push_back(std::move(last_chunk));
}
accumulator_.chunks = {};
return result;
Contributor:

Duplicates ByteArrayChunkedRecordReader? This doesn't look right.

@zhixingheyi-tian (Contributor Author) commented Dec 19, 2022:

The optimized RecordReader implementation is ByteArrayChunkedOptRecordReader

https://github.com/zhixingheyi-tian/arrow/blob/8d611c673d3ed23abfa32f50da3cc4c4e4feb807/cpp/src/parquet/column_reader.cc#L2203-L2206

It is only used for the Binary/String/LargeBinary/LargeString types.

Contributor:

This probably means it's not a good idea to create a new class for this optimized reader.

That said, I don't have deep knowledge of the Parquet code and lack the bandwidth to investigate right now; someone else might comment.

Member:

I agree, we shouldn't create another class. I don't see any reason this can't be used for the Decimal case.

Member:

I've found locally that if I merge the implementations, the unit tests pass. Could you please merge them in the PR?

Contributor Author:

@wjones127
Did you merge the two classes ByteArrayChunkedRecordReader and ByteArrayChunkedOptRecordReader, and did all unit tests pass?

Member:

Yes, those are the two classes I merged.

Member:

Here is the changeset that allows the parquet-arrow-test and parquet-arrow-internals-test to pass. zhixingheyi-tian#2

Could you incorporate those changes?

Member:

+1 for not adding a separate class. This would be difficult to maintain if more optimizations are added. It would be better to add an option so that the user can manually turn it off when something goes wrong with the new feature.

int32_t* offset,
std::shared_ptr<::arrow::ResizableBuffer>& values,
int64_t valid_bits_offset, int32_t* bianry_length) {
return 0;
Contributor:

I saw many dummy implementations like this in the PR. Probably they can be eliminated.

Contributor Author:

This just follows the previous DecodeArrow() function call-stack usage.

Contributor:

My gut feeling is still that they are not necessary, though I might be wrong.

Member:

It seems like we are adding this because the other methods are based on builders (in the accumulator), and builders don't provide a way to transfer multiple values in one memcpy. Does that sound right?

I wonder if we could add such a method on builders, and that might be a cleaner solution. Something like:

class BaseBinaryBuilder {
...
  Status UnsafeAppendValues(const uint8_t* values, int64_t length, const uint8_t* valid_bytes = NULL_PTR) { ... }
};

The reason going back to builders might be good is that I don't think these changes handle the case where there are more values than can fit into a StringArray. The current implementation will split it up into chunks if it reaches capacity, and I think we need to keep that behavior.
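
To illustrate the chunking concern (a rough sketch under the assumption of int32 offsets, not the reader's actual logic): a BinaryArray's offsets are int32, so once the accumulated data would exceed that range the current chunk has to be finished and a new one started, which the builder-based path already does.

#include <cstdint>
#include <limits>

// Returns true when appending value_len more bytes to a chunk that already
// holds current_data_bytes would overflow the int32 offsets of a BinaryArray,
// i.e. the chunk should be finished and a fresh one started.
bool NeedsNewChunk(int64_t current_data_bytes, int64_t value_len) {
  return current_data_bytes + value_len >
         static_cast<int64_t>(std::numeric_limits<int32_t>::max());
}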

Member:

Also, I'll say it again, but another possibility is to change the definition of EncodingTraits<ByteArrayType>::Accumulator from:

template <>
struct EncodingTraits<ByteArrayType> {
  using Encoder = ByteArrayEncoder;
  using Decoder = ByteArrayDecoder;

  /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
  /// overflow the capacity of a single arrow::BinaryArray
  struct Accumulator {
    std::unique_ptr<::arrow::BinaryBuilder> builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };
  using ArrowType = ::arrow::BinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};

to:

template <>
struct EncodingTraits<ByteArrayType> {
  using Encoder = ByteArrayEncoder;
  using Decoder = ByteArrayDecoder;

  /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
  /// overflow the capacity of a single arrow::Int32Array
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };
  using ArrowType = ::arrow::BinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};

Either this or @wjones127 's suggestion would be better than adding a specialized method, IMHO.

Member:

Thanks Antoine. I guess I'm just catching up to your understanding 😄

Member:

When I was at my previous employer, we implemented a mutable arrow::Array to address a similar issue in the ORC reader. The idea is that we can know in advance the total length of all string/binary values in a single batch, so we can pre-allocate the data buffer at once, or even reuse the previous buffer if it has enough capacity. The overhead of buffer allocation and resize operations is non-negligible.

@pitrou We have discussed the idea in https://issues.apache.org/jira/browse/ARROW-15289
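
As a rough sketch of that idea (an assumed helper, not the code discussed here): when the total byte length of the incoming batch is known up front, the value buffer only needs to grow when its current size is insufficient, so repeated batches of similar size avoid reallocation.

#include <arrow/buffer.h>
#include <parquet/exception.h>

// Grow only when needed; Resize with shrink_to_fit=false never releases
// already-allocated memory, so the buffer can be reused across batches.
void EnsureValueCapacity(const std::shared_ptr<arrow::ResizableBuffer>& values,
                         int64_t total_value_bytes) {
  if (values->size() < total_value_bytes) {
    PARQUET_THROW_NOT_OK(values->Resize(total_value_bytes, /*shrink_to_fit=*/false));
  }
}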

Member:

@wgtmac I'm not sure if you are objecting to my proposal above. A separate BufferBuilder for the string data allows to presize for the computed total length, so it should address your concern.

Member:

@wgtmac I'm not sure if you are objecting to my proposal above. A separate BufferBuilder for the string data allows to presize for the computed total length, so it should address your concern.

Sorry, I didn't make it clear. I'm not objecting to your proposal. I was just summarizing another optimization we have done before, which applies if reusing arrow::RecordBatch on the same reader is possible. Internal experiments reveal that repeated buffer allocation and resize operations are a non-negligible overhead, especially for wide schemas (e.g. 1000+ columns). @pitrou

int32_t indices[kBufferSize];
auto dst_value = values->mutable_data() + (*bianry_length);

::arrow::internal::BitmapReader bit_reader(valid_bits, valid_bits_offset, num_values);
Contributor:

Do we have benchmark results for this BitmapReader-based implementation?
Is it better than the original code (bit block counter)?

@zhixingheyi-tian (Contributor Author)

Good idea to show the performance.

But how do I add a Binary/String benchmark in *benchmark.cc? I noticed that the current Arrow benchmarks do not support generating random binary data. Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

Hi @cyb70289,
Thank you very much for your kind review! The question above about adding String/Binary micro benchmarks still stands.

@cyb70289 (Contributor)

But how do I add a Binary/String benchmark in *benchmark.cc? I noticed that the current Arrow benchmarks do not support generating random binary data. Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

There are many examples you can reference. cpp/src/arrow/testing/random.h defines functions to generate different kinds of random test data.

Please note it can be a separate PR to add new benchmarks, which can be reviewed and merged before this one.
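
For example (a sketch; the sizes and null probability are arbitrary choices, not values suggested in this thread), random string data for such a benchmark could come from the generator declared in that header:

#include <arrow/testing/random.h>

// Build a random utf8 array for a read/write benchmark: ~1M strings of
// 0-32 bytes with 1% nulls.
std::shared_ptr<arrow::Array> MakeRandomStrings() {
  ::arrow::random::RandomArrayGenerator rng(/*seed=*/42);
  return rng.String(/*size=*/1 << 20, /*min_length=*/0, /*max_length=*/32,
                    /*null_probability=*/0.01);
}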

@zhixingheyi-tian (Contributor Author)

@cyb70289 @wjones127
During the last two weeks I was suffering from a high fever. Sorry!
I will now continue to work on this patch.

@wjones127 (Member)

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help us evaluate these changes.

@zhixingheyi-tian (Contributor Author)

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help us evaluate these changes.

Thanks!

@wjones127 (Member)

@ursabot please benchmark command=cpp-micro --suite-filter=parquet-arrow-reader-writer-benchmark

@ursabot commented Jan 10, 2023

Benchmark runs are scheduled for baseline = f82501e and contender = 19e2ba3. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished ⬇️0.1% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 19e2ba3b ursa-thinkcentre-m75q
[Finished] f82501e7 ec2-t3-xlarge-us-east-2
[Failed] f82501e7 test-mac-arm
[Finished] f82501e7 ursa-i9-9960x
[Finished] f82501e7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@wjones127 (Member)

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
    lambda: pa.array([0, None] * 10),
    lambda: pa.array([0, None] * 10).dictionary_encode(),
    lambda: pa.array(["", None] * 10),
    lambda: pa.array(["", None] * 10).dictionary_encode(),
])
@pytest.mark.parametrize('use_dictionary', [False, True])
@pytest.mark.parametrize('read_dictionary', [False, True])
def test_buffer_contents(
        array_factory, use_dictionary, read_dictionary, use_legacy_dataset
):
    # Test that null values are deterministically initialized to zero
    # after a roundtrip through Parquet.
    # See ARROW-8006 and ARROW-8011.
    orig_table = pa.Table.from_pydict({"col": array_factory()})
    bio = io.BytesIO()
    pq.write_table(orig_table, bio, use_dictionary=True)
    bio.seek(0)
    read_dictionary = ['col'] if read_dictionary else None
    table = pq.read_table(bio, use_threads=False,
                          read_dictionary=read_dictionary,
                          use_legacy_dataset=use_legacy_dataset)
    for col in table.columns:
        [chunk] = col.chunks
        buf = chunk.buffers()[1]
        assert buf.to_pybytes() == buf.size * b"\0"

Perhaps we need to make sure that if there are null values, we don't copy those parts of the buffer?
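
One way to read the test's requirement (a hypothetical standalone sketch, not the PR code): when decoding with a validity bitmap, a null slot should copy no bytes and simply repeat the previous offset, so untouched regions of the pre-allocated, zero-initialized buffers stay zero.

#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// offsets must have is_valid.size() + 1 slots; data is zero-initialized and
// large enough for all decoded bytes. decoded holds (pointer, length) pairs
// for the non-null values only, in order.
void FillWithNulls(const std::vector<std::pair<const uint8_t*, int32_t>>& decoded,
                   const std::vector<bool>& is_valid, int32_t* offsets, uint8_t* data) {
  int32_t pos = 0;
  size_t next = 0;
  for (size_t i = 0; i < is_valid.size(); ++i) {
    offsets[i] = pos;
    if (is_valid[i]) {
      std::memcpy(data + pos, decoded[next].first, decoded[next].second);
      pos += decoded[next].second;
      ++next;
    }
    // Null slot: nothing is copied, and offsets[i + 1] will equal offsets[i].
  }
  offsets[is_valid.size()] = pos;
}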

// Helper data structure for accumulating builder chunks
typename EncodingTraits<ByteArrayType>::Accumulator accumulator_;

int32_t bianry_length_ = 0;
Member:

It's a pre-existing issue, but could you fix this variable name?

Suggested change
int32_t bianry_length_ = 0;
int32_t binary_length_ = 0;

Contributor Author:

done

// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

static constexpr int32_t kDefaultBinaryPerRowSzie = 20;
Member:

Suggested change
static constexpr int32_t kDefaultBinaryPerRowSzie = 20;
static constexpr int32_t kDefaultBinaryPerRowSize = 20;

Contributor Author:

done

void ResetValues() {
  if (values_written_ > 0) {
    // Resize to 0, but do not shrink to fit
    PARQUET_THROW_NOT_OK(valid_bits_->Resize(0, false));
Member:

I ran the unit tests (parquet-arrow-test) with a debugger, and found this branch was never hit. Does that seem right? Could you add a test that validates this branch?

Contributor Author:

This just follows the existing code here:

void ResetValues() {
  if (values_written_ > 0) {
    // Resize to 0, but do not shrink to fit
    if (uses_values_) {
      PARQUET_THROW_NOT_OK(values_->Resize(0, /*shrink_to_fit=*/false));
    }

@wjones127 (Member)

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

@zhixingheyi-tian (Contributor Author)

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

I have rebased. For performance numbers, please refer to #14353 (comment).

@zhixingheyi-tian (Contributor Author)

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
    lambda: pa.array([0, None] * 10),
    lambda: pa.array([0, None] * 10).dictionary_encode(),
    lambda: pa.array(["", None] * 10),
    lambda: pa.array(["", None] * 10).dictionary_encode(),
])
@pytest.mark.parametrize('use_dictionary', [False, True])
@pytest.mark.parametrize('read_dictionary', [False, True])
def test_buffer_contents(
        array_factory, use_dictionary, read_dictionary, use_legacy_dataset
):
    # Test that null values are deterministically initialized to zero
    # after a roundtrip through Parquet.
    # See ARROW-8006 and ARROW-8011.
    orig_table = pa.Table.from_pydict({"col": array_factory()})
    bio = io.BytesIO()
    pq.write_table(orig_table, bio, use_dictionary=True)
    bio.seek(0)
    read_dictionary = ['col'] if read_dictionary else None
    table = pq.read_table(bio, use_threads=False,
                          read_dictionary=read_dictionary,
                          use_legacy_dataset=use_legacy_dataset)
    for col in table.columns:
        [chunk] = col.chunks
        buf = chunk.buffers()[1]
        assert buf.to_pybytes() == buf.size * b"\0"

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

I haven't touched pyarrow before. pyarrow is built on arrow-cpp underneath, right?
How can I reproduce the issue easily in C++?
Thanks!

@wjones127 wjones127 self-requested a review January 20, 2023 16:55
@wjones127 (Member)

How to reproduce the issue easily in cpp?

I looked at it a bit more, and I'm actually not sure. It seems to be verifying that the offsets buffer is all zeros when the array is composed only of empty strings and nulls, which I can't get to fail in C++. I can help look at that further once the other feedback is addressed in this PR.

@wjones127 (Member) left a comment

The performance is looking promising. I've looked through part of it, and have some suggestions for simplifying the code. Will look through the rest soon.

Comment on lines +379 to +380
bool hasCal_average_len_ = false;
int64_t binary_per_row_length_ = kDefaultBinaryPerRowSize;
Member:

First, I think this would be clearer as a std::optional, rather than a boolean on the side.
Second, please document the purpose of these fields in the header file.

Suggested change
bool hasCal_average_len_ = false;
int64_t binary_per_row_length_ = kDefaultBinaryPerRowSize;
/// \brief Typical size of single binary value, used for pre-allocating value buffer.
///
/// Before this is set, kDefaultBinaryPerRowSize is used. After the first
/// batch of values, this is set to the size of the values buffer divided by
/// the number of values.
std::optional<int64_t> binary_per_row_length_ = std::nullopt;

}
std::shared_ptr<ResizableBuffer> ReleaseOffsets() override {
auto result = offset_;
if (ARROW_PREDICT_FALSE(!hasCal_average_len_)) {
Member:

if we make it an optional:

Suggested change
if (ARROW_PREDICT_FALSE(!hasCal_average_len_)) {
if (ARROW_PREDICT_FALSE(!binary_per_row_length_.has_value())) {

Comment on lines +2044 to +2045
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * binary_per_row_length_, false));
Member:

if we make this an optional:

Suggested change
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * binary_per_row_length_, false));
int64_t per_row_length = binary_per_row_length_.value_or(kDefaultBinaryPerRowSize);
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * per_row_length, false));

// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

static constexpr int32_t kDefaultBinaryPerRowSize = 20;
Member:

Since this corresponds to binary_per_row_length_, could we make the names match? I'm thinking "bytes per row" is the best description here:

Suggested change
static constexpr int32_t kDefaultBinaryPerRowSize = 20;
static constexpr int32_t kDefaultBinaryBytesPerRow = 20;

(Also change binary_per_row_length_ to binary_bytes_per_row_)

Member:

It would be better to add a comment here.

}
}

std::shared_ptr<ResizableBuffer> ReleaseOffsets() override { return nullptr; }
Member:

Suggested change
std::shared_ptr<ResizableBuffer> ReleaseOffsets() override { return nullptr; }
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() { return nullptr; }

}

void ReserveValues(int64_t extra_values) {
void ReserveValues(int64_t extra_values) override {
Member:

Suggested change
void ReserveValues(int64_t extra_values) override {
virtual void ReserveValues(int64_t extra_values) {

Comment on lines +306 to +307
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() = 0;

Member:

Suggested change
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() = 0;

Member:

Same here for a comment.






int32_t* offset,
std::shared_ptr<::arrow::ResizableBuffer>& values,
int64_t valid_bits_offset, int* out_values_decoded,
int32_t* bianry_length) {
Member:

Suggested change
int32_t* bianry_length) {
int32_t* binary_length) {

Member:

There are similar typos below.

return Status::OK();
}

Status DecodeArrowDenseNonNull_opt(int num_values, int32_t* offset,
Member:

Can we avoid this kind of name and give a meaningful one instead?

while (values_decoded < num_values) {
int32_t batch_size = std::min<int32_t>(kBufferSize, num_values - values_decoded);
int num_indices = idx_decoder_.GetBatch(indices, batch_size);
if (num_indices == 0) ParquetException::EofException();
Member:

It would be helpful to provide a concrete error message.


std::vector<std::shared_ptr<Buffer>> buffers = {ReleaseIsValid(), ReleaseOffsets(),
ReleaseValues()};
auto data = std::make_shared<::arrow::ArrayData>(
::arrow::binary(), values_written(), buffers, null_count());
Member:

Is ::arrow::binary() correct? Will it cause a type mismatch if the actual type is ::arrow::utf8()? For example, when checking equality with a StringArray, the type identity may be broken if one side is binary and the other is utf8.


@amol- (Member) commented Mar 30, 2023

Closing because it has been untouched for a while; if it's still relevant, feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@wgtmac (Member) commented Mar 31, 2023

@zhixingheyi-tian Do you plan to update this?
