
Conversation

@zhixingheyi-tian (Contributor) commented Oct 9, 2022

Target

Improve Parquet reading performance for the String/Binary types by operating on buffers directly instead of going through BinaryArrayBuilder.
Fixed-width types already take full advantage of buffers here:

std::shared_ptr<Array> TransferZeroCopy(RecordReader* reader,

This PR applies the same approach to boost String/Binary reading performance.
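For illustration only (a minimal sketch, not the PR's actual code; the helper name is made up): the buffer-based approach amounts to wrapping the already-decoded validity, offsets, and value buffers into a BinaryArray in one step, instead of appending value by value through a builder.

#include <arrow/api.h>

// Wrap already-decoded buffers into a BinaryArray without copying them again.
// `validity` may be nullptr when there are no nulls.
std::shared_ptr<arrow::Array> MakeBinaryArrayFromBuffers(
    int64_t length, int64_t null_count,
    std::shared_ptr<arrow::Buffer> validity,
    std::shared_ptr<arrow::Buffer> offsets,  // (length + 1) int32 offsets
    std::shared_ptr<arrow::Buffer> data) {   // concatenated value bytes
  auto array_data = arrow::ArrayData::Make(
      arrow::binary(), length,
      {std::move(validity), std::move(offsets), std::move(data)}, null_count);
  return arrow::MakeArray(array_data);
}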

Performance evaluation

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
OS: CentOS 7.6
Data: single Parquet file, dictionary encoding, gzip compression, 100M rows, 10 columns
Run: single thread

Elapsed time:
Arrow-upstream   27.3 s
Arrow-optimized  12.2 s  (2.23x speedup)

Performance evaluation using binary benchmark #15100

upstream/master:

BM_ReadBinaryColumn/null_probability:0/unique_values:32   3417224521 ns   3409209551 ns            1 bytes_per_second=53.5429M/s items_per_second=3.07572M/s
BM_ReadBinaryColumn/null_probability:0/unique_values:-1   1423706251 ns   1420369542 ns            1 bytes_per_second=151.331M/s items_per_second=7.38242M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:32   3796615016 ns   3787710745 ns            1 bytes_per_second=47.8159M/s items_per_second=2.76836M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:32  3129534432 ns   3122194897 ns            1 bytes_per_second=35.6467M/s items_per_second=3.35846M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:32  1907305272 ns   1902829090 ns            1 bytes_per_second=21.7693M/s items_per_second=5.51062M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:-1   1569981873 ns   1566302194 ns            1 bytes_per_second=136.12M/s items_per_second=6.6946M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:-1  1576715826 ns   1573015486 ns            1 bytes_per_second=81.0747M/s items_per_second=6.66602M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:-1  1383394533 ns   1380152318 ns            1 bytes_per_second=30.2516M/s items_per_second=7.59754M/s

zhixingheyi-tian/arrow_parquet_string_opt:

BM_ReadBinaryColumn/null_probability:0/unique_values:32    720659967 ns    718964998 ns            1 bytes_per_second=253.891M/s items_per_second=14.5845M/s
BM_ReadBinaryColumn/null_probability:0/unique_values:-1    901647061 ns    899527968 ns            1 bytes_per_second=238.954M/s items_per_second=11.657M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:32    852270789 ns    850262221 ns            1 bytes_per_second=213.008M/s items_per_second=12.3324M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:32   898193520 ns    896083642 ns            1 bytes_per_second=124.203M/s items_per_second=11.7018M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:32   449696378 ns    448639932 ns            2 bytes_per_second=92.3308M/s items_per_second=23.3723M/s
BM_ReadBinaryColumn/null_probability:1/unique_values:-1    980079842 ns    977777588 ns            1 bytes_per_second=218.05M/s items_per_second=10.7241M/s
BM_ReadBinaryColumn/null_probability:50/unique_values:-1  1114096867 ns   1111474752 ns            1 bytes_per_second=114.741M/s items_per_second=9.4341M/s
BM_ReadBinaryColumn/null_probability:99/unique_values:-1   574420621 ns    573069568 ns            1 bytes_per_second=72.8565M/s items_per_second=18.2975M/s

@cyb70289 @wjones127 @pitrou

The performance improvement from this optimization is significant.

github-actions bot commented Oct 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127 (Member)

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

@zhixingheyi-tian (Contributor Author)

@wjones127,
I will focus on this patch soon.
Reviews welcome!

Thanks!

@zhixingheyi-tian (Contributor Author)

Hi @pitrou,

I have recently fixed the remaining 7 failing UTs, and this patch is ready for review.

Thanks!

@zhixingheyi-tian (Contributor Author)

Hi @jorisvandenbossche @wjones127 @pitrou @iajoiner,

This PR is ready for review, please have a look.
Thanks!

@zhixingheyi-tian (Contributor Author)

The newly failing UT "RecordReaderByteArrayTest.SkipByteArray" comes from the recent commit #14142. That UT uses the GetBuilderChunks() interface: https://github.com/apache/arrow/blame/6cfe24633b9fe3c474137571940eca35d7a475dc/cpp/src/parquet/column_reader_test.cc#L1181-L1185

This performance PR avoids that interface, so the test fails.

Any suggestions on how to fix this UT, or should the UT itself be changed?

Thanks!

@zhixingheyi-tian (Contributor Author) commented Dec 12, 2022

Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?

Would you mind rebasing and cleaning up any commented code?

In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:

  • src/parquet/arrow/reader_writer_benchmark.cc
  • src/parquet/parquet_encoding_benchmark.cc

We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:

BUILD_DIR=cpp/path/to/build

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json

git checkout master # Or some ref that has benchmarks but not your changes

archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json

archery benchmark diff contender.json baseline.json 

Hi @wjones127,
Good idea to show the performance.

But how do I add a Binary/String benchmark in *benchmark.cc?
I noticed that the current Arrow benchmarks do not support generating random binary data.
Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

In my local server end-to-end performance testing:

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
OS: CentOS 7.6
Data: single Parquet file, dictionary encoding, gzip compression, 100M rows, 10 columns
Run: single thread

Elapsed time:
Arrow-upstream   27.3 s
Arrow-optimized  12.2 s  (2.23x speedup)

@pitrou (Member) commented Dec 12, 2022

This performance PR avoids that interface, so the test fails.

Any suggestions on how to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

@zhixingheyi-tian (Contributor Author) commented Dec 13, 2022

This performance PR avoids that interface, so the test fails.
Any suggestions on how to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Hi @pitrou,
I have fixed this UT (RecordReaderByteArrayTest.SkipByteArray) today.
Thanks!

@zhixingheyi-tian (Contributor Author) commented Dec 13, 2022

Hi @cyb70289 @jorisvandenbossche,
Would you please give this a review?
Thanks!

@cyb70289 (Contributor)

Would you fix the CI failures?

@zhixingheyi-tian (Contributor Author)

Would you fix the CI failures?
Thanks @cyb70289

I am fixing the errors in the GitHub CI.
For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

# ./parquet-arrow-test --gtest_filter=TestArrowReadDeltaEncoding.DeltaByteArray
Running main() from /home/shen/software/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TestArrowReadDeltaEncoding.DeltaByteArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray (0 ms)
[----------] 1 test from TestArrowReadDeltaEncoding (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray

@cyb70289 (Contributor)

I am fixing the errors in the GitHub CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

Don't know the details. But it should be easy to find in the source code.

@pitrou (Member) commented Dec 14, 2022

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

@zhixingheyi-tian (Contributor Author)

I am fixing the errors in the GitHub CI. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but failing on CI?

Don't know the details. But it should be easy to find in the source code.

New commits should fix all UTs. Please give a review!
Thanks!

@zhixingheyi-tian (Contributor Author)

Instead of defining entire separate classes for this, why not change EncodingTraits::Accumulator to the following:

template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };

If we use a *Builder, it may add an extra data copy when accumulating each element.
Compare with the fixed-width usage:

inline int DecodePlain(const uint8_t* data, int64_t data_size, int num_values,
                       int type_length, T* out) {
  int64_t bytes_to_decode = num_values * static_cast<int64_t>(sizeof(T));
  if (bytes_to_decode > data_size || bytes_to_decode > INT_MAX) {
    ParquetException::EofException();
  }
  // If bytes_to_decode == 0, data could be null
  if (bytes_to_decode > 0) {
    memcpy(out, data, bytes_to_decode);

Thanks!
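
To illustrate the concern about extra copies (a minimal standalone sketch, not this PR's implementation; the names are made up): when the lengths are known, the offsets and value bytes can be written into pre-sized buffers with one memcpy per value, with no builder in between.

#include <cstdint>
#include <cstring>
#include <vector>

struct BinaryValue {
  const uint8_t* ptr;
  int32_t len;
};

// Assumes total_bytes was computed while decoding the lengths, so both
// outputs are sized once up front; each value then costs a single memcpy.
void DecodeIntoBuffers(const std::vector<BinaryValue>& values, int64_t total_bytes,
                       std::vector<int32_t>* offsets, std::vector<uint8_t>* data) {
  offsets->resize(values.size() + 1);
  data->resize(total_bytes);
  int32_t pos = 0;
  for (size_t i = 0; i < values.size(); ++i) {
    (*offsets)[i] = pos;
    std::memcpy(data->data() + pos, values[i].ptr, values[i].len);
    pos += values[i].len;
  }
  (*offsets)[values.size()] = pos;
}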

@cyb70289 (Contributor) left a comment

Thanks @zhixingheyi-tian. It looks like a useful improvement.
Please see some comments from me.

Besides, did you compare benchmarks (using the microbenchmarks in the Arrow source) for various cases (no nulls, with nulls, different data types, etc.) against the current code? If some microbenchmarks are missing, will you add them? They are necessary to evaluate this PR.

/// \brief Pre-allocate space for data. Results in better flat read performance
virtual void Reserve(int64_t num_values) = 0;

virtual void ReserveValues(int64_t capacity) {}
Contributor:

A new interface? Are these added data members and functions necessary for this base class? I suppose they are only for the new reader implementation.

Contributor Author:

Previously, this was a TypedRecordReader internal interface:

void ReserveValues(int64_t extra_values) {
  const int64_t new_values_capacity =
      UpdateCapacity(values_capacity_, values_written_, extra_values);
  if (new_values_capacity > values_capacity_) {
    // XXX(wesm): A hack to avoid memory allocation when reading directly
    // into builder classes
    if (uses_values_) {
      PARQUET_THROW_NOT_OK(values_->Resize(bytes_for_values(new_values_capacity),
                                           /*shrink_to_fit=*/false));
    }
    values_capacity_ = new_values_capacity;
  }
  if (leaf_info_.HasNullableValues()) {
    int64_t valid_bytes_new = bit_util::BytesForBits(values_capacity_);
    if (valid_bits_->size() < valid_bytes_new) {
      int64_t valid_bytes_old = bit_util::BytesForBits(values_written_);
      PARQUET_THROW_NOT_OK(
          valid_bits_->Resize(valid_bytes_new, /*shrink_to_fit=*/false));
      // Avoid valgrind warnings
      memset(valid_bits_->mutable_data() + valid_bytes_old, 0,
             valid_bytes_new - valid_bytes_old);
    }

ByteArrayChunkedOptRecordReader extends TypedRecordReader, so the method was extracted as a public interface.

Contributor:

Isn't it just a helper function specific to implementation?

Member:

Since these are coming from TypedRecordReader, which is private, could you mark its methods as virtual instead?

Suggested change
virtual void ReserveValues(int64_t capacity) {}

Member:

Is it better to make it pure virtual? In addition, it would help to add a comment for this public function.

current_encoding_ = encoding;
current_decoder_->SetData(static_cast<int>(num_buffered_values_), buffer,
static_cast<int>(data_size));
if (!hasSet_uses_opt_) {
Contributor:

Stick to snake_case for variables.

Contributor:

Don't mix camel and snake case.
has_set_uses_opt

const auto last_offset = offsetArr[values_written_];
int64_t binary_length = last_offset - first_offset;
binary_per_row_length_ = binary_length / values_written_ + 1;
// std::cout << "binary_per_row_length_:" << binary_per_row_length_ << std::endl;
Contributor:

Please remove debug code.

Contributor Author:

Done

Contributor:

It looks like there is other debug code that was not removed. Please remove all of it, for example:
// RETURN_NOT_OK(IndexInBounds(idx));

Comment on lines +867 to +868
bool hasSet_uses_opt_ = false;
bool uses_opt_ = false;
Contributor:

Are these two flags really necessary?
It looks to me like a trivial helper function would be better.

Contributor Author:

These two flags are for
if (current_encoding_ == Encoding::PLAIN_DICTIONARY || current_encoding_ == Encoding::PLAIN || current_encoding_ == Encoding::RLE_DICTIONARY)

They just avoid doing the comparison every time.

Contributor:

Try to keep it simple. I don't think there is any performance consideration here.
E.g., define a helper function:
bool UsesOpt() const { return current_encoding_ == xxx || ...; }

Member:

Could you create a separate function for that, as Yibo suggested? If you do measure a meaningful performance difference, could you share your results then?

In addition, could you add a comment explaining why the optimization is only applicable to those three encodings?

Member:

Hmm, also please find a more descriptive name than "uses optimization". (which optimization?)

Comment on lines +854 to +856
if (current_encoding_ == Encoding::PLAIN_DICTIONARY ||
current_encoding_ == Encoding::PLAIN ||
current_encoding_ == Encoding::RLE_DICTIONARY) {
Contributor:

Are all these cases covered by UT?

Contributor Author:

Yes.
This patch comes from one customer's case; they want to boost String scan performance.
So this patch targets the common Parquet encodings: PLAIN, RLE_DICTIONARY, and PLAIN_DICTIONARY.

Other encodings will skip this optimization.

The existing UTs cover all encoding cases.

Member:

Just curious, why are the other encodings not supported?

Comment on lines +1994 to +2002
} else {
::arrow::ArrayVector result = accumulator_.chunks;
if (result.size() == 0 || accumulator_.builder->length() > 0) {
std::shared_ptr<::arrow::Array> last_chunk;
PARQUET_THROW_NOT_OK(accumulator_.builder->Finish(&last_chunk));
result.push_back(std::move(last_chunk));
}
accumulator_.chunks = {};
return result;
Contributor:

Duplicates ByteArrayChunkedRecordReader? This doesn't look right.

@zhixingheyi-tian (Contributor Author) commented Dec 19, 2022:

The optimized RecordReader implementation is ByteArrayChunkedOptRecordReader

https://github.com/zhixingheyi-tian/arrow/blob/8d611c673d3ed23abfa32f50da3cc4c4e4feb807/cpp/src/parquet/column_reader.cc#L2203-L2206

It is only used for the Binary/String/LargeBinary/LargeString types.

Contributor:

This probably means it's not a good idea to create a new class for this optimized reader.

That said, I don't have deep knowledge of the Parquet code and lack the bandwidth to investigate right now; someone else might comment.

Member:

I agree, we shouldn't create another class. I don't see any reason this can't be used for the Decimal case.

Member:

I've found locally that if I merge the implementations, the unit tests pass. Could you please merge them in the PR?

Contributor Author:

@wjones127
Did you merge the two classes ByteArrayChunkedRecordReader and ByteArrayChunkedOptRecordReader, and did all unit tests pass?

Member:

Yes, those are the two classes I merged.

Member:

Here is the changeset that allows the parquet-arrow-test and parquet-arrow-internals-test to pass. zhixingheyi-tian#2

Could you incorporate those changes?

Member:

+1 for not adding a separate class. This would be difficult to maintain if more optimizations are added. It would be better to add an option so that the user can manually turn it off when something goes wrong with the new feature.

int32_t* offset,
std::shared_ptr<::arrow::ResizableBuffer>& values,
int64_t valid_bits_offset, int32_t* bianry_length) {
return 0;
Contributor:

I saw many dummy implementations like this in the PR. Probably they can be eliminated.

Contributor Author:

This just follows the previous DecodeArrow() function call-stack usage.

Contributor:

My gut feeling is still that they are not necessary, though I might be wrong.

Member:

It seems like we are adding this because the other methods are based on builders (in the accumulator), and builders don't provide a way to transfer multiple values in one memcpy. Does that sound right?

I wonder if we could add such a method on builders, and that might be a cleaner solution. Something like:

class BaseBinaryBuilder {
...
  Status UnsafeAppendValues(const uint8_t* values, int64_t length, const uint8_t* valid_bytes = NULL_PTR) { ... }
};

The reason going back to builders might be good is that I don't think these changes handle the case where there are more values than can fit into a StringArray. The current implementation will split it up into chunks if it reaches capacity, and I think we need to keep that behavior.
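
To illustrate the chunking concern (a rough sketch under the assumption of int32 offsets, not the reader's actual logic): a BinaryArray's offsets are int32, so once the accumulated data would exceed that range the current chunk has to be finished and a new one started, which the builder-based path already does.

#include <cstdint>
#include <limits>

// Returns true when appending value_len more bytes to a chunk that already
// holds current_data_bytes would overflow the int32 offsets of a BinaryArray,
// i.e. the chunk should be finished and a fresh one started.
bool NeedsNewChunk(int64_t current_data_bytes, int64_t value_len) {
  return current_data_bytes + value_len >
         static_cast<int64_t>(std::numeric_limits<int32_t>::max());
}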

Member:

Also, I'll say it again, but another possibility is to change the definition of EncodingTraits<ByteArrayType>::Accumulator from:

template <>
struct EncodingTraits<ByteArrayType> {
  using Encoder = ByteArrayEncoder;
  using Decoder = ByteArrayDecoder;

  /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
  /// overflow the capacity of a single arrow::BinaryArray
  struct Accumulator {
    std::unique_ptr<::arrow::BinaryBuilder> builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };
  using ArrowType = ::arrow::BinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};

to:

template <>
struct EncodingTraits<ByteArrayType> {
  using Encoder = ByteArrayEncoder;
  using Decoder = ByteArrayDecoder;

  /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
  /// overflow the capacity of a single arrow::Int32Array
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };
  using ArrowType = ::arrow::BinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};

Either this or @wjones127 's suggestion would be better than adding a specialized method, IMHO.

Member:

Thanks Antoine. I guess I'm just catching up to your understanding 😄

Member:

When I was at my previous employer, we implemented a mutable arrow::Array to address a similar issue in the ORC reader. The idea is that we can know in advance the total length of all string/binary values in a single batch, so we can pre-allocate the data buffer at once, or even reuse the previous buffer if it has enough capacity. The overhead of buffer allocation and resize operations is non-negligible.

@pitrou We have discussed the idea in https://issues.apache.org/jira/browse/ARROW-15289
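
As a rough sketch of that idea (an assumed helper, not the code discussed here): when the total byte length of the incoming batch is known up front, the value buffer only needs to grow when its current size is insufficient, so repeated batches of similar size avoid reallocation.

#include <arrow/buffer.h>
#include <parquet/exception.h>

// Grow only when needed; Resize with shrink_to_fit=false never releases
// already-allocated memory, so the buffer can be reused across batches.
void EnsureValueCapacity(const std::shared_ptr<arrow::ResizableBuffer>& values,
                         int64_t total_value_bytes) {
  if (values->size() < total_value_bytes) {
    PARQUET_THROW_NOT_OK(values->Resize(total_value_bytes, /*shrink_to_fit=*/false));
  }
}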

Member:

@wgtmac I'm not sure if you are objecting to my proposal above. A separate BufferBuilder for the string data allows to presize for the computed total length, so it should address your concern.

Member:

@wgtmac I'm not sure if you are objecting to my proposal above. A separate BufferBuilder for the string data allows to presize for the computed total length, so it should address your concern.

Sorry, I didn't make it clear. I'm not objecting to your proposal. I was just summarizing another optimization we have done before, which applies if reusing arrow::RecordBatch on the same reader is possible. Internal experiments reveal that repeated buffer allocation and resize operations are a non-negligible overhead, especially for wide schemas (e.g. 1000+ columns). @pitrou

int32_t indices[kBufferSize];
auto dst_value = values->mutable_data() + (*bianry_length);

::arrow::internal::BitmapReader bit_reader(valid_bits, valid_bits_offset, num_values);
Contributor:

Do we have benchmark results for this BitmapReader-based implementation?
Is it better than the original code (bit block counter)?

@zhixingheyi-tian (Contributor Author)

Good idea to show the performance.

But how do I add a Binary/String benchmark in *benchmark.cc? I noticed that the current Arrow benchmarks do not support generating random binary data. Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

Hi @cyb70289,
Thank you very much for your kind review! The question above about adding String/Binary micro benchmarks still stands.

@cyb70289 (Contributor)

But how do I add a Binary/String benchmark in *benchmark.cc? I noticed that the current Arrow benchmarks do not support generating random binary data. Would you please give some help or advice on adding String/Binary micro benchmarks? Thanks!

There are many examples you can reference. cpp/src/arrow/testing/random.h defines functions to generate different kinds of random test data.

Please note it can be a separate PR to add new benchmarks, which can be reviewed and merged before this one.
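
For example (a sketch; the sizes and null probability are arbitrary choices, not values suggested in this thread), random string data for such a benchmark could come from the generator declared in that header:

#include <arrow/testing/random.h>

// Build a random utf8 array for a read/write benchmark: ~1M strings of
// 0-32 bytes with 1% nulls.
std::shared_ptr<arrow::Array> MakeRandomStrings() {
  ::arrow::random::RandomArrayGenerator rng(/*seed=*/42);
  return rng.String(/*size=*/1 << 20, /*min_length=*/0, /*max_length=*/32,
                    /*null_probability=*/0.01);
}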

@zhixingheyi-tian (Contributor Author)

@cyb70289 @wjones127
During the last two weeks I was suffering from a high fever. Sorry!
I will now continue to work on this patch.

@wjones127 (Member)

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help us evaluate these changes.

@zhixingheyi-tian (Contributor Author)

Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help us evaluate these changes.

Thanks!

@wjones127 (Member)

@ursabot please benchmark command=cpp-micro --suite-filter=parquet-arrow-reader-writer-benchmark

@ursabot commented Jan 10, 2023

Benchmark runs are scheduled for baseline = f82501e and contender = 19e2ba3. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Only ['lang', 'name'] filters are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished ⬇️0.1% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 19e2ba3b ursa-thinkcentre-m75q
[Finished] f82501e7 ec2-t3-xlarge-us-east-2
[Failed] f82501e7 test-mac-arm
[Finished] f82501e7 ursa-i9-9960x
[Finished] f82501e7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@wjones127 (Member)

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
    lambda: pa.array([0, None] * 10),
    lambda: pa.array([0, None] * 10).dictionary_encode(),
    lambda: pa.array(["", None] * 10),
    lambda: pa.array(["", None] * 10).dictionary_encode(),
])
@pytest.mark.parametrize('use_dictionary', [False, True])
@pytest.mark.parametrize('read_dictionary', [False, True])
def test_buffer_contents(
        array_factory, use_dictionary, read_dictionary, use_legacy_dataset
):
    # Test that null values are deterministically initialized to zero
    # after a roundtrip through Parquet.
    # See ARROW-8006 and ARROW-8011.
    orig_table = pa.Table.from_pydict({"col": array_factory()})
    bio = io.BytesIO()
    pq.write_table(orig_table, bio, use_dictionary=True)
    bio.seek(0)
    read_dictionary = ['col'] if read_dictionary else None
    table = pq.read_table(bio, use_threads=False,
                          read_dictionary=read_dictionary,
                          use_legacy_dataset=use_legacy_dataset)
    for col in table.columns:
        [chunk] = col.chunks
        buf = chunk.buffers()[1]
        assert buf.to_pybytes() == buf.size * b"\0"

Perhaps we need to make sure that if there are null values, we don't copy those parts of the buffer?
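
One way to read the test's requirement (a hypothetical standalone sketch, not the PR code): when decoding with a validity bitmap, a null slot should copy no bytes and simply repeat the previous offset, so untouched regions of the pre-allocated, zero-initialized buffers stay zero.

#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// offsets must have is_valid.size() + 1 slots; data is zero-initialized and
// large enough for all decoded bytes. decoded holds (pointer, length) pairs
// for the non-null values only, in order.
void FillWithNulls(const std::vector<std::pair<const uint8_t*, int32_t>>& decoded,
                   const std::vector<bool>& is_valid, int32_t* offsets, uint8_t* data) {
  int32_t pos = 0;
  size_t next = 0;
  for (size_t i = 0; i < is_valid.size(); ++i) {
    offsets[i] = pos;
    if (is_valid[i]) {
      std::memcpy(data + pos, decoded[next].first, decoded[next].second);
      pos += decoded[next].second;
      ++next;
    }
    // Null slot: nothing is copied, and offsets[i + 1] will equal offsets[i].
  }
  offsets[is_valid.size()] = pos;
}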

// Helper data structure for accumulating builder chunks
typename EncodingTraits<ByteArrayType>::Accumulator accumulator_;

int32_t bianry_length_ = 0;
Member:

It's a pre-existing issue, but could you fix this variable name?

Suggested change
int32_t bianry_length_ = 0;
int32_t binary_length_ = 0;

Contributor Author:

done

// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

static constexpr int32_t kDefaultBinaryPerRowSzie = 20;
Member:

Suggested change
static constexpr int32_t kDefaultBinaryPerRowSzie = 20;
static constexpr int32_t kDefaultBinaryPerRowSize = 20;

Contributor Author:

done

void ResetValues() {
  if (values_written_ > 0) {
    // Resize to 0, but do not shrink to fit
    PARQUET_THROW_NOT_OK(valid_bits_->Resize(0, false));
Member:

I ran the unit tests (parquet-arrow-test) with a debugger, and found this branch was never hit. Does that seem right? Could you add a test that validates this branch?

Contributor Author:

This just follows the existing code here:

void ResetValues() {
  if (values_written_ > 0) {
    // Resize to 0, but do not shrink to fit
    if (uses_values_) {
      PARQUET_THROW_NOT_OK(values_->Resize(0, /*shrink_to_fit=*/false));
    }

@wjones127 (Member)

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

@zhixingheyi-tian (Contributor Author)

@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

I have rebased. For performance numbers, please refer to #14353 (comment).

@zhixingheyi-tian (Contributor Author)

The failures in the continuous-integration/appveyor/pr seem relevant. This test is failing:

@parametrize_legacy_dataset
@pytest.mark.parametrize('array_factory', [
    lambda: pa.array([0, None] * 10),
    lambda: pa.array([0, None] * 10).dictionary_encode(),
    lambda: pa.array(["", None] * 10),
    lambda: pa.array(["", None] * 10).dictionary_encode(),
])
@pytest.mark.parametrize('use_dictionary', [False, True])
@pytest.mark.parametrize('read_dictionary', [False, True])
def test_buffer_contents(
        array_factory, use_dictionary, read_dictionary, use_legacy_dataset
):
    # Test that null values are deterministically initialized to zero
    # after a roundtrip through Parquet.
    # See ARROW-8006 and ARROW-8011.
    orig_table = pa.Table.from_pydict({"col": array_factory()})
    bio = io.BytesIO()
    pq.write_table(orig_table, bio, use_dictionary=True)
    bio.seek(0)
    read_dictionary = ['col'] if read_dictionary else None
    table = pq.read_table(bio, use_threads=False,
                          read_dictionary=read_dictionary,
                          use_legacy_dataset=use_legacy_dataset)
    for col in table.columns:
        [chunk] = col.chunks
        buf = chunk.buffers()[1]
        assert buf.to_pybytes() == buf.size * b"\0"

Perhaps we need to make sure that if there are null values that we don't copy those parts of the buffer?

I haven't touched pyarrow before. pyarrow is built on arrow-cpp underneath, right?
How can I reproduce the issue easily in C++?
Thanks!

@wjones127 wjones127 self-requested a review January 20, 2023 16:55
@wjones127 (Member)

How to reproduce the issue easily in cpp?

I looked at it a bit more, and I'm actually not sure. It seems to be verifying that the offsets buffer is all zeros when the array is composed only of empty strings and nulls, which I can't get to fail in C++. I can help look at that further once the other feedback is addressed in this PR.

@wjones127 (Member) left a comment

The performance is looking promising. I've looked through part of it, and have some suggestions for simplifying the code. Will look through the rest soon.

Comment on lines +379 to +380
bool hasCal_average_len_ = false;
int64_t binary_per_row_length_ = kDefaultBinaryPerRowSize;
Member:

First, I think this would be clearer as a std::optional, rather than a boolean on the side.
Second, please document the purpose of these fields in the header file.

Suggested change
bool hasCal_average_len_ = false;
int64_t binary_per_row_length_ = kDefaultBinaryPerRowSize;
/// \brief Typical size of single binary value, used for pre-allocating value buffer.
///
/// Before this is set, kDefaultBinaryPerRowSize is used. After the first
/// batch of values, this is set to the size of the values buffer divided by
/// the number of values.
std::optional<int64_t> binary_per_row_length_ = std::nullopt;

}
std::shared_ptr<ResizableBuffer> ReleaseOffsets() override {
auto result = offset_;
if (ARROW_PREDICT_FALSE(!hasCal_average_len_)) {
Member:

if we make it an optional:

Suggested change
if (ARROW_PREDICT_FALSE(!hasCal_average_len_)) {
if (ARROW_PREDICT_FALSE(!binary_per_row_length_.has_value())) {

Comment on lines +2044 to +2045
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * binary_per_row_length_, false));
Member:

if we make this an optional:

Suggested change
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * binary_per_row_length_, false));
int64_t per_row_length = binary_per_row_length_.value_or(kDefaultBinaryPerRowSize);
PARQUET_THROW_NOT_OK(
values_->Resize(new_values_capacity * per_row_length, false));

// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

static constexpr int32_t kDefaultBinaryPerRowSize = 20;
Member:

Since this corresponds to binary_per_row_length_, could we make the names match? I'm thinking "bytes per row" is the best description here:

Suggested change
static constexpr int32_t kDefaultBinaryPerRowSize = 20;
static constexpr int32_t kDefaultBinaryBytesPerRow = 20;

(Also change binary_per_row_length_ to binary_bytes_per_row_)

Member:

It would be better to add a comment here.

}
}

std::shared_ptr<ResizableBuffer> ReleaseOffsets() override { return nullptr; }
Member:

Suggested change
std::shared_ptr<ResizableBuffer> ReleaseOffsets() override { return nullptr; }
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() { return nullptr; }

}

void ReserveValues(int64_t extra_values) {
void ReserveValues(int64_t extra_values) override {
Member:

Suggested change
void ReserveValues(int64_t extra_values) override {
virtual void ReserveValues(int64_t extra_values) {

Comment on lines +306 to +307
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() = 0;

Member:

Suggested change
virtual std::shared_ptr<ResizableBuffer> ReleaseOffsets() = 0;

Member:

Same here for a comment.






int32_t* offset,
std::shared_ptr<::arrow::ResizableBuffer>& values,
int64_t valid_bits_offset, int* out_values_decoded,
int32_t* bianry_length) {
Member:

Suggested change
int32_t* bianry_length) {
int32_t* binary_length) {

Member:

There are similar typos below.

return Status::OK();
}

Status DecodeArrowDenseNonNull_opt(int num_values, int32_t* offset,
Member:

Can we avoid this kind of name and give a meaningful one instead?

while (values_decoded < num_values) {
int32_t batch_size = std::min<int32_t>(kBufferSize, num_values - values_decoded);
int num_indices = idx_decoder_.GetBatch(indices, batch_size);
if (num_indices == 0) ParquetException::EofException();
Member:

It would be helpful to provide a concrete error message.


std::vector<std::shared_ptr<Buffer>> buffers = {ReleaseIsValid(), ReleaseOffsets(),
ReleaseValues()};
auto data = std::make_shared<::arrow::ArrayData>(
::arrow::binary(), values_written(), buffers, null_count());
Member:

Is ::arrow::binary() correct? Will it cause a type mismatch if the actual type is ::arrow::utf8()? For example, when checking equality with a StringArray, the type identity may be broken if one side is binary and the other is utf8.


@amol- (Member) commented Mar 30, 2023

Closing because it has been untouched for a while; if it's still relevant, feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@wgtmac (Member) commented Mar 31, 2023

@zhixingheyi-tian Do you plan to update this?
