Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6572: [C++] Fix Parquet decoding returning uninitialized data #5392

Closed

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Sep 16, 2019

This happens when reading RLE-encoded dict-encoded data with nulls.
Make sure we zero-initialize the data array slots corresponding to null entries.

This happens when reading RLE-encoded dict-encoded data with nulls.
Make sure we zero-initialize the data array slots corresponding to null entries.
@pitrou
Copy link
Member Author

pitrou commented Sep 16, 2019

@emkornfield @wesm

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -738,7 +738,13 @@ Status TransferInt96(RecordReader* reader, MemoryPool* pool,
RETURN_NOT_OK(::arrow::AllocateBuffer(pool, length * sizeof(int64_t), &data));
auto data_ptr = reinterpret_cast<int64_t*>(data->mutable_data());
for (int64_t i = 0; i < length; i++) {
*data_ptr++ = Int96GetNanoSeconds(values[i]);
if (values[i].value[2] == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why 0 is an invalid value (I guess this is specific semantics of int96 type)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not invalid per se, it just doesn't convert to a valid int64_t timestamp.
A better fix might to be to detect all non-convertible values, and raise an error or use a placeholder, but that's overkill for a deprecated feature IMHO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, OK.

@emkornfield
Copy link
Contributor

LGTM

@emkornfield
Copy link
Contributor

I think this fixes #4607 as well.

@pitrou
Copy link
Member Author

pitrou commented Sep 16, 2019

I think this fixes #4607 as well.

Most likely indeed.

@pitrou
Copy link
Member Author

pitrou commented Sep 16, 2019

@pitrou pitrou closed this in 3bf4d80 Sep 16, 2019
@pitrou pitrou deleted the ARROW-6572-parquet-uninitialized-data branch September 16, 2019 18:00
@@ -433,6 +436,8 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* out, int b
DCHECK_GE(bit_width_, 0);
int values_read = 0;
int remaining_nulls = null_count;
T zero;
memset(&zero, 0, sizeof(T));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes class-memaccess warning with g++ 9.2.1:

[1/8] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o
FAILED: src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o 
/usr/bin/c++  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H -DPARQUET_EXPORTING -Isrc -I../src -isystem jemalloc_ep-prefix/src -isystem flatbuffers_ep-prefix/src/flatbuffers_ep-install/include -isystem ../thirdparty/cares_ep-install/include -isystem ../thirdparty/hadoop/include -isystem orc_ep-install/include -Wno-noexcept-type  -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2  -g -fPIC   -std=gnu++11 -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -MF src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o.d -o src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -c ../src/parquet/encoding.cc
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20:   required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::FIXED_LEN_BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7:   required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::FixedLenByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
  440 |   memset(&zero, 0, sizeof(T));
      |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
                 from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:515:8: note: 'struct parquet::FixedLenByteArray' declared here
  515 | struct FixedLenByteArray {
      |        ^~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20:   required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7:   required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::ByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
  440 |   memset(&zero, 0, sizeof(T));
      |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
                 from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:495:8: note: 'struct parquet::ByteArray' declared here
  495 | struct ByteArray {
      |        ^~~~~~~~~
cc1plus: all warnings being treated as errors

Can we use T zero = {} instead of memset() here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is memset(static_cast<void*>(&zero), 0, sizeof(T)) better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't know that T zero = {} caused zero initialization. We can use that then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou Do you want to submit a PR or should I do it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do because I can confirm the fix locally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created: #5414

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants