-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6572: [C++] Fix Parquet decoding returning uninitialized data #5392
Conversation
This happens when reading RLE-encoded dict-encoded data with nulls. Make sure we zero-initialize the data array slots corresponding to null entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -738,7 +738,13 @@ Status TransferInt96(RecordReader* reader, MemoryPool* pool, | |||
RETURN_NOT_OK(::arrow::AllocateBuffer(pool, length * sizeof(int64_t), &data)); | |||
auto data_ptr = reinterpret_cast<int64_t*>(data->mutable_data()); | |||
for (int64_t i = 0; i < length; i++) { | |||
*data_ptr++ = Int96GetNanoSeconds(values[i]); | |||
if (values[i].value[2] == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why 0 is an invalid value (I guess this is specific semantics of int96 type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not invalid per se, it just doesn't convert to a valid int64_t
timestamp.
A better fix might to be to detect all non-convertible values, and raise an error or use a placeholder, but that's overkill for a deprecated feature IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, OK.
LGTM |
I think this fixes #4607 as well. |
Most likely indeed. |
@@ -433,6 +436,8 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* out, int b | |||
DCHECK_GE(bit_width_, 0); | |||
int values_read = 0; | |||
int remaining_nulls = null_count; | |||
T zero; | |||
memset(&zero, 0, sizeof(T)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes class-memaccess
warning with g++ 9.2.1:
[1/8] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o
FAILED: src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o
/usr/bin/c++ -DARROW_EXTRA_ERROR_CONTEXT -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H -DPARQUET_EXPORTING -Isrc -I../src -isystem jemalloc_ep-prefix/src -isystem flatbuffers_ep-prefix/src/flatbuffers_ep-install/include -isystem ../thirdparty/cares_ep-install/include -isystem ../thirdparty/hadoop/include -isystem orc_ep-install/include -Wno-noexcept-type -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -fPIC -std=gnu++11 -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -MF src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o.d -o src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -c ../src/parquet/encoding.cc
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20: required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::FIXED_LEN_BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7: required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::FixedLenByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
440 | memset(&zero, 0, sizeof(T));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:515:8: note: 'struct parquet::FixedLenByteArray' declared here
515 | struct FixedLenByteArray {
| ^~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20: required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7: required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::ByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
440 | memset(&zero, 0, sizeof(T));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:495:8: note: 'struct parquet::ByteArray' declared here
495 | struct ByteArray {
| ^~~~~~~~~
cc1plus: all warnings being treated as errors
Can we use T zero = {}
instead of memset()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is memset(static_cast<void*>(&zero), 0, sizeof(T))
better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't know that T zero = {}
caused zero initialization. We can use that then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou Do you want to submit a PR or should I do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do because I can confirm the fix locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created: #5414
This happens when reading RLE-encoded dict-encoded data with nulls.
Make sure we zero-initialize the data array slots corresponding to null entries.