When storing Arrow timestamps in Parquet files using the Int96 storage format, certain combinations of array lengths and validity bitmasks cause an integer overflow error on read. It's not immediately clear whether the Arrow/Parquet writer is storing zeroes when it should be storing positive values, or whether the reader is inappropriately calculating a nanoseconds value from zeroed inputs (perhaps missing the null bit flag). It's also not immediately clear why only columns of certain lengths seem to be affected.
Probably the quickest way to reproduce this undefined behavior is to alter the existing unit test UseDeprecatedInt96 (in file .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling its column lengths (repeating the same values), followed by 'make unittest' using clang-7 with sanitizers enabled. (Here's a patch applicable to current master that changes the test as described: [1]; I used the following cmake command to build my environment: [2].) You should get a log something like [3]. If requested, I'll see if I can put together a stand-alone minimal test case that induces the behavior.
The quick hack at [4] will prevent the integer overflows, but it is included only to confirm the proximate cause of the bug: the Julian days field of the Int96 appears to be zero when a strictly positive number is expected.
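To illustrate why a zero Julian days field overflows, here is a minimal sketch. It assumes the usual Int96 layout (a Julian day number plus nanoseconds within that day) and the conventional conversion to nanoseconds since the Unix epoch. The function name `Int96ToUnixNanos` and the constants are hypothetical, chosen for illustration; this is neither Arrow's actual reader code nor the patch at [4]. With a Julian day of zero, the day offset is -2440588, and multiplying that by the number of nanoseconds per day is far outside the int64 range, so the sketch reports failure instead of performing the multiplication.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical names for illustration only; not taken from the Arrow sources.
constexpr int64_t kJulianToUnixEpochDays = 2440588;  // Julian day number of 1970-01-01
constexpr int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Converts an Int96 timestamp (Julian day + nanoseconds within the day) to
// nanoseconds since the Unix epoch. Returns false, leaving *out untouched,
// when the result would not fit in int64_t.
bool Int96ToUnixNanos(uint32_t julian_day, uint64_t nanos_of_day, int64_t* out) {
  assert(out != nullptr);
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
  constexpr int64_t kMin = std::numeric_limits<int64_t>::min();

  const int64_t days = static_cast<int64_t>(julian_day) - kJulianToUnixEpochDays;
  // Reject day counts whose product with kNanosPerDay cannot fit in int64_t.
  // A zeroed Julian day gives days == -2440588, which is rejected here.
  if (days > kMax / kNanosPerDay || days < kMin / kNanosPerDay) return false;
  const int64_t day_nanos = days * kNanosPerDay;

  // Reject a nanoseconds-of-day value that would push the sum past INT64_MAX.
  if (nanos_of_day > static_cast<uint64_t>(kMax)) return false;
  const int64_t nanos = static_cast<int64_t>(nanos_of_day);
  if (day_nanos > kMax - nanos) return false;

  *out = day_nanos + nanos;
  return true;
}
```

Under these assumptions, a reader could use a check like this to skip or flag zeroed slots (for example, null entries) rather than computing a value from them. Whether the zeroes come from the writer or from the reader ignoring the validity bitmap is still the open question.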
I've assigned the issue to myself and I'll start looking into the root cause of this.
TP Boudreau / @tpboudreau:
I marked this issue as minor, because the defect does not appear to produce erroneous results. Tests that trigger UB succeed in all environments except those where the library is built with the UB sanitizer enabled.
[1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
[2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
[3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
[4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f
Reporter: TP Boudreau / @tpboudreau
Assignee: TP Boudreau / @tpboudreau
Note: This issue was originally created as ARROW-5618. Please see the migration documentation for further details.