Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases #22057

Closed
asfimport opened this issue Jun 15, 2019 · 4 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jun 15, 2019

When storing Arrow timestamps in Parquet files using the Int96 storage format, certain combinations of array lengths and validity bitmasks cause an integer overflow error on read.  It's not immediately clear whether the Arrow/Parquet writer is storing zeroes when it should be storing positive values or the reader is attempting to calculate a nanoseconds value inappropriately from zeroed inputs (perhaps missing the null bit flag).  Also not immediately clear why only certain length columns seem to be affected.

Probably the quickest way to reproduce this undefined behavior is to alter the existing unit test UseDeprecatedInt96 (in file .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling its column lengths (repeating the same values), followed by 'make unittest' using clang-7 with sanitizers enabled.  (Here's a patch applicable to current master that changes the test as described: [1]; I used the following cmake command to build my environment: [2].)  You should get a log something like [3].  If requested, I'll see if I can put together a stand-alone minimal test case that induces the behavior.

The quick-hack at [4] will prevent integer overflows, but this is only included to confirm the proximate cause of the bug: the Julian days field of the Int96 appears to be zero, when a strictly positive number is expected.

I've assigned the issue to myself and I'll start looking into the root cause of this.

[1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
[2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
[3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
[4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f

Reporter: TP Boudreau / @tpboudreau
Assignee: TP Boudreau / @tpboudreau

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-5618. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

TP Boudreau / @tpboudreau:
I marked this issue as minor, because the defect does not appear to produce erroneous results.  Tests that trigger UB succeed in all environments except those where the library is built with the UB sanitizer enabled.

@asfimport
Copy link
Collaborator Author

TP Boudreau / @tpboudreau:
Ugh, sorry, forgot all about this PR and the comment from earlier this month slipped through the cracks.

I'll revisit this issue sometime this week.

@asfimport
Copy link
Collaborator Author

TP Boudreau / @tpboudreau:
(Sorry if this is a duplicate comment – first attempt doesn't seem to have posted.)

This is issue fell through the cracks on my end – I'll look into it this weekend.

@asfimport
Copy link
Collaborator Author

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed to based on resolution field value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant