-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5618: [WIP] [C++] [Parquet] Check valid mask before converting Int96 timestamps #4607
Conversation
On reading Parquet files, when the decoders re-expand value arrays that have had their null values removed, the contents of the value slots introduced for the null-valued records is unreliable (unsurprisingly). In the case of the Int96 type, after decode, a calculation is immediately performed for each value in the entire re-expanded array, including the "null" values. These values are often zero, where the calculation expects a value (the Julian date field) significantly larger than zero, resulting in integer overflow in the calculation. I have no evidence that non-null values are ever incorrectly calculated, so this is an entirely silent bug, except to UB checkers. The most direct fix seems to be to interrogate the valid mask (which is available) for each value before attempting the calculation, and that's the approach shown in this patch. However, although this fixes the problem for simple, non-nested columns (see new test), I'm not sure that there is enough information available in this function's context to calculate a valid_bits_offset argument in all cases: I've set it zero here, but I'm not well enough versed in this code to assert that that's the correct value when reading the valid mask for all column type variations, or how to obtain the correct value. And, obviously, I don't want to introduce a "fix" to a silent bug that actually causes real errors. Another possible approach is to continue to run the conversion calculation on all values, but check for sane parameters (Julian date >> 0) in the calculation function Int96GetNanoSeconds() before performing it. This would affect both null and non-null values and I'm not sure I know what the sane value range on the Julian day should be (other than large enough to prevent overflow). I'm reluctant to take any risks to fix a silent bug on a deprecated feature, but our CI clang build catches it. Any thoughts or guidance on this? |
Can you rebase this? Can then take a closer look |
Rebased. |
Codecov Report
@@ Coverage Diff @@
## master #4607 +/- ##
=========================================
Coverage ? 89.08%
=========================================
Files ? 720
Lines ? 100146
Branches ? 0
=========================================
Hits ? 89214
Misses ? 10932
Partials ? 0
Continue to review full report at Codecov.
|
ed180da
to
85fe336
Compare
|
@tpboudreau do you have time to update this? it would be nice to eliminate sources of UB |
@tpboudreau You might take a look at #5392, which should fix this issue, including the fact that Parquet yields uninitialized data in the presence of nulls (though I cannot say it if fixes all sources of uninitialized data in Parquet). |
Closing as obsolete. |
No description provided.