
ARROW-5618: [WIP] [C++] [Parquet] Check valid mask before converting Int96 timestamps #4607

Closed
wants to merge 1 commit

Conversation

tpboudreau (Contributor)

No description provided.

@tpboudreau (Contributor, Author)

When reading Parquet files, after the decoders re-expand value arrays that have had their null values removed, the contents of the value slots introduced for the null-valued records are (unsurprisingly) unreliable. In the case of the Int96 type, a conversion is performed immediately after decode for every value in the re-expanded array, including the "null" slots. Those slots often contain zero, while the conversion expects a Julian date field significantly larger than zero, resulting in integer overflow in the conversion. I have no evidence that non-null values are ever calculated incorrectly, so this is an entirely silent bug, except to UB checkers.
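For reference, the conversion in question looks roughly like this (a simplified sketch modeled on parquet-cpp's `Int96GetNanoSeconds()`; the struct layout and constant names here are approximations, not the exact source):

```cpp
#include <cstdint>
#include <cstring>

// Approximation of parquet::Int96: 12 bytes stored as three 32-bit words.
// Words 0-1 hold the nanoseconds within the day; word 2 holds the Julian day.
struct Int96 {
  uint32_t value[3];
};

constexpr int64_t kJulianEpochOffsetDays = 2440588;  // Julian day of 1970-01-01
constexpr int64_t kNanosecondsPerDay = 86400LL * 1000 * 1000 * 1000;

int64_t Int96GetNanoSeconds(const Int96& i96) {
  int64_t days_since_epoch =
      static_cast<int64_t>(i96.value[2]) - kJulianEpochOffsetDays;
  int64_t nanoseconds = 0;
  std::memcpy(&nanoseconds, i96.value, sizeof(nanoseconds));
  // For a null slot the decoder typically leaves value[2] == 0, so
  // days_since_epoch is roughly -2.44 million; the multiplication below then
  // overflows int64_t, which is the UB the sanitizer reports.
  return days_since_epoch * kNanosecondsPerDay + nanoseconds;
}
```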

The most direct fix seems to be to interrogate the valid mask (which is available) for each value before attempting the conversion, and that's the approach shown in this patch. Although this fixes the problem for simple, non-nested columns (see the new test), I'm not sure there is enough information in this function's context to compute a correct valid_bits_offset argument in all cases: I've set it to zero here, but I'm not well enough versed in this code to assert that zero is correct when reading the valid mask for every column type variation, or to know how to obtain the correct value. And, obviously, I don't want to introduce a "fix" for a silent bug that actually causes real errors.
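Concretely, the idea is along these lines (a hedged sketch reusing the types from the snippet above, not the actual diff; `ConvertInt96Column` and `BitIsSet` are illustrative names, and the valid_bits_offset handling is exactly the part I'm unsure about):

```cpp
// Check the validity bitmap (LSB-first, one bit per slot, as Arrow lays it out)
// before converting, so null slots never feed garbage into the multiplication.
inline bool BitIsSet(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

void ConvertInt96Column(const Int96* values, int64_t num_values,
                        const uint8_t* valid_bits, int64_t valid_bits_offset,
                        int64_t* out) {
  for (int64_t i = 0; i < num_values; ++i) {
    if (BitIsSet(valid_bits, valid_bits_offset + i)) {
      out[i] = Int96GetNanoSeconds(values[i]);
    } else {
      // Null slot: write a deterministic value; downstream consumers ignore it
      // because the bitmap marks the slot null.
      out[i] = 0;
    }
  }
}
```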

Another possible approach is to continue running the conversion on all values, but to check for sane parameters (Julian date >> 0) inside Int96GetNanoSeconds() before performing it. This would affect both null and non-null values, and I'm not sure what the sane range for the Julian day should be (other than large enough to prevent overflow).
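If that route seems safer, a guard inside the conversion might look like this (again only a sketch reusing the names above; the bound below merely prevents the int64 overflow, and choosing a tighter, semantically sane range is the open question):

```cpp
#include <limits>

// Alternative: keep converting every slot, but reject Julian days whose offset
// from the Unix epoch cannot be expressed in int64 nanoseconds. A null slot
// with value[2] == 0 falls far outside this window and takes the early return.
int64_t Int96GetNanoSecondsChecked(const Int96& i96) {
  constexpr int64_t kMaxSafeDays =
      std::numeric_limits<int64_t>::max() / kNanosecondsPerDay;  // ~106,751 days
  const int64_t days_since_epoch =
      static_cast<int64_t>(i96.value[2]) - kJulianEpochOffsetDays;
  if (days_since_epoch >= kMaxSafeDays || days_since_epoch <= -kMaxSafeDays) {
    return 0;  // or signal an error; the point is to avoid the UB multiplication
  }
  return Int96GetNanoSeconds(i96);
}
```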

I'm reluctant to take any risks to fix a silent bug on a deprecated feature, but our CI clang build catches it. Any thoughts or guidance on this?

@wesm (Member) commented Jun 19, 2019

Can you rebase this? I can then take a closer look.

@tpboudreau (Contributor, Author)

Rebased.

@codecov-io commented Jul 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@4be02c7).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #4607   +/-   ##
=========================================
  Coverage          ?   89.08%           
=========================================
  Files             ?      720           
  Lines             ?   100146           
  Branches          ?        0           
=========================================
  Hits              ?    89214           
  Misses            ?    10932           
  Partials          ?        0
Impacted Files                                       Coverage        Δ
cpp/src/parquet/arrow/arrow-reader-writer-test.cc    93.97% <100%>   (ø)
cpp/src/parquet/arrow/reader.cc                      85.88% <100%>   (ø)

Last update 4be02c7...a264eb9.

@kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336, on July 22, 2019 at 19:29
@fsaintjacques (Contributor)

  1. Could you explain the re-expand logic? I don't really follow it.
  2. This logic is now found in cpp/src/parquet/arrow/reader_internal.cc; the rebase should be simple.
  3. It seems other functions (TransferDate64) would also suffer from this.

@emkornfield (Contributor)

@tpboudreau do you have time to update this? It would be nice to eliminate sources of UB.

@pitrou (Member) commented Sep 16, 2019

@tpboudreau You might take a look at #5392, which should fix this issue, including the fact that Parquet yields uninitialized data in the presence of nulls (though I cannot say whether it fixes all sources of uninitialized data in Parquet).

@pitrou (Member) commented Sep 18, 2019

Closing as obsolete.
