-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-47740: [C++][Parquet] Fix undefined behavior when reading invalid Parquet data #47741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@github-actions crossbow submit -g cpp |
|
@AntoinePrv Would you like to take a look? |
| // There may be remaining null if they are not greedily filled by either decoder calls | ||
| check_and_handle_fully_null_remaining(); | ||
|
|
||
| ARROW_DCHECK(batch.is_done() || exhausted()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check could trigger if the RLE-bit-packed data is invalid (for example a run of invalid size). @AntoinePrv
|
Revision: d620685 Submitted crossbow builds: ursacomputing/crossbow @ actions-0059c16459 |
|
Valgrind failure is unrelated and will be fixed by #47743 |
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm - nice work!
|
After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit 33d1f32. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…Parquet data (#47741) ### Rationale for this change Fix issues found by OSS-Fuzz when invalid Parquet data is fed to the Parquet reader: * https://issues.oss-fuzz.com/issues/447262173 * https://issues.oss-fuzz.com/issues/447480433 * https://issues.oss-fuzz.com/issues/447490896 * https://issues.oss-fuzz.com/issues/447693724 * https://issues.oss-fuzz.com/issues/447693728 * https://issues.oss-fuzz.com/issues/449498800 ### Are these changes tested? Yes, using the updated fuzz regression files from apache/arrow-testing#115 ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.) * GitHub Issue: #47740 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
| typename EncodingTraits<DType>::Accumulator* out, | ||
| int* out_num_values) { | ||
| std::vector<ByteArray> values(num_values); | ||
| std::vector<ByteArray> values(num_values - null_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this was not a problem in itself, it was just allocating too much memory :)
…valid Parquet data (apache#47741) ### Rationale for this change Fix issues found by OSS-Fuzz when invalid Parquet data is fed to the Parquet reader: * https://issues.oss-fuzz.com/issues/447262173 * https://issues.oss-fuzz.com/issues/447480433 * https://issues.oss-fuzz.com/issues/447490896 * https://issues.oss-fuzz.com/issues/447693724 * https://issues.oss-fuzz.com/issues/447693728 * https://issues.oss-fuzz.com/issues/449498800 ### Are these changes tested? Yes, using the updated fuzz regression files from apache/arrow-testing#115 ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.) * GitHub Issue: apache#47740 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Fix issues found by OSS-Fuzz when invalid Parquet data is fed to the Parquet reader:
Are these changes tested?
Yes, using the updated fuzz regression files from apache/arrow-testing#115
Are there any user-facing changes?
No.
This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)