test: Add stats based parquet file filter test #16709
PingLiuPing wants to merge 1 commit into facebookincubator:main
Conversation
Without #16700, the test case reports the following error:

```cpp
// Filter c0 > 1000: neither file has values > 1000, both files skipped.
{
  auto plan =
      PlanBuilder(pool_.get()).tableScan(schema, {"c0 > 1000"}).planNode();
```
It seems all these tests run through the same logic and can be deduped using 3 parameters:
- filter
- expected skippedSplits
- expected processedSplits
@mbasmanova Thank you. Simplified the test by adding a lambda.
```cpp
// File 1: integers [0, 99], doubles [0.0, 99.0], strings ["a".."d"].
const vector_size_t numRows = 100;
auto file1 = TempFilePath::create()->getPath();
```
a1, a2, ... naming is an anti-pattern. Consider:

```cpp
{
  auto filePath = ...;
  auto data = ...;
  writeToParquetFile(filePath, {data}, options);
}
```
Thanks, file1 and file2 will be used to create the splits later. Refined the names.
Since these are always used together, a better pattern would be:

```cpp
std::vector<std::string> filePaths;
std::vector<RowVectorPtr> dataVectors;
{
  filePaths.push_back(...);
  dataVectors.push_back(...);
  writeToParquetFile(filePaths.back(), {dataVectors.back()}, options);
}
```
```cpp
{
  makeFlatVector<int64_t>(numRows, [](auto row) { return row + 200; }),
  makeFlatVector<double>(
      numRows, [](auto row) { return static_cast<double>(row + 200); }),
```
I think you can drop static_cast<double>
```cpp
});
writeToParquetFile(file2, {vector2}, options);
```

```cpp
auto schema = asRowType(vector1->type());
```
use RowVector::rowType()
```cpp
auto testFileSkipping = [&](const std::string& filter,
                            int32_t expectedSkipped,
                            int32_t expectedProcessed) {
  auto plan = PlanBuilder(pool_.get()).tableScan(schema, {filter}).planNode();
```
add SCOPED_TRACE(filter)
PR #16700 added support for Parquet file-level column statistics via ParquetReader::columnStatistics(). This PR adds an end-to-end test that verifies entire Parquet files are pruned during table scan when file-level column statistics allow the filter to eliminate all data in a file.