Skip to content

test: Add stats based parquet file filter test#16709

Closed
PingLiuPing wants to merge 1 commit intofacebookincubator:mainfrom
PingLiuPing:lp_add_parquet_test
Closed

test: Add stats based parquet file filter test#16709
PingLiuPing wants to merge 1 commit intofacebookincubator:mainfrom
PingLiuPing:lp_add_parquet_test

Conversation

@PingLiuPing
Copy link
Copy Markdown
Collaborator

PR #16700 added support for Parquet file-level column statistics via ParquetReader::columnStatistics(). This PR adds an end-to-end test that
verifies entire Parquet files are pruned during table scan when file-level column statistics allow the filter to eliminate all data in a file.

@PingLiuPing PingLiuPing requested a review from mbasmanova March 10, 2026 21:51
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 10, 2026
@netlify
Copy link
Copy Markdown

netlify bot commented Mar 10, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 17edf92
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69b1957cc46e5f000816e6b6

@PingLiuPing
Copy link
Copy Markdown
Collaborator Author

Without #16700, test case reports following error:

/velox/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:1711: Failure
Expected equality of these values:
  getRuntimeStats(task)["skippedSplits"].sum
    Which is: 0
  2

/velox/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:1712: Failure
Expected equality of these values:
  getRuntimeStats(task)["processedSplits"].sum
    Which is: 2
  0
...

// Filter c0 > 1000: neither file has values > 1000, both files skipped.
{
auto plan =
PlanBuilder(pool_.get()).tableScan(schema, {"c0 > 1000"}).planNode();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all these tests run through the same logic and can be deduped using 3 parameters:

  • filter
  • expected skippedSplits
  • expected processedSplits

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Thank you. Simplified the test by adding a lambda.

@PingLiuPing PingLiuPing force-pushed the lp_add_parquet_test branch from 3853840 to 11a829a Compare March 11, 2026 10:16

// File 1: integers [0, 99], doubles [0.0, 99.0], strings ["a".."d"].
const vector_size_t numRows = 100;
auto file1 = TempFilePath::create()->getPath();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a1,a2,...naming is anti-pattern

{
  auto filePath =...
  auto data = ...
  writeToParquetFile(filePath, {data}, options);
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, file1 and file2 will be used to create the splits later.
Refined the name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these always used together, a better pattern would be:

std::vector filePaths;
std::vector dataVectors;

{
  filePaths.push_back(...);
  data.push_back(...);
  writeToParquetFile(filePaths.back(), {data.back()}, options);
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova thanks, updated.

{
makeFlatVector<int64_t>(numRows, [](auto row) { return row + 200; }),
makeFlatVector<double>(
numRows, [](auto row) { return static_cast<double>(row + 200); }),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can drop static_cast<double>

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks

});
writeToParquetFile(file2, {vector2}, options);

auto schema = asRowType(vector1->type());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use RowVector::rowType()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

auto testFileSkipping = [&](const std::string& filter,
int32_t expectedSkipped,
int32_t expectedProcessed) {
auto plan = PlanBuilder(pool_.get()).tableScan(schema, {filter}).planNode();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add SCOPED_TRACE(filter)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_add_parquet_test branch 2 times, most recently from 5e51ade to ec7ddd2 Compare March 11, 2026 16:16
@PingLiuPing PingLiuPing force-pushed the lp_add_parquet_test branch from ec7ddd2 to 17edf92 Compare March 11, 2026 16:16
@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Mar 11, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 15, 2026

@pedroerp has imported this pull request. If you are a Meta employee, you can view this in D96560581.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 16, 2026

@pedroerp merged this pull request in 4609a36.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants