test: Add stats based parquet file filter test by PingLiuPing · Pull Request #16709 · facebookincubator/velox

PingLiuPing · 2026-03-10T21:51:32Z

PR #16700 added support for Parquet file-level column statistics via ParquetReader::columnStatistics(). This PR adds an end-to-end test that
verifies entire Parquet files are pruned during table scan when file-level column statistics allow the filter to eliminate all data in a file.
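
A minimal sketch of the pruning decision this test exercises, not Velox's actual implementation. `ColumnRange`, `canSkipForGreaterThan`, and `countSplits` are hypothetical names introduced for illustration. The ranges in the usage note assume each of the two test files holds 100 integer rows, [0, 99] and [200, 299].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for file-level column statistics: the smallest and
// largest value of one integer column across the whole file.
struct ColumnRange {
  int64_t min;
  int64_t max;
};

// Returns true when the filter `column > bound` cannot match any row of a file
// with the given range, so the file can be skipped without reading its data.
bool canSkipForGreaterThan(const ColumnRange& range, int64_t bound) {
  assert(range.min <= range.max);
  return range.max <= bound;
}

// Counters loosely mirroring the "skippedSplits" and "processedSplits"
// runtime stats checked by the test (one split per file is assumed here).
struct ScanCounts {
  int32_t skipped{0};
  int32_t processed{0};
};

// Applies the skip decision to every file and tallies the outcome.
ScanCounts countSplits(const std::vector<ColumnRange>& files, int64_t bound) {
  ScanCounts counts;
  for (const auto& file : files) {
    if (canSkipForGreaterThan(file, bound)) {
      ++counts.skipped;
    } else {
      ++counts.processed;
    }
  }
  return counts;
}
```

With the two assumed ranges, `countSplits({{0, 99}, {200, 299}}, 1000)` skips both files, which corresponds to the `c0 > 1000` case discussed in the review below.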

netlify · 2026-03-10T21:51:38Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`17edf92`
🔍 Latest deploy log	https://app.netlify.com/projects/meta-velox/deploys/69b1957cc46e5f000816e6b6

PingLiuPing · 2026-03-10T21:53:35Z

Without #16700, the test case reports the following errors:

/velox/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:1711: Failure
Expected equality of these values:
  getRuntimeStats(task)["skippedSplits"].sum
    Which is: 0
  2

/velox/velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp:1712: Failure
Expected equality of these values:
  getRuntimeStats(task)["processedSplits"].sum
    Which is: 2
  0
...

mbasmanova · 2026-03-11T00:11:35Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

+  // Filter c0 > 1000: neither file has values > 1000, both files skipped.
+  {
+    auto plan =
+        PlanBuilder(pool_.get()).tableScan(schema, {"c0 > 1000"}).planNode();


It seems all these tests run through the same logic and can be deduped using 3 parameters:

filter

expected skippedSplits

expected processedSplits

@mbasmanova Thank you. Simplified the test by adding a lambda.

mbasmanova · 2026-03-11T11:58:46Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

+
+  // File 1: integers [0, 99], doubles [0.0, 99.0], strings ["a".."d"].
+  const vector_size_t numRows = 100;
+  auto file1 = TempFilePath::create()->getPath();


a1, a2, ... naming is an anti-pattern

{ auto filePath =... auto data = ... writeToParquetFile(filePath, {data}, options); }

Thanks. file1 and file2 are used later to create the splits.
Refined the names.

Since these are always used together, a better pattern would be:

std::vector filePaths; std::vector dataVectors; { filePaths.push_back(...); dataVectors.push_back(...); writeToParquetFile(filePaths.back(), {dataVectors.back()}, options); }

@mbasmanova thanks, updated.
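
The parallel-vector pattern suggested above can be sketched in a self-contained form as follows. This is an illustration only: `WrittenFile`, `writeFile`, and `writeTestFiles` are hypothetical stand-ins (the real test uses `writeToParquetFile` and Velox row vectors), and the element types are assumptions because the review comment does not show them.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one write call: which rows went to which path.
struct WrittenFile {
  std::string path;
  std::vector<int64_t> rows;
};

// Hypothetical stand-in for writeToParquetFile; it only records the call.
void writeFile(
    const std::string& path,
    const std::vector<int64_t>& rows,
    std::vector<WrittenFile>& sink) {
  sink.push_back({path, rows});
}

std::vector<WrittenFile> writeTestFiles() {
  // Paths and data live in two parallel vectors so that later code (for
  // example, split creation) can iterate over them by index.
  std::vector<std::string> filePaths;
  std::vector<std::vector<int64_t>> dataVectors;
  std::vector<WrittenFile> written;

  // File 1: a low range of values.
  {
    filePaths.push_back("file-0");
    dataVectors.push_back({0, 1, 2});
    writeFile(filePaths.back(), dataVectors.back(), written);
  }

  // File 2: a disjoint, higher range of values.
  {
    filePaths.push_back("file-1");
    dataVectors.push_back({200, 201, 202});
    writeFile(filePaths.back(), dataVectors.back(), written);
  }

  // The two vectors must stay in step with each other.
  assert(filePaths.size() == dataVectors.size());
  return written;
}
```

Each scoped block touches only `back()`, so adding a third file means adding one more block without renaming any variables.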

mbasmanova · 2026-03-11T11:59:04Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

+      {
+          makeFlatVector<int64_t>(numRows, [](auto row) { return row + 200; }),
+          makeFlatVector<double>(
+              numRows, [](auto row) { return static_cast<double>(row + 200); }),


I think you can drop static_cast<double>

Yes, thanks
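
A small sketch of why the cast can be redundant, under the assumption that the vector factory stores the lambda's result into an element of the requested type. `makeValues` is a hypothetical, simplified analogue written for this example and is not Velox's `makeFlatVector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical analogue of a typed vector factory: calls valueAt(row) for each
// row and stores the result as T.
template <typename T, typename F>
std::vector<T> makeValues(std::size_t size, F valueAt) {
  std::vector<T> values;
  values.reserve(size);
  for (std::size_t row = 0; row < size; ++row) {
    // The integer result of valueAt is implicitly converted to T here.
    T value = valueAt(row);
    values.push_back(value);
  }
  return values;
}

// Version with the explicit cast, as in the original diff.
std::vector<double> withCast(std::size_t size) {
  return makeValues<double>(
      size, [](auto row) { return static_cast<double>(row + 200); });
}

// Version without the cast, as suggested in the review.
std::vector<double> withoutCast(std::size_t size) {
  return makeValues<double>(size, [](auto row) { return row + 200; });
}
```

Under that assumption both versions produce identical values for small integers such as these, since every value in this range is exactly representable as a `double`.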

mbasmanova · 2026-03-11T11:59:24Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

+      });
+  writeToParquetFile(file2, {vector2}, options);
+
+  auto schema = asRowType(vector1->type());


use RowVector::rowType()

mbasmanova · 2026-03-11T11:59:45Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

+  auto testFileSkipping = [&](const std::string& filter,
+                              int32_t expectedSkipped,
+                              int32_t expectedProcessed) {
+    auto plan = PlanBuilder(pool_.get()).tableScan(schema, {filter}).planNode();


add SCOPED_TRACE(filter)

meta-codesync · 2026-03-15T17:39:10Z

@pedroerp has imported this pull request. If you are a Meta employee, you can view this in D96560581.

meta-codesync · 2026-03-16T06:26:13Z

@pedroerp merged this pull request in 4609a36.

PingLiuPing requested a review from mbasmanova March 10, 2026 21:51

PingLiuPing requested a review from majetideepak as a code owner March 10, 2026 21:51

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 10, 2026

mbasmanova reviewed Mar 11, 2026

PingLiuPing force-pushed the lp_add_parquet_test branch from 3853840 to 11a829a Compare March 11, 2026 10:16

mbasmanova reviewed Mar 11, 2026

PingLiuPing force-pushed the lp_add_parquet_test branch 2 times, most recently from 5e51ade to ec7ddd2 Compare March 11, 2026 16:16

Add parquet stats based file filter test

17edf92

PingLiuPing force-pushed the lp_add_parquet_test branch from ec7ddd2 to 17edf92 Compare March 11, 2026 16:16

mbasmanova approved these changes Mar 11, 2026

mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Mar 11, 2026

meta-codesync bot closed this in 4609a36 Mar 16, 2026

facebook-github-tools bot added the Merged label Mar 16, 2026
