feat: Implement ParquetReader::columnStatistics() #16700

Closed
mbasmanova wants to merge 2 commits into facebookincubator:main from mbasmanova:export-D95950333

@mbasmanova
Contributor

Summary:
Implement ParquetReader::columnStatistics() which previously returned nullptr.
The method merges per-row-group Parquet statistics into file-level statistics
using StatisticsBuilder. Also improve the columnStatistics() API documentation
in Reader.h to clarify the index parameter semantics.
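The merge described above can be pictured with a minimal sketch. It is not the code from this PR: `RowGroupIntStats`, `FileIntStats`, `FileIntStatsBuilder`, and `mergeRowGroups` are hypothetical names, only an integer column is covered, and the rule that a row group with missing min/max makes the file-level bounds unknown is an assumption chosen to stay conservative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the statistics stored per row group for one
// integer column. Not a Velox or Parquet type.
struct RowGroupIntStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
  std::optional<uint64_t> nullCount;
  uint64_t numValues{0};
};

// Hypothetical file-level result with the same shape.
struct FileIntStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
  std::optional<uint64_t> nullCount;
  uint64_t numValues{0};
};

// Accumulates row-group statistics into file-level statistics.
class FileIntStatsBuilder {
 public:
  void merge(const RowGroupIntStats& rg) {
    numValues_ += rg.numValues;
    // A null count is only known for the file if every row group reports one.
    if (nullCount_ && rg.nullCount) {
      *nullCount_ += *rg.nullCount;
    } else {
      nullCount_.reset();
    }
    // Assumption: one row group without bounds makes file bounds unknown.
    if (!rg.min || !rg.max) {
      boundsKnown_ = false;
      return;
    }
    assert(*rg.min <= *rg.max);
    if (boundsKnown_) {
      min_ = min_ ? std::min(*min_, *rg.min) : *rg.min;
      max_ = max_ ? std::max(*max_, *rg.max) : *rg.max;
    }
  }

  FileIntStats build() const {
    FileIntStats out;
    out.numValues = numValues_;
    out.nullCount = nullCount_;
    if (boundsKnown_) {
      out.min = min_;
      out.max = max_;
    }
    return out;
  }

 private:
  uint64_t numValues_{0};
  std::optional<uint64_t> nullCount_{uint64_t{0}};
  bool boundsKnown_{true};
  std::optional<int64_t> min_;
  std::optional<int64_t> max_;
};

// Folds all row groups of a file into one file-level summary.
FileIntStats mergeRowGroups(const std::vector<RowGroupIntStats>& rowGroups) {
  FileIntStatsBuilder builder;
  for (const auto& rg : rowGroups) {
    builder.merge(rg);
  }
  return builder.build();
}
```

Dropping a bound when any row group lacks it loses some precision, but it never reports a range narrower than the data, which is what matters if the result is later used for pruning.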

Differential Revision: D95950333

@netlify

netlify bot commented Mar 10, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit dd2ba13
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69b02b0866310c0008f80d2a

@meta-cla meta-cla bot added the CLA Signed label Mar 10, 2026
@meta-codesync

meta-codesync bot commented Mar 10, 2026

@mbasmanova has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95950333.

…ookincubator#16693)

Summary:

The DWRF writer's StatisticsBuilder contains merge/addValues logic that is
format-agnostic, but lives in the DWRF writer library, creating an undesirable
cross-format dependency for consumers like Axiom and Parquet.

This change extracts the format-agnostic parts into new dwio::common classes:
- StatisticsBuilder base class with merge(), reset(), create(), createTree()
- Typed builders: Boolean, Integer, Double, String, Binary
- build() method that produces read-only ColumnStatistics snapshots

The DWRF builders now extend the common base, adding only toProto() and
proto-based build() for DWRF file format serialization.

Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder,
removing the DWRF writer dependency.
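The shape of that split can be illustrated with a small sketch. Everything in the `sketch` namespace is hypothetical and heavily simplified: the real dwio::common signatures are not shown in this PR, only Boolean and Integer builders are modeled, and the snapshot struct stands in for ColumnStatistics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <stdexcept>

namespace sketch {

enum class Kind { kBoolean, kInteger };

// Read-only snapshot produced by build(); stands in for ColumnStatistics.
struct ColumnStatsSnapshot {
  uint64_t valueCount{0};
  std::optional<int64_t> min;
  std::optional<int64_t> max;
  std::optional<uint64_t> trueCount;
};

// Format-agnostic base: counts values and defines merge/reset/build.
class StatsBuilder {
 public:
  virtual ~StatsBuilder() = default;
  virtual void merge(const StatsBuilder& other) {
    valueCount_ += other.valueCount_;
  }
  virtual void reset() { valueCount_ = 0; }
  virtual ColumnStatsSnapshot build() const {
    ColumnStatsSnapshot s;
    s.valueCount = valueCount_;
    return s;
  }
  static std::unique_ptr<StatsBuilder> create(Kind kind);

 protected:
  uint64_t valueCount_{0};
};

class IntegerStatsBuilder : public StatsBuilder {
 public:
  void addValue(int64_t v) {
    ++valueCount_;
    min_ = min_ ? std::min(*min_, v) : v;
    max_ = max_ ? std::max(*max_, v) : v;
  }
  void merge(const StatsBuilder& other) override {
    auto* o = dynamic_cast<const IntegerStatsBuilder*>(&other);
    if (o == nullptr) {
      throw std::invalid_argument("cannot merge builders of different kinds");
    }
    StatsBuilder::merge(other);
    if (o->min_) min_ = min_ ? std::min(*min_, *o->min_) : *o->min_;
    if (o->max_) max_ = max_ ? std::max(*max_, *o->max_) : *o->max_;
  }
  void reset() override {
    StatsBuilder::reset();
    min_.reset();
    max_.reset();
  }
  ColumnStatsSnapshot build() const override {
    assert(min_.has_value() == max_.has_value());
    ColumnStatsSnapshot s = StatsBuilder::build();
    s.min = min_;
    s.max = max_;
    return s;
  }

 private:
  std::optional<int64_t> min_;
  std::optional<int64_t> max_;
};

class BooleanStatsBuilder : public StatsBuilder {
 public:
  void addValue(bool v) {
    ++valueCount_;
    if (v) ++trueCount_;
  }
  void merge(const StatsBuilder& other) override {
    auto* o = dynamic_cast<const BooleanStatsBuilder*>(&other);
    if (o == nullptr) {
      throw std::invalid_argument("cannot merge builders of different kinds");
    }
    StatsBuilder::merge(other);
    trueCount_ += o->trueCount_;
  }
  void reset() override {
    StatsBuilder::reset();
    trueCount_ = 0;
  }
  ColumnStatsSnapshot build() const override {
    ColumnStatsSnapshot s = StatsBuilder::build();
    s.trueCount = trueCount_;
    return s;
  }

 private:
  uint64_t trueCount_{0};
};

// Factory: picks the typed builder for a column kind.
inline std::unique_ptr<StatsBuilder> StatsBuilder::create(Kind kind) {
  switch (kind) {
    case Kind::kBoolean:
      return std::make_unique<BooleanStatsBuilder>();
    case Kind::kInteger:
      return std::make_unique<IntegerStatsBuilder>();
  }
  throw std::invalid_argument("unsupported kind");
}

} // namespace sketch
```

In this sketch a format-specific writer would subclass the typed builders and add only its serialization method, which mirrors the role the summary assigns to toProto() in the DWRF builders.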

Differential Revision: D95885404
…16700)

Summary:

Implement ParquetReader::columnStatistics() which previously returned nullptr.
The method merges per-row-group Parquet statistics into file-level statistics
using StatisticsBuilder. Also improve the columnStatistics() API documentation
in Reader.h to clarify the index parameter semantics.

Differential Revision: D95950333
Collaborator

@PingLiuPing PingLiuPing left a comment

Thanks.
Looks like #16693 should be landed first.

@meta-codesync

meta-codesync bot commented Mar 10, 2026

This pull request has been merged in 91b2e45.

meta-codesync bot pushed a commit that referenced this pull request Mar 16, 2026
Summary:
PR #16700 added support for Parquet file-level column statistics via `ParquetReader::columnStatistics()`. This PR adds an end-to-end test that verifies entire Parquet files are pruned during table scan when file-level column statistics allow the filter to eliminate all data in a file.
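As a rough illustration of the pruning decision such a test exercises, the sketch below checks an inclusive integer range filter against file-level min/max. `IntColumnStats`, `IntRangeFilter`, and `canSkipFile` are hypothetical names and are not taken from Velox; the sketch also assumes the filter rejects nulls, so null counts play no part in the decision.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical file-level bounds for one integer column.
struct IntColumnStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Hypothetical inclusive range filter: lower <= value <= upper.
struct IntRangeFilter {
  int64_t lower;
  int64_t upper;
};

// Returns true only when the statistics prove that no value in the file
// can pass the filter. Missing bounds mean the file must be read.
bool canSkipFile(const IntColumnStats& stats, const IntRangeFilter& filter) {
  assert(filter.lower <= filter.upper);
  if (!stats.min || !stats.max) {
    return false;
  }
  return *stats.max < filter.lower || *stats.min > filter.upper;
}
```

For a file whose column spans 10 to 20, a filter for values between 21 and 30 lets the scan skip the file, while a filter for 20 to 25 does not, because the value 20 may be present.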

Pull Request resolved: #16709

Reviewed By: apurva-meta, srsuryadev

Differential Revision: D96560581

Pulled By: pedroerp

fbshipit-source-id: d28d3792f63f2550f02212f36e025c7cf3f60087

Labels

CLA Signed, fb-exported, Merged, meta-exported