
refactor: Add format-agnostic StatisticsBuilder to dwio/common#16693

Closed
mbasmanova wants to merge 1 commit into facebookincubator:main from mbasmanova:export-D95885404

@mbasmanova
Contributor

Summary:
The DWRF writer's StatisticsBuilder contains merge/addValues logic that is
format-agnostic, but lives in the DWRF writer library, creating an undesirable
cross-format dependency for consumers like Axiom and Parquet.

This change extracts the format-agnostic parts into new dwio::common classes:

  • StatisticsBuilder base class with merge(), reset(), create(), createTree()
  • Typed builders: Boolean, Integer, Double, String, Binary
  • build() method that produces read-only ColumnStatistics snapshots

The DWRF builders now extend the common base, adding only toProto() and
proto-based build() for DWRF file format serialization.

Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder,
removing the DWRF writer dependency.
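
For illustration, a minimal sketch of the builder lifecycle described above (accumulate, merge, snapshot) might look like the following. It is not the code from this PR: `IntegerColumnStatistics`, the member names, and the `example()` helper are hypothetical simplifications, and the real classes track more state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical read-only snapshot; the real ColumnStatistics carries more.
struct IntegerColumnStatistics {
  uint64_t valueCount{0};
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Simplified stand-in for a format-agnostic integer builder.
class IntegerStatisticsBuilder {
 public:
  // Records 'count' occurrences of 'value'.
  void addValues(int64_t value, uint64_t count = 1) {
    if (count == 0) {
      return;
    }
    valueCount_ += count;
    min_ = min_ ? std::min(*min_, value) : value;
    max_ = max_ ? std::max(*max_, value) : value;
  }

  // Folds the state of 'other' into this builder.
  void merge(const IntegerStatisticsBuilder& other) {
    valueCount_ += other.valueCount_;
    if (other.min_) {
      min_ = min_ ? std::min(*min_, *other.min_) : *other.min_;
    }
    if (other.max_) {
      max_ = max_ ? std::max(*max_, *other.max_) : *other.max_;
    }
  }

  // Clears all accumulated state.
  void reset() {
    valueCount_ = 0;
    min_.reset();
    max_.reset();
  }

  // Returns a snapshot by value; the builder can keep accumulating.
  IntegerColumnStatistics build() const {
    return {valueCount_, min_, max_};
  }

 private:
  uint64_t valueCount_{0};
  std::optional<int64_t> min_;
  std::optional<int64_t> max_;
};

// Usage: two builders (for example, one per stripe) merged into one.
inline IntegerColumnStatistics example() {
  IntegerStatisticsBuilder first;
  IntegerStatisticsBuilder second;
  first.addValues(5);
  first.addValues(-2, 3);
  second.addValues(40);
  first.merge(second);
  return first.build();
}
```

Because `build()` returns a plain value, nothing in this sketch depends on a file format; serialization would be layered on top by each writer.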

Differential Revision: D95885404

@netlify

netlify bot commented Mar 10, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 443612b
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69b023feb7bb2e0008247a17

@meta-cla meta-cla bot added the CLA Signed label Mar 10, 2026
@meta-codesync

meta-codesync bot commented Mar 10, 2026

@mbasmanova has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95885404.

Collaborator

@PingLiuPing PingLiuPing left a comment


I had a look at the Parquet writer statistics. It has its own completely self-contained statistics system that is quite different from dwio::common::StatisticsBuilder.
For example (dwio::common first, Parquet second):
- Update API: one value at a time vs. bulk Arrow array
- Comparison: direct vs. customised typed comparator
- Output: ColumnStatistics snapshot vs. EncodedStatistics (min/max serialized as byte strings)

Parquet also has page-level statistics.
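
To make the contrast concrete, here is a rough sketch of the two styles for an int64 column. All names (`TypedStats`, `EncodedStats`, `update`, `updateBatch`, `encode`) are hypothetical, a `std::vector` stands in for an Arrow array, and the 8-byte little-endian layout is assumed to match Parquet's plain encoding of INT64; neither half is the actual Velox or Arrow API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Typed snapshot, in the spirit of a ColumnStatistics result (hypothetical).
struct TypedStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Byte-string form, in the spirit of EncodedStatistics (hypothetical).
struct EncodedStats {
  bool hasMinMax{false};
  std::string min;
  std::string max;
};

// One-value-at-a-time update.
void update(TypedStats& stats, int64_t value) {
  stats.min = stats.min ? std::min(*stats.min, value) : value;
  stats.max = stats.max ? std::max(*stats.max, value) : value;
}

// Bulk update: find the batch extremes once, then fold them in.
void updateBatch(TypedStats& stats, const std::vector<int64_t>& values) {
  if (values.empty()) {
    return;
  }
  auto [lo, hi] = std::minmax_element(values.begin(), values.end());
  update(stats, *lo);
  update(stats, *hi);
}

// Encodes an int64 as 8 little-endian bytes.
std::string encodeInt64(int64_t value) {
  std::string out(8, '\0');
  const auto bits = static_cast<uint64_t>(value);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<char>((bits >> (8 * i)) & 0xff);
  }
  return out;
}

// Converts the typed snapshot into the byte-string form.
EncodedStats encode(const TypedStats& stats) {
  EncodedStats out;
  if (stats.min && stats.max) {
    out.hasMinMax = true;
    out.min = encodeInt64(*stats.min);
    out.max = encodeInt64(*stats.max);
  }
  return out;
}
```

Both update paths arrive at the same min/max; the differences are in how values are fed in and in what form the result leaves the builder.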

@mbasmanova
Contributor Author

@PingLiuPing Thank you for looking. Is it the case that, unlike DWRF, Parquet stores stats only per row group? That is, it doesn't aggregate these and doesn't store file-level stats? What's your suggestion?

namespace {

template <typename T>
void addWithOverflowCheck(std::optional<T>& to, T value, uint64_t count) {
Collaborator


It seems addWithOverflowCheck, mergeWithOverflowCheck, mergeCount, mergeMin, mergeMax, and isValidLength are copy-pasted identically into both dwio/common/StatisticsBuilder.cpp and dwio/dwrf/writer/StatisticsBuilder.cpp.
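
For readers unfamiliar with these helpers, a sketch of what an overflow-checked accumulator with the signature quoted above could look like is given here. Only the signature comes from the diff; the body is an assumption. In particular, it assumes an empty optional means "sum unknown because of an earlier overflow", and it relies on the GCC/Clang `__builtin_*_overflow` intrinsics, which the actual code may not use.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>

// Adds value * count to 'to'. If either step overflows T, 'to' is cleared
// and stays cleared (assumed convention: nullopt == sum no longer known).
template <typename T>
void addWithOverflowCheck(std::optional<T>& to, T value, uint64_t count) {
  static_assert(std::is_integral_v<T>, "sketch covers integer sums only");
  if (!to.has_value()) {
    return;
  }
  T product;
  T sum;
  if (__builtin_mul_overflow(value, count, &product) ||
      __builtin_add_overflow(*to, product, &sum)) {
    to.reset();
    return;
  }
  to = sum;
}

// Usage: a running sum that becomes unknown once it overflows.
inline void overflowExample() {
  std::optional<int64_t> sum = 10;
  addWithOverflowCheck<int64_t>(sum, 5, 3);
  assert(sum.value() == 25);
  addWithOverflowCheck<int64_t>(sum, INT64_MAX, 2);
  assert(!sum.has_value());
}
```

A helper like this has no dependency on any file format, which is why keeping a single copy in the common library is natural.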

return result;
}

void IntegerStatisticsBuilder::addValues(int64_t value, uint64_t count) {
Collaborator


The addValues, merge, reset, and init methods in the DWRF typed builders (Boolean, Integer, Double, String, Binary) are copy-pasted from the corresponding dwio::common builders.
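
One way to avoid this kind of duplication is for the format-specific builder to inherit the accumulation logic and add only serialization. The sketch below is an illustration under stated assumptions, not the PR's code: the namespaces `common_sketch` and `dwrf_sketch` and the `IntegerStatisticsProto` struct (a stand-in for a generated protobuf message) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

namespace common_sketch {

// Format-agnostic accumulation lives here, once.
class IntegerStatisticsBuilder {
 public:
  virtual ~IntegerStatisticsBuilder() = default;

  void addValues(int64_t value, uint64_t count = 1) {
    if (count == 0) {
      return;
    }
    valueCount_ += count;
    min_ = min_ ? std::min(*min_, value) : value;
    max_ = max_ ? std::max(*max_, value) : value;
  }

 protected:
  uint64_t valueCount_{0};
  std::optional<int64_t> min_;
  std::optional<int64_t> max_;
};

} // namespace common_sketch

namespace dwrf_sketch {

// Hypothetical stand-in for a generated protobuf message.
struct IntegerStatisticsProto {
  uint64_t numberOfValues{0};
  bool hasMinimum{false};
  int64_t minimum{0};
  bool hasMaximum{false};
  int64_t maximum{0};
};

// The format-specific builder adds serialization and nothing else.
class IntegerStatisticsBuilder
    : public common_sketch::IntegerStatisticsBuilder {
 public:
  void toProto(IntegerStatisticsProto& proto) const {
    proto.numberOfValues = valueCount_;
    proto.hasMinimum = min_.has_value();
    proto.minimum = min_.value_or(0);
    proto.hasMaximum = max_.has_value();
    proto.maximum = max_.value_or(0);
  }
};

} // namespace dwrf_sketch
```

With this layering, `addValues` is written once and the derived class cannot drift from the base behaviour.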

@PingLiuPing
Collaborator

> Is it the case that unlike DWRF, Parquet stores stats only per row group?

Parquet also stores page-level statistics. See:

* Statistics per row group and per page
* All fields are optional.
*/
struct Statistics {

and
https://parquet.apache.org/docs/file-format/metadata/#page-header

> It doesn’t aggregate these and doesn’t store file-level stats.

Yes, Parquet aggregates statistics at the page level and row-group level, but not at the file level.
In #16062, I implemented file-level statistics for Parquet according to Iceberg requirements.

I think this PR is still valuable. It decouples statistics computation from format-specific serialization and introduces format-agnostic statistics into dwio::common, so it now only depends on velox_dwio_common. This also allows the statistics computation logic to be shared across formats.

For Parquet, since it is not straightforward to reuse dwio::common::StatisticsBuilder due to these differences, I think it might make sense to keep the existing Parquet writer statistics logic.
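
As a rough illustration of what deriving file-level statistics from row-group statistics involves, consider the sketch below. It is not the implementation from #16062: `ChunkStats` and `aggregateRowGroups` are hypothetical names, and the handling of missing values is one possible design choice (since all fields are optional, a row group without a min or max is treated here as making the file-level bound unknown).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-row-group statistics for an int64 column.
struct ChunkStats {
  uint64_t nullCount{0};
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Folds row-group statistics into a single file-level result.
ChunkStats aggregateRowGroups(const std::vector<ChunkStats>& rowGroups) {
  ChunkStats file;
  bool minKnown = !rowGroups.empty();
  bool maxKnown = !rowGroups.empty();
  for (const auto& group : rowGroups) {
    file.nullCount += group.nullCount;
    if (group.min) {
      file.min = file.min ? std::min(*file.min, *group.min) : *group.min;
    } else {
      minKnown = false;
    }
    if (group.max) {
      file.max = file.max ? std::max(*file.max, *group.max) : *group.max;
    } else {
      maxKnown = false;
    }
  }
  // A bound is only reported if every row group supplied one.
  if (!minKnown) {
    file.min.reset();
  }
  if (!maxKnown) {
    file.max.reset();
  }
  return file;
}
```

A less conservative variant could skip row groups that contain only nulls instead of discarding the bound.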

mbasmanova added a commit to mbasmanova/verax that referenced this pull request Mar 10, 2026
Summary:
X-link: facebookincubator/velox#16693

mbasmanova added a commit to mbasmanova/velox that referenced this pull request Mar 10, 2026
@mbasmanova
Contributor Author

@PingLiuPing I fixed the code duplication. A follow-up PR, #16700, shows how the newly extracted class is used. Another follow-up PR in Axiom, facebookincubator/axiom#1031, shows end-to-end usage.

Collaborator

@PingLiuPing PingLiuPing left a comment


Thanks.

break;
}
default:
VELOX_FAIL("Not supported type: {}", kind);
Collaborator


nit: dwrf/writer/StatisticsBuilder.cpp uses DWIO_RAISE
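
For context, the pattern under discussion is a factory that switches on the column type and fails on anything it does not support. A minimal sketch follows, with a plain exception standing in for either macro; the `Kind` enum, the builder structs, and `createBuilder` are hypothetical and much smaller than the real type list.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, reduced set of column kinds.
enum class Kind { kBoolean, kInteger, kUnsupported };

struct StatisticsBuilder {
  virtual ~StatisticsBuilder() = default;
  virtual std::string name() const = 0;
};

struct BooleanStatisticsBuilder : StatisticsBuilder {
  std::string name() const override {
    return "boolean";
  }
};

struct IntegerStatisticsBuilder : StatisticsBuilder {
  std::string name() const override {
    return "integer";
  }
};

// Returns the typed builder for 'kind'; throws for unsupported kinds.
std::unique_ptr<StatisticsBuilder> createBuilder(Kind kind) {
  switch (kind) {
    case Kind::kBoolean:
      return std::make_unique<BooleanStatisticsBuilder>();
    case Kind::kInteger:
      return std::make_unique<IntegerStatisticsBuilder>();
    default:
      // std::invalid_argument stands in for the project's failure macro.
      throw std::invalid_argument(
          "Not supported type: " + std::to_string(static_cast<int>(kind)));
  }
}
```

Whichever macro is used, the point of the nit is that both copies of the factory report the failure the same way.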

}
}

bool isValidLength(const std::optional<uint64_t>& length) {
Collaborator


nit: this one is still duplicated.

Contributor Author


Yes, but it is trivial. Probably not worth the trouble of sharing.

meta-codesync bot pushed a commit to facebookincubator/axiom that referenced this pull request Mar 10, 2026
Summary:
X-link: facebookincubator/velox#16693

Reviewed By: Yuhta

Differential Revision: D95885404

fbshipit-source-id: 28da10bb54506b3f305021758f74485291596b59
@meta-codesync

meta-codesync bot commented Mar 10, 2026

This pull request has been merged in 5699270.


Labels

CLA Signed, fb-exported, Merged, meta-exported


4 participants