refactor: Add format-agnostic StatisticsBuilder to dwio/common#16693
refactor: Add format-agnostic StatisticsBuilder to dwio/common#16693mbasmanova wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
@mbasmanova has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95885404. |
55cecef to
f6afc29
Compare
PingLiuPing
left a comment
There was a problem hiding this comment.
I had a look at parquet writer statistic, and it has its own completely self-contained statistics system that is quite different from dwio::common::StatisticsBuilder.
For example:
Update API: one value at a time VS bulk Arrow array
Comparison: direct VS customised typed comparator
Output: ColumnStatistics snapshot VS EncodedStatistics (min/max serialized as byte strings)
Parquet also has page level statistics.
|
@PingLiuPing Thank you for looking. Is it the case that unlike DWRF, Parquet stores stats only per row group. It doesn't aggregate these and doesn't store file-level stats. What's your suggestion? |
| namespace { | ||
|
|
||
| template <typename T> | ||
| void addWithOverflowCheck(std::optional<T>& to, T value, uint64_t count) { |
There was a problem hiding this comment.
Seems addWithOverflowCheck, mergeWithOverflowCheck, mergeCount, mergeMin, mergeMax, isValidLength are copy pasted identically in both dwio/common/StatisticsBuilder.cpp and dwio/dwrf/writer/StatisticsBuilder.cpp.
| return result; | ||
| } | ||
|
|
||
| void IntegerStatisticsBuilder::addValues(int64_t value, uint64_t count) { |
There was a problem hiding this comment.
addValues, merge, reset, and init methods in DWRF typed builder (Boolean, Integer, Double, String, Binary) are copy paste of the corresponding dwio::common builders.
Parquet also stores page-level statistics. See: velox/velox/dwio/parquet/thrift/parquet.thrift Lines 210 to 213 in 4bea9ba and https://parquet.apache.org/docs/file-format/metadata/#page-header
Yes, Parquet aggregates statistics at the page level and row-group level, but not at the file level. I think this PR is still valuable. It decouples statistics computation from format-specific serialization and introduces format-agnostic statistics into dwio::common, so it now only depends on velox_dwio_common. This also allows the statistics computation logic to be shared across formats. For Parquet, since it is not straightforward to reuse |
f6afc29 to
cddfbc7
Compare
cddfbc7 to
7ef6446
Compare
…ookincubator#16693) Summary: The DWRF writer's StatisticsBuilder contains merge/addValues logic that is format-agnostic, but lives in the DWRF writer library, creating an undesirable cross-format dependency for consumers like Axiom and Parquet. This change extracts the format-agnostic parts into new dwio::common classes: - StatisticsBuilder base class with merge(), reset(), create(), createTree() - Typed builders: Boolean, Integer, Double, String, Binary - build() method that produces read-only ColumnStatistics snapshots The DWRF builders now extend the common base, adding only toProto() and proto-based build() for DWRF file format serialization. Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder, removing the DWRF writer dependency. Differential Revision: D95885404
7ef6446 to
443612b
Compare
Summary: X-link: facebookincubator/velox#16693 The DWRF writer's StatisticsBuilder contains merge/addValues logic that is format-agnostic, but lives in the DWRF writer library, creating an undesirable cross-format dependency for consumers like Axiom and Parquet. This change extracts the format-agnostic parts into new dwio::common classes: - StatisticsBuilder base class with merge(), reset(), create(), createTree() - Typed builders: Boolean, Integer, Double, String, Binary - build() method that produces read-only ColumnStatistics snapshots The DWRF builders now extend the common base, adding only toProto() and proto-based build() for DWRF file format serialization. Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder, removing the DWRF writer dependency. Differential Revision: D95885404
…ookincubator#16693) Summary: The DWRF writer's StatisticsBuilder contains merge/addValues logic that is format-agnostic, but lives in the DWRF writer library, creating an undesirable cross-format dependency for consumers like Axiom and Parquet. This change extracts the format-agnostic parts into new dwio::common classes: - StatisticsBuilder base class with merge(), reset(), create(), createTree() - Typed builders: Boolean, Integer, Double, String, Binary - build() method that produces read-only ColumnStatistics snapshots The DWRF builders now extend the common base, adding only toProto() and proto-based build() for DWRF file format serialization. Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder, removing the DWRF writer dependency. Differential Revision: D95885404
|
@PingLiuPing I fixed code duplication. A follow-up PR #16700 shows how the newly extracted class is used. Another follow-up Axiom PR facebookincubator/axiom#1031 shows e2e usage. |
| break; | ||
| } | ||
| default: | ||
| VELOX_FAIL("Not supported type: {}", kind); |
There was a problem hiding this comment.
nit: dwrf/writer/StatisticsBuilder.cpp uses DWIO_RAISE
| } | ||
| } | ||
|
|
||
| bool isValidLength(const std::optional<uint64_t>& length) { |
There was a problem hiding this comment.
nit: this one still duplicated.
There was a problem hiding this comment.
Yes, but it is trivial. Probably not worth the trouble of sharing.
Summary: X-link: facebookincubator/velox#16693 The DWRF writer's StatisticsBuilder contains merge/addValues logic that is format-agnostic, but lives in the DWRF writer library, creating an undesirable cross-format dependency for consumers like Axiom and Parquet. This change extracts the format-agnostic parts into new dwio::common classes: - StatisticsBuilder base class with merge(), reset(), create(), createTree() - Typed builders: Boolean, Integer, Double, String, Binary - build() method that produces read-only ColumnStatistics snapshots The DWRF builders now extend the common base, adding only toProto() and proto-based build() for DWRF file format serialization. Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder, removing the DWRF writer dependency. Reviewed By: Yuhta Differential Revision: D95885404 fbshipit-source-id: 28da10bb54506b3f305021758f74485291596b59
|
This pull request has been merged in 5699270. |
Summary:
The DWRF writer's StatisticsBuilder contains merge/addValues logic that is
format-agnostic, but lives in the DWRF writer library, creating an undesirable
cross-format dependency for consumers like Axiom and Parquet.
This change extracts the format-agnostic parts into new dwio::common classes:
The DWRF builders now extend the common base, adding only toProto() and
proto-based build() for DWRF file format serialization.
Axiom's StatisticsBuilderImpl now wraps dwio::common::StatisticsBuilder,
removing the DWRF writer dependency.
Differential Revision: D95885404