feat: Collect Iceberg stats #16062
PingLiuPing wants to merge 5 commits into facebookincubator:main
Conversation
Force-pushed eff6947 to 1af47a8.
@mbasmanova @majetideepak @Yuhta Could you please take a look at this PR? Thank you very much.
mbasmanova left a comment
I don't see logic that uses the stats config. Can you add that?
VELOX_CHECK_EQ(field.children.size(), type->size());
config.children.reserve(field.children.size());

if (type->isRow()) {
No need to switch on type. Just loop over [0, Type::size()) and use Type::childAt.
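For illustration, the suggested type-agnostic traversal could look roughly like the sketch below. The `Type`, `FieldId`, and `StatsConfig` structs here are minimal hypothetical stand-ins (the real code uses `velox::Type` and the field-id struct from this PR); leaf types have `size() == 0`, so the loop body simply never runs for them and no switch on the type kind is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Minimal stand-in for velox::Type: only children, size(), childAt().
struct Type {
  std::vector<std::shared_ptr<Type>> children;
  size_t size() const { return children.size(); }
  const std::shared_ptr<Type>& childAt(size_t i) const { return children[i]; }
};

// Stand-ins for the Iceberg field-id and stats-config structs from the PR.
struct FieldId {
  int32_t fieldId;
  std::vector<FieldId> children;
};

struct StatsConfig {
  int32_t fieldId;
  std::vector<StatsConfig> children;
};

// Loop over [0, Type::size()) and recurse via Type::childAt instead of
// switching on the type kind.
StatsConfig toStatsConfig(const FieldId& field, const Type& type) {
  assert(field.children.size() == type.size()); // mirrors VELOX_CHECK_EQ
  StatsConfig config{field.fieldId, {}};
  config.children.reserve(field.children.size());
  for (size_t i = 0; i < type.size(); ++i) {
    config.children.push_back(
        toStatsConfig(field.children[i], *type.childAt(i)));
  }
  return config;
}
```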
@@ -392,6 +445,28 @@ IcebergDataSink::createWriterOptions() const {
  // (TimestampPrecision::kMicroseconds).
  options->serdeParameters["parquet.writer.timestamp.unit"] = "6";
  options->serdeParameters["parquet.writer.timestamp.timezone"] = "";
std::function<folly::dynamic(const IcebergFieldStatsConfig&)> toJson =
Are we missing logic to serialize 'skipBounds'?
Can you move serde logic to a method on IcebergFieldStatsConfig?
Thanks. Deleted the serde logic and am now directly passing ParquetFieldId to the writer options.
/// Config for collecting Iceberg parquet field statistics.
/// Holds the Iceberg source field id and whether to skip bounds
/// collection for this field. For nested fields, it contains child fields.
struct IcebergFieldStatsConfig {
I assume this struct is specific to Parquet. If that's the case, we may want to add Parquet to the name.
perhaps, IcebergParquetWriterStatsConfig
velox/dwio/parquet/writer/Writer.h
// of field ID structures. Each structure has "fieldId" (int32) and optional
// "children" (array of nested structures).
// Example: [{"fieldId":1,"children":[{"fieldId":2}]},{"fieldId":3}].
static constexpr const char* kParquetSerdeFieldIds =
Sorry, but I do not understand. Would you clarify some more? This config seems to control how writer collects stats. It doesn't appear to be used in any sort of serialization or deserialization. Am I missing something?
Oh, sorry for the confusion — let me clarify.
This config indeed controls how the writer collects stats and serves two purposes:
- It provides the Parquet field_id, which should be passed down to the Parquet writer and written into the Parquet data file.
- It provides skipBounds, which indicates whether min/max statistics should be collected for a given Parquet field. This is used after the Parquet data is written, during the conversion from Parquet stats to Iceberg stats (not implemented yet).
For the first part, ideally we would just set ParquetFieldId directly here. However, I’ve run into issues with this approach in the past (see:
#15509 (comment)).
To avoid explicitly depending on symbols from the Parquet writer, I pass this parameter to the Parquet writer via serdeParameters instead.
An alternative could be to change the CMake configuration so that connectors/hive/iceberg is only compiled when VELOX_ENABLE_PARQUET is enabled and then we can safely include parquet symbols in here.
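To illustrate the serdeParameters approach described above, a minimal sketch of serializing a field-id tree into the JSON shape documented for kParquetSerdeFieldIds. The `FieldId` struct and hand-rolled `toJson` here are hypothetical stand-ins for illustration only; the PR's earlier version used folly::dynamic for this.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the Iceberg field-id tree.
struct FieldId {
  int32_t fieldId;
  std::vector<FieldId> children;
};

// Serialize to the documented shape, e.g.
// [{"fieldId":1,"children":[{"fieldId":2}]},{"fieldId":3}], so the string
// can be passed to the Parquet writer via serdeParameters without the
// connector linking against Parquet writer symbols.
std::string toJson(const std::vector<FieldId>& fields) {
  std::string out = "[";
  for (size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) {
      out += ",";
    }
    out += "{\"fieldId\":" + std::to_string(fields[i].fieldId);
    if (!fields[i].children.empty()) {
      // Nested fields recurse into the same array shape.
      out += ",\"children\":" + toJson(fields[i].children);
    }
    out += "}";
  }
  return out + "]";
}
```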
@yingsu00 Ying, do you have any thoughts on tight coupling between the Iceberg connector and the Parquet format? See #16062 (comment). It seems non-ideal for the connector to depend directly on a particular storage format.
Force-pushed 98dd9ec to 056d456.
@mbasmanova Thank you for reviewing this PR. I’ve now merged all Iceberg stats collection logic into this PR and updated the description accordingly. Initially I planned to split it into 3 PRs, with this one only covering stats configuration, but consolidating them makes the overall design easier to review. Regarding the concern about tight coupling between the Iceberg connector and Parquet: at the moment, the Iceberg connector depends on Parquet in three places. I couldn’t find a more elegant way to fully decouple this at the moment. For Parquet field IDs in particular, since RowType does not carry equivalent IDs, we eventually need a mechanism to pass these IDs to the Parquet writer; the current approach passes them via writer options. While this PR is mainly focused on Iceberg stats collection, I plan to refactor iceberg/CMakeLists.txt in the near future to split the target into separate reader and writer components. Since Parquet is currently the only supported file format for writing, the writer target can then be conditionally compiled only when VELOX_ENABLE_PARQUET is enabled.
I think Ying's PR #14090 could help resolve the parquet reader and writer reference issue from test. |
Force-pushed 056d456 to 3c05a02.
I believe Iceberg supports other file formats as well. How does it pass field IDs to these? |
Yes, you’re right, Iceberg supports Avro, Parquet, and ORC. According to the Iceberg spec, all supported file formats are required to store field IDs (see the format-specific requirements in the spec: https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements). There is also a current limitation on the Iceberg reader side when reading an Iceberg table written by another engine, even if the Parquet schema contains field IDs. I had a quick look at the Iceberg Java implementation; there is explicit logic to translate Iceberg field IDs into the corresponding IDs of the underlying file format.
In our case, the Iceberg field IDs are computed and passed down from the coordinator node (Java) to the writer. These IDs are exported to Arrow and eventually written to the Parquet file schema.
@mbasmanova Sorry, I realized I forgot to tag you earlier. Just flagging it here in case you have a chance to take a look and would love to hear your thoughts. |
mbasmanova left a comment
@PingLiuPing I'm seeing a number of warnings. Any chance you could go over these and address?
feat: Collect Iceberg stats
I assume this refers to file-level stats. If so, perhaps, update PR title to clarify.
// @param field The Parquet field ID containing the field ID and child field
// IDs.
// @param type The Velox type corresponding to this field.
// @param skipBounds Whether to skip bounds collection for this field and its
@PingLiuPing Can you remind me why do we need skipBounds functionality? Why don't we collect min/max values unconditionally?
Is skipBounds == true equivalent to saying that a column is a map/array or a subfield of such?
Thanks @mbasmanova.
This is a bit tricky. Initially, I was collecting min/max unconditionally for all types, and the Iceberg spec doesn’t explicitly describe this behavior. During testing, I found that both Presto and Spark do not collect min/max statistics for array and map types.
Looking into the Iceberg source code, there is indeed such a limitation enforced there. For reference:
https://github.com/apache/iceberg/blob/5970ddd9278a2baa060183a18f895de3608eab1f/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetrics.java#L185-L200
Is skipBounds == true equivalent to saying that a column is a map/array or a subfield of such?

Yes, based on the current implementation we can make that assumption.
@PingLiuPing What's the downside of collecting these stats?
If we do need to skip collecting min/max for arrays, maps and their subfields, then let's rename 'skipBounds', remove it from public API and make it an implementation detail.
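As a sketch of that implementation detail, the skip decision could be propagated top-down over the type tree roughly as below. The types and pre-order field-id assignment here are simplified stand-ins for illustration (the real code works on velox::Type and the Iceberg field-id struct), but the rule matches what is discussed above: arrays, maps, and everything nested under them skip min/max bounds.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Simplified stand-in for the Velox type tree; only the kind matters here.
enum class Kind { kPrimitive, kRow, kArray, kMap };
struct Type {
  Kind kind;
  std::vector<Type> children;
};

// Collect the field ids whose min/max bounds should be skipped: arrays,
// maps, and every field nested under them. Field ids are assumed to be
// assigned in pre-order for brevity.
void collectSkipBounds(
    const Type& type,
    int32_t& nextId,
    bool skip,
    std::unordered_set<int32_t>& skipIds) {
  const int32_t id = nextId++;
  // Once a parent is skipped, all of its subfields are skipped too.
  const bool currentSkip =
      skip || type.kind == Kind::kArray || type.kind == Kind::kMap;
  if (currentSkip) {
    skipIds.insert(id);
  }
  for (const auto& child : type.children) {
    collectSkipBounds(child, nextId, currentSkip, skipIds);
  }
}
```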
bool currentSkipBounds = skipBounds || type->isMap() || type->isArray();
IcebergParquetStatsConfig config(field.fieldId, currentSkipBounds);

if (!field.children.empty()) {
Do we need this check? Can we execute the logic inside the 'if' body unconditionally?
Thanks, it can be removed.
auto icebergColumnHandle =
    checkedPointerCast<const IcebergColumnHandle>(columnHandle);
icebergParquetStatsConfig_.push_back(toIcebergFieldStatsConfig(
    icebergColumnHandle->field(), icebergColumnHandle->dataType(), false));
Would you annotate 'false' with the parameter name ?
/*param=*/false
("partitionSpecJson",
    icebergInsertTableHandle->partitionSpec() ? icebergInsertTableHandle->partitionSpec()->specId : 0)
// Sort order evolution is not supported. Set default id to 1.
Thanks.
From iceberg spec https://iceberg.apache.org/spec/#sorting, a sort order is defined by a sort order id and a list of sort fields.
Order id 0 is reserved for the unsorted order. Will change this to 0.
/// Iterates through all row groups and columns to collect:
/// - Record count, split offsets, value counts, column sizes, null counts.
/// - Min/max bounds (base64-encoded) for columns not in skipBounds set.
/// @param metadata Pointer to shared_ptr<parquet::arrow::FileMetaData>.
Why is it being passed as void*?
Thanks. Refactored the code to use a template class.
dataFileStats->numRecords = fileMetadata->num_rows();
const auto numRowGroups = fileMetadata->num_row_groups();
for (auto i = 0; i < numRowGroups; ++i) {
  const auto& rgm = fileMetadata->RowGroup(i);
// e.g., schema_->size(). It also contains the sub-fields when there are
// nested data types in table's schema.
std::unordered_set<int32_t> skipBoundsFields;
int32_t numFields = 0;
Is numFields used to sanity check row group metadata? Is this needed?
Thanks. You are right, this can be removed.
std::unordered_map<int32_t, std::shared_ptr<parquet::arrow::Statistics>>
    globalMinStats;
std::unordered_map<int32_t, std::shared_ptr<parquet::arrow::Statistics>>
    globalMaxStats;
Do we need 2 maps? Can we use a single map with value being a pair of stats?
Thanks. Merged into a single map.
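A minimal sketch of that single-map shape, with plain integers standing in for the shared_ptr<parquet::arrow::Statistics> values the real code holds; the merge function and struct names are illustrative only.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-row-group column stats; the real code carries
// std::shared_ptr<parquet::arrow::Statistics> rather than raw ints.
struct ColumnStats {
  int32_t columnId;
  int64_t min;
  int64_t max;
};

// One map keyed by column id whose value is a (min, max) pair, replacing
// the two parallel globalMinStats/globalMaxStats maps.
using MinMaxMap = std::unordered_map<int32_t, std::pair<int64_t, int64_t>>;

void mergeRowGroup(MinMaxMap& global, const std::vector<ColumnStats>& rg) {
  for (const auto& s : rg) {
    auto it = global.find(s.columnId);
    if (it == global.end()) {
      // First row group seen for this column seeds both bounds at once.
      global.emplace(s.columnId, std::make_pair(s.min, s.max));
    } else {
      it->second.first = std::min(it->second.first, s.min);
      it->second.second = std::max(it->second.second, s.max);
    }
  }
}
```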
Force-pushed 124e6a3 to a1b87a0.
@kgpai Thanks, tested locally with VELOX_ENABLE_PARQUET=OFF.
@PingLiuPing Update: almost there; some builds/tests are failing, which I hope to get over the line today.
@kgpai Thank you so much.
@PingLiuPing HiveDataSinkTest.memoryReclaimAfterClose is seeing thread sanitizer issues with this change, which is what I am investigating. Will update you if some code changes are required to fix that.
@kgpai Thank you for helping import this PR; let me know if you need anything from my side.
@PingLiuPing Had to make some changes on the parquet side to get our gluten builds working. So exporting that to this PR.
@PingLiuPing I had to make some changes to get things to work, but unfortunately can't export them.
Force-pushed a1b87a0 to 9ed076d.
@kgpai Thanks, I updated the code. I also fixed another code conflict related to writer rotation.
Force-pushed 9ed076d to 9c903cf.
@kgpai Thank you for importing again. I see the internal build and tests are still failing; is there anything I can do?
No, a few more tests need fixing; hoping to resolve that by tomorrow.
@PingLiuPing Thank you for bearing with me; almost there with this diff. If all goes well, it should land today.
@kgpai Thank you so much, this will make my ongoing work much easier. |
This reverts commit e7dd656.
Implements Iceberg Parquet data file statistics collection.
The stats include record count, split offsets, value counts, column sizes, null counts, and min/max bounds. These stats are represented by the struct IcebergDataFileStatistics.
Added IcebergParquetStatsCollector for collecting Iceberg Parquet file stats.
This collector takes Parquet file metadata as input, aggregates across all row groups, and computes global (per-file) min/max bounds. The file metadata is returned when calling Writer::close().
Once the stats are collected, the full stats are included in the commit message JSON (previously only recordCount was included).
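The per-file aggregation described above can be sketched as follows. The `ColumnChunk`/`RowGroup` structs are hypothetical stand-ins for what the real collector reads from parquet::arrow::FileMetaData (via RowGroup(i) and its column chunk metadata); min/max handling is omitted here since it depends on the skip-bounds logic discussed earlier.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical shapes standing in for Parquet file metadata.
struct ColumnChunk {
  int32_t columnId;
  int64_t numValues;
  int64_t nullCount;
  int64_t totalCompressedSize;
};
struct RowGroup {
  int64_t numRows;
  std::vector<ColumnChunk> columns;
};

// Illustrative stand-in for IcebergDataFileStatistics.
struct DataFileStats {
  int64_t numRecords = 0;
  std::unordered_map<int32_t, int64_t> valueCounts;
  std::unordered_map<int32_t, int64_t> nullCounts;
  std::unordered_map<int32_t, int64_t> columnSizes;
};

// Aggregate file-level stats across all row groups, as the collector does
// after Writer::close() hands back the file metadata.
DataFileStats collect(const std::vector<RowGroup>& rowGroups) {
  DataFileStats stats;
  for (const auto& rg : rowGroups) {
    stats.numRecords += rg.numRows;
    for (const auto& c : rg.columns) {
      stats.valueCounts[c.columnId] += c.numValues;
      stats.nullCounts[c.columnId] += c.nullCount;
      stats.columnSizes[c.columnId] += c.totalCompressedSize;
    }
  }
  return stats;
}
```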