
feat: Collect Iceberg stats#16062

Closed
PingLiuPing wants to merge 5 commits into facebookincubator:main from PingLiuPing:lp_iceberg_field_ids

Conversation

@PingLiuPing
Collaborator

@PingLiuPing PingLiuPing commented Jan 19, 2026

Implements Iceberg Parquet data file statistics collection.
The stats include:

  • numRecords: Total record count
  • columnsSizes: Compressed size per column
  • valueCounts: Value count per column
  • nullValueCounts: Null value count per column
  • nanValueCounts: NaN value count per column (placeholder for future support)
  • lowerBounds/upperBounds: Min/max bounds per column (base64-encoded); currently exclude array and map types and all their descendants.

These stats are represented by the struct IcebergDataFileStatistics.
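As a rough sketch, the container described above might look like the following (hypothetical member names and types; the actual struct in the PR may differ):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of IcebergDataFileStatistics, keyed by Iceberg
// field id; the real struct in the PR may differ in names and types.
struct IcebergDataFileStatistics {
  int64_t numRecords{0};
  std::map<int32_t, int64_t> columnsSizes;     // compressed bytes per column
  std::map<int32_t, int64_t> valueCounts;      // value count per column
  std::map<int32_t, int64_t> nullValueCounts;  // null count per column
  std::map<int32_t, int64_t> nanValueCounts;   // NaN count (placeholder)
  std::map<int32_t, std::string> lowerBounds;  // base64-encoded min per column
  std::map<int32_t, std::string> upperBounds;  // base64-encoded max per column
};
```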

Added IcebergParquetStatsCollector for collecting Iceberg Parquet files stats.
This collector takes Parquet file metadata as input, aggregates across all row groups, and computes global (per-file) min/max bounds. The file metadata is returned when calling Writer::close().

Once the stats are collected, the full stats are included in the commit message JSON (previously it only had recordCount).

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 19, 2026

@PingLiuPing
Collaborator Author

@mbasmanova @majetideepak @Yuhta Could you please take a look at this PR? Thank you very much.

Contributor

@mbasmanova mbasmanova left a comment

I don't see logic that uses the stats config. Can you add that?

VELOX_CHECK_EQ(field.children.size(), type->size());
config.children.reserve(field.children.size());

if (type->isRow()) {
Contributor

No need to switch on type. Just loop over [0, Type::size()) and use Type::childAt.
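The suggestion above can be illustrated with a minimal stand-in for the Velox Type API (the real facebook::velox::Type differs; this is only a sketch of the uniform-iteration idea):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal stand-in for a Velox Type, just enough to illustrate the
// suggestion; the real facebook::velox::Type API differs.
struct Type {
  std::vector<std::shared_ptr<Type>> children_;
  size_t size() const { return children_.size(); }
  const std::shared_ptr<Type>& childAt(size_t i) const { return children_[i]; }
};

// Instead of branching on isRow()/isMap()/isArray(), iterate uniformly
// over [0, size()) and recurse via childAt().
size_t countFields(const Type& type) {
  size_t count = 1;  // count this field itself
  for (size_t i = 0; i < type.size(); ++i) {
    count += countFields(*type.childAt(i));
  }
  return count;
}
```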

Collaborator Author

Thanks.

@@ -392,6 +445,28 @@ IcebergDataSink::createWriterOptions() const {
// (TimestampPrecision::kMicroseconds).
options->serdeParameters["parquet.writer.timestamp.unit"] = "6";
options->serdeParameters["parquet.writer.timestamp.timezone"] = "";

std::function<folly::dynamic(const IcebergFieldStatsConfig&)> toJson =
Contributor

Are we missing logic to serialize 'skipBounds'?

Can you move serde logic to a method on IcebergFieldStatsConfig?

Collaborator Author

Thanks. Deleted the serde logic; now passing ParquetFieldId directly to writer options.

/// Config for collecting Iceberg parquet field statistics.
/// Holds the Iceberg source field id and whether to skip bounds
/// collection for this field. For nested fields, it contains the child fields.
struct IcebergFieldStatsConfig {
Contributor

I assume this struct is specific to Parquet. If that's the case, we may want to add Parquet to the name.

perhaps, IcebergParquetWriterStatsConfig

Collaborator Author

Thanks.

// of field ID structures. Each structure has "fieldId" (int32) and optional
// "children" (array of nested structures).
// Example: [{"fieldId":1,"children":[{"fieldId":2}]},{"fieldId":3}].
static constexpr const char* kParquetSerdeFieldIds =
Contributor

why "Serde"?

Collaborator Author

Thanks, see #16062 (comment)

Contributor

Sorry, but I do not understand. Would you clarify some more? This config seems to control how writer collects stats. It doesn't appear to be used in any sort of serialization or deserialization. Am I missing something?

Collaborator Author

Oh, sorry for the confusion — let me clarify.

This config indeed controls how the writer collects stats and serves two purposes:

  1. It provides the Parquet field_id, which should be passed down to the Parquet writer and written into the Parquet data file.
  2. It provides skipBounds, which indicates whether min/max statistics should be collected for a given Parquet field. This is used after the Parquet data is written, during the conversion from Parquet stats to Iceberg stats (not implemented yet).

For the first part, ideally we would just set ParquetFieldId directly here. However, I’ve run into issues with this approach in the past (see:
#15509 (comment)).

To avoid explicitly depending on symbols from the Parquet writer, I pass this parameter to the Parquet writer via serdeParameters instead.

An alternative could be to change the CMake configuration so that connectors/hive/iceberg is only compiled when VELOX_ENABLE_PARQUET is enabled and then we can safely include parquet symbols in here.
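A minimal sketch of how a nested field-id tree could be flattened into the JSON shape quoted earlier in the review (the FieldId struct here is a hypothetical mirror of ParquetFieldId from velox/dwio/parquet/ParquetFieldId.h, for illustration only):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of ParquetFieldId, for illustration only.
struct FieldId {
  int32_t fieldId;
  std::vector<FieldId> children;
};

// Flatten the nested field-id tree into the JSON shape quoted in the
// review, e.g. [{"fieldId":1,"children":[{"fieldId":2}]},{"fieldId":3}].
std::string toJson(const std::vector<FieldId>& fields) {
  std::string out = "[";
  for (size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) {
      out += ",";
    }
    out += "{\"fieldId\":" + std::to_string(fields[i].fieldId);
    if (!fields[i].children.empty()) {
      out += ",\"children\":" + toJson(fields[i].children);
    }
    out += "}";
  }
  return out + "]";
}
```

A string like this could then be passed through serdeParameters without the Iceberg connector linking against Parquet writer symbols.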

@mbasmanova
Contributor

@yingsu00 Ying, do you have any thoughts on tight coupling between Iceberg connector and Parquet format? See #16062 (comment)

It seems non-ideal for the connector to depend directly on the particular storage format.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_field_ids branch 4 times, most recently from 98dd9ec to 056d456 Compare January 29, 2026 21:01
@PingLiuPing
Collaborator Author

PingLiuPing commented Jan 29, 2026

@mbasmanova Thank you for reviewing this PR.

I’ve now merged all Iceberg stats collection logic into this PR and updated the description accordingly. Initially I planned to split it into 3 PRs, with this one only covering stats configuration, but consolidating them makes the overall design easier to review.

Regarding the concern about tight coupling between the Iceberg connector and Parquet, at the moment, the Iceberg connector depends on Parquet in three places:
1. Unit tests need to explicitly register Parquet readers (Hive does the same).
2. Unit tests also need to explicitly register Parquet writers. And functionally the Parquet writer factory must be registered before writing Iceberg tables. This dependency was introduced recently as part of adding Iceberg write support.
3. Iceberg source code depends on velox/dwio/parquet/writer/arrow/Metadata.h, velox/dwio/parquet/writer/arrow/Statistics.h and velox/dwio/parquet/ParquetFieldId.h from parquet. These dependencies were also introduced with Iceberg write support.
ParquetFieldId.h is now a standalone header compiled as a CMake interface library, so this dependency should be relatively ok. The other two headers are only included by IcebergParquetStatsCollector.cpp, which I now conditionally compile only when VELOX_ENABLE_PARQUET is set.

I couldn’t find a more elegant way to fully decouple this at the moment. For Parquet field IDs in particular, since RowType does not carry equivalent IDs, we eventually need a mechanism to pass these IDs to the Parquet writer. The current approach passes them via writer options.

For IcebergParquetStatsCollector.cpp, I did implement an alternative version under velox/dwio/parquet/writer. However, since this logic is collecting Iceberg-specific statistics, it feels more appropriate to keep it in the Iceberg connector layer. And current design also leaves room for better extensibility if we support additional Iceberg-like table formats in the future.

While this PR is mainly focused on Iceberg stats collection, I plan to refactor iceberg/CMakeLists.txt in the near future to split the target into separate reader and writer components. Since Parquet is currently the only supported file format for writing, the writer target can then be conditionally compiled only when VELOX_ENABLE_PARQUET is enabled.
Also, the current CMake target name velox_hive_iceberg_splitreader is no longer accurate now that writes are supported.

@PingLiuPing
Collaborator Author

@yingsu00 Ying, do you have any thoughts on tight coupling between Iceberg connector and Parquet format? See #16062 (comment)

It seems non-ideal for the connector to depend directly on the particular storage format.

I think Ying's PR #14090 could help resolve the parquet reader and writer reference issue in tests.

@PingLiuPing PingLiuPing changed the title feat: Add Iceberg field stats config feat: Collect Iceberg stats Jan 29, 2026
@mbasmanova
Contributor

For Parquet field IDs in particular, since RowType does not carry equivalent IDs, we eventually need a mechanism to pass these IDs to the Parquet writer. The current approach passes them via writer options.

I believe Iceberg supports other file formats as well. How does it pass field IDs to these?

@PingLiuPing
Collaborator Author

I believe Iceberg supports other file formats as well. How does it pass field IDs to these?

Yes, you’re right, Iceberg supports Avro, Parquet, and ORC. According to the Iceberg spec, all supported file formats are required to store IDs (see the format-specific requirements in the spec: https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements).
In Velox, the Iceberg writer only supports the Parquet format.

There is also a current limitation on the Iceberg reader side. When reading an Iceberg table written by another engine, even if the Parquet schema contains field_id, the Velox Iceberg reader does not use field IDs for column matching. Instead, it follows the same logic as the Hive reader, matching columns by name or by position.

I had a quick look at the Iceberg Java implementation; there is explicit logic to translate Iceberg field IDs into the corresponding IDs of the underlying file format.

In our case, the Iceberg field IDs are computed and passed down from the coordinator node (Java) to the writer. These IDs are exported to Arrow and eventually written to the Parquet file schema.

@PingLiuPing
Collaborator Author

@mbasmanova Sorry, I realized I forgot to tag you earlier. Just flagging it here in case you have a chance to take a look and would love to hear your thoughts.

Contributor

@mbasmanova mbasmanova left a comment

@PingLiuPing I'm seeing a number of warnings. Any chance you could go over these and address?

(screenshot of compiler warnings, 2026-02-04)

feat: Collect Iceberg stats

I assume this refers to file-level stats. If so, perhaps, update PR title to clarify.

// @param field The Parquet field ID containing the field ID and child field
// IDs.
// @param type The Velox type corresponding to this field.
// @param skipBounds Whether to skip bounds collection for this field and its
Contributor

@PingLiuPing Can you remind me why do we need skipBounds functionality? Why don't we collect min/max values unconditionally?

Contributor

Is skipBounds == true equivalent to saying that a column is a map/array or a subfield of such?

Collaborator Author

Thanks @mbasmanova.

This is a bit tricky. Initially, I was collecting min/max unconditionally for all types, and the Iceberg spec doesn’t explicitly describe this behavior. During testing, I found that both Presto and Spark do not collect min/max statistics for array and map types.

Looking into the Iceberg source code, such a limitation is indeed enforced there. For reference:
https://github.com/apache/iceberg/blob/5970ddd9278a2baa060183a18f895de3608eab1f/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetrics.java#L185-L200

Collaborator Author

Is skipBounds == true equivalent to saying that a column is a map/array or a subfield of such?

Yes, based on the current implementation we can make that assumption.
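The propagation discussed here can be sketched with a toy type tree (mock types only; mirrors the discussion, not the actual Velox API):

```cpp
#include <cassert>
#include <vector>

// Toy type tree, just enough to show how the skip-bounds flag propagates
// to arrays, maps, and all their descendants.
struct Type {
  enum Kind { kPrimitive, kRow, kArray, kMap };
  Kind kind;
  std::vector<Type> children;
};

struct StatsConfig {
  bool skipBounds;
  std::vector<StatsConfig> children;
};

// Once an array or map is seen, every descendant inherits skipBounds.
StatsConfig makeConfig(const Type& type, bool skipBounds) {
  const bool current =
      skipBounds || type.kind == Type::kArray || type.kind == Type::kMap;
  StatsConfig config{current, {}};
  for (const auto& child : type.children) {
    config.children.push_back(makeConfig(child, current));
  }
  return config;
}
```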

Contributor

@PingLiuPing What's the downside of collecting these stats?

If we do need to skip collecting min/max for arrays, maps and their subfields, then let's rename 'skipBounds', remove it from public API and make it an implementation detail.

bool currentSkipBounds = skipBounds || type->isMap() || type->isArray();
IcebergParquetStatsConfig config(field.fieldId, currentSkipBounds);

if (!field.children.empty()) {
Contributor

Do we need this check? Can we execute the logic inside the 'if' body unconditionally?

Collaborator Author

Thanks, it can be removed.

auto icebergColumnHandle =
checkedPointerCast<const IcebergColumnHandle>(columnHandle);
icebergParquetStatsConfig_.push_back(toIcebergFieldStatsConfig(
icebergColumnHandle->field(), icebergColumnHandle->dataType(), false));
Contributor

@mbasmanova mbasmanova Feb 4, 2026

Would you annotate 'false' with the parameter name ?

/*param=*/false

Collaborator Author

Thanks.

("partitionSpecJson",
icebergInsertTableHandle->partitionSpec() ? icebergInsertTableHandle->partitionSpec()->specId : 0)
// Sort order evolution is not supported. Set default id to 1.
Contributor

What does 1 mean?

Collaborator Author

Thanks.

From iceberg spec https://iceberg.apache.org/spec/#sorting, a sort order is defined by a sort order id and a list of sort fields.
Order id 0 is reserved for the unsorted order. Will change this to 0.

/// Iterates through all row groups and columns to collect:
/// - Record count, split offsets, value counts, column sizes, null counts.
/// - Min/max bounds (base64-encoded) for columns not in skipBounds set.
/// @param metadata Pointer to shared_ptr<parquet::arrow::FileMetaData>.
Contributor

Why is it being passed as void*?

Collaborator Author

Thanks.
Refactored the code to use a template class.
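A minimal sketch of the per-file aggregation described above, using simplified stand-ins for Parquet row-group metadata (the real parquet::arrow::FileMetaData API differs):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for Parquet row-group metadata, to illustrate
// summing per-row-group column stats into file-level totals.
struct ColumnChunkMeta {
  int64_t compressedSize;
  int64_t numValues;
  int64_t nullCount;
};

struct RowGroupMeta {
  std::vector<ColumnChunkMeta> columns;
};

struct FileTotals {
  std::vector<int64_t> columnSizes;
  std::vector<int64_t> valueCounts;
  std::vector<int64_t> nullCounts;
};

// Walk every row group and accumulate file-level totals per column.
FileTotals aggregate(const std::vector<RowGroupMeta>& rowGroups,
                     size_t numColumns) {
  FileTotals totals{std::vector<int64_t>(numColumns, 0),
                    std::vector<int64_t>(numColumns, 0),
                    std::vector<int64_t>(numColumns, 0)};
  for (const auto& rowGroupMetadata : rowGroups) {
    for (size_t c = 0; c < numColumns; ++c) {
      totals.columnSizes[c] += rowGroupMetadata.columns[c].compressedSize;
      totals.valueCounts[c] += rowGroupMetadata.columns[c].numValues;
      totals.nullCounts[c] += rowGroupMetadata.columns[c].nullCount;
    }
  }
  return totals;
}
```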

dataFileStats->numRecords = fileMetadata->num_rows();
const auto numRowGroups = fileMetadata->num_row_groups();
for (auto i = 0; i < numRowGroups; ++i) {
const auto& rgm = fileMetadata->RowGroup(i);
Contributor

rgm -> rowGroupMetadata

// e.g., schema_->size(). It also contains the sub-fields when there are
// nested data types in table's schema.
std::unordered_set<int32_t> skipBoundsFields;
int32_t numFields = 0;
Contributor

Is numFields used to sanity check row group metadata? Is this needed?

Collaborator Author

Thanks. You are right, this can be removed.

Comment on lines +73 to +76
std::unordered_map<int32_t, std::shared_ptr<parquet::arrow::Statistics>>
globalMinStats;
std::unordered_map<int32_t, std::shared_ptr<parquet::arrow::Statistics>>
globalMaxStats;
Contributor

Do we need 2 maps? Can we use a single map with value being a pair of stats?

Collaborator Author

Thanks. Merged to a single map.
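The single-map suggestion can be sketched like this (plain strings with lexicographic comparison stand in for parquet::arrow::Statistics; names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// One map keyed by field id whose value carries both min and max,
// instead of two parallel maps. Strings with lexicographic comparison
// stand in for parquet::arrow::Statistics here.
using MinMax = std::pair<std::string, std::string>;
std::unordered_map<int32_t, MinMax> globalBounds;

void update(int32_t fieldId, const std::string& min, const std::string& max) {
  auto [it, inserted] = globalBounds.try_emplace(fieldId, min, max);
  if (!inserted) {
    if (min < it->second.first) {
      it->second.first = min;  // new global minimum
    }
    if (max > it->second.second) {
      it->second.second = max;  // new global maximum
    }
  }
}
```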

@PingLiuPing
Collaborator Author

@PingLiuPing Need to add that conditional too as it fails builds otherwise.

@kgpai Thanks, tested locally with VELOX_ENABLE_PARQUET=OFF.

@kgpai
Contributor

kgpai commented Mar 2, 2026

@PingLiuPing Update: Almost there, some builds/tests failing which hopefully I can get over the line today.

@PingLiuPing
Collaborator Author

@PingLiuPing Update: Almost there, some builds/tests failing which hopefully I can get over the line today.

@kgpai Thank you so much.

@kgpai
Contributor

kgpai commented Mar 3, 2026

@PingLiuPing HiveDataSinkTest.memoryReclaimAfterClose is seeing thread sanitizer issues with this change, which is what I am investigating. Will update you if some code changes are required to fix that.

@PingLiuPing
Collaborator Author

@kgpai Thank you for helping import this PR; let me know if you need anything from my side.

@kgpai
Contributor

kgpai commented Mar 4, 2026

@PingLiuPing Had to make some changes on the parquet side to get our gluten builds working. So exporting that to this PR.

@kgpai
Contributor

kgpai commented Mar 6, 2026

@PingLiuPing I had to make some changes to get things to work, but unfortunately can't export them.
Can you apply this patch (git apply writerconfig.patch) and update the PR? I will then reimport it.
writerconfig.patch

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_field_ids branch from a1b87a0 to 9ed076d Compare March 9, 2026 09:58
@PingLiuPing
Collaborator Author

@PingLiuPing I had to make some changes to get things to work, but unfortunately can't export them. Can you apply this patch (git apply writerconfig.patch) and update the PR? I will then reimport it. writerconfig.patch

@kgpai Thanks, I updated the code. I also fixed another code conflict related to writer rotation.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_field_ids branch from 9ed076d to 9c903cf Compare March 9, 2026 10:00
@PingLiuPing
Collaborator Author

@kgpai Thank you for importing again. I see the internal build and tests are still failing; is there anything I can do?

@kgpai
Contributor

kgpai commented Mar 11, 2026

No, a few more tests need fixing - hoping to resolve that by tomorrow.

@kgpai
Contributor

kgpai commented Mar 13, 2026

@PingLiuPing Thank you for bearing with me - almost there with this diff. If all goes well should land today.

@PingLiuPing
Collaborator Author

@PingLiuPing Thank you for bearing with me - almost there with this diff. If all goes well should land today.

@kgpai Thank you so much, this will make my ongoing work much easier.

@meta-codesync

meta-codesync bot commented Mar 14, 2026

@kgpai merged this pull request in e7dd656.

@PingLiuPing PingLiuPing deleted the lp_iceberg_field_ids branch March 21, 2026 16:27
nmahadevuni added a commit to nmahadevuni/velox that referenced this pull request Apr 1, 2026