feat: Support iceberg partition transform #15035
Conversation
@mbasmanova Could you help take a look at this PR? Thank you.
rui-mo left a comment:
Thanks!
std::vector<HiveColumnHandlePtr> inputColumns,
LocationHandlePtr locationHandle,
dwio::common::FileFormat tableStorageFormat,
IcebergPartitionSpecPtr partitionSpec = nullptr,
Can we add a test to confirm that this spec controls the write directory as intended? It looks like the tests only check that the field parameters can be correctly set for now.
Thanks for the comment.
This PR only adds the spec; it is not actually used yet, so we cannot add a test to cover this scenario now.
In this PR https://github.com/facebookincubator/velox/pull/13874/files#diff-493592c0098c71afd3f75c22e49df793e1d32631dc51312d853a2fd4d139a9ec, I have added lots of test cases to cover the partition folder names, and I will split them into smaller PRs one by one.
@rui-mo Does this make sense to you? Thanks.
I have added lots of test cases to cover the partition folder names.
Sounds reasonable to me to add the tests in follow-up PRs, thanks.
Force-pushed 2d7689f to e465df4.
rui-mo left a comment:
Thanks. Looks good overall.
Force-pushed 255954f to 97b4512.
@mbasmanova Could you help take a look? @rui-mo has reviewed and approved the PR.
std::optional<common::CompressionKind> compressionKind = {},
const std::unordered_map<std::string, std::string>& serdeParameters = {});

IcebergPartitionSpecPtr partitionSpec() const {
use const& for return value
@PingLiuPing I'm looking at the first commit and somehow I don't see these old comments addressed. Would you double check?
@mbasmanova Oh, I made a mistake: I committed the fix in the second commit. I have now combined the two commits into one.
/// - kIdentity: Use the source value as-is (no transformation).
/// - kHour/kDay/kMonth/kYear: Extract time components from timestamps/dates.
/// - kBucket: Hash the value into N buckets for even distribution.
/// - kTruncate: Truncate strings/numbers to a specified width.
Would you move these comments next to enum values?
enum class TransformType {
/// Use the source value as-is (no transformation)
kIdentity,
/// Extract time components from timestamps/dates.
kHour,
kDay,
kMonth,
kYear,
/// Hash the value into N buckets for even distribution.
kBucket,
...
Thanks.
/// represented by a named Field.
///
/// The partition spec defines:
/// - Unique partition spec ID for versioning and evolution.
drop "partition spec" : Unique ID for ...
Thanks.
///
/// The partition spec defines:
/// - Unique partition spec ID for versioning and evolution.
/// - Which source columns to use for partitioning.
How is this specified? Does Field.name identify the source column by name?
Should we add any sanity checks to verify that fields is not empty and doesn't contain duplicates?
@mbasmanova Thanks for the comment.
The partition field is defined by the upstream engine, such as Presto or Spark.
For example, in Presto:
create table month_t1 (c_int int, c_date date, c_bigint bigint) with (format='PARQUET', partitioning=ARRAY['month(c_date)']);
The above DDL defines a month partition transform whose source column is c_date.
Such information is processed by the Iceberg Java library, e.g. checking the column type, column duplication, etc., and it is eventually passed into Velox. Ideally, I would use the field ID to identify a source column, which is what the Iceberg spec requires, but Velox RowType only supports matching fields by name, so I use the field name here.
Should we add any sanity checks to verify that fields is not empty and doesn't contain duplicates?
I think it is not necessary; such scenarios are already handled by the upstream Iceberg library.
But Velox RowType only supports matching fields by name, so I use the field name here.
Got it. Would be nice to clarify in a comment.
I think it is not necessary; such scenarios are already handled by the upstream Iceberg library.
Velox is a separate component and as such it needs to validate external inputs. It cannot simply trust that the caller provides valid input. Without proper validation it is hard to troubleshoot issues as it is not clear whether the bug is in the library (Velox) or the application (Spark).
Thanks, makes sense. I will add some logic to validate the inputs.
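A minimal sketch of such validation, assuming the struct shape discussed in this thread (fields carrying a name and a transform); the helper name and member names here are assumptions, not the final API. VELOX_USER_CHECK and fmt::format follow the usual Velox conventions:

// Assumed headers: <unordered_set>, fmt/format.h,
// velox/common/base/Exceptions.h.
void validatePartitionSpec(const IcebergPartitionSpec& spec) {
  // External input must be checked explicitly; do not trust the caller.
  VELOX_USER_CHECK(
      !spec.fields.empty(), "Partition spec must contain at least one field.");
  std::unordered_set<std::string> seen;
  for (const auto& field : spec.fields) {
    // Iceberg allows the same source column under different transforms,
    // so reject duplicate (column, transform) pairs rather than names.
    const auto key = fmt::format(
        "{}/{}", field.name, TransformTypeName::toName(field.transform));
    VELOX_USER_CHECK(
        seen.insert(key).second, "Duplicate partition field: {}", key);
  }
}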
columnHandles,
locationHandle,
fileFormat_,
nullptr,
Would you add argument name as a comment to improve readability?
/*foo=*/nullptr
Thanks, will do.
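Applied to the call quoted above, using the parameter name from the constructor signature earlier in this thread, this would read:

columnHandles,
locationHandle,
fileFormat_,
/*partitionSpec=*/nullptr,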
namespace facebook::velox::connector::hive::iceberg {
namespace {

class PartitionSpecTest : public ::testing::Test {};
Seems like this class is not needed. Just use TEST instead of TEST_F.
Thanks, you are right.
EXPECT_EQ("trunc", TransformTypeName::toName(TransformType::kTruncate));
}

TEST_F(PartitionSpecTest, basic) {
What are we testing here? Seems redundant.
Thanks, will remove this case.
EXPECT_FALSE(spec.fields[1].parameter.has_value());
}

TEST_F(PartitionSpecTest, withParameters) {
ditto
Thanks. I wanted to test the two special transforms, bucket and truncate, which take an additional parameter.
I think I will remove all these tests when I combine this PR with the actual partition transform.
It is hard to see what the value of this particular test is. It seems to be verifying that the IcebergPartitionSpec::Field struct was initialized properly. But there is no logic to test... Am I missing something?
Yes, you are correct. This test file was added when I split the PR into smaller pieces; it did not exist when I submitted #13874. I will delete this test file when I combine the actual partition logic into this PR. It does not make sense to test the struct initialization.
@PingLiuPing It looks like this PR just adds a new struct but doesn't use it. Is this correct? Might be better to combine this PR with the one that introduces the functionality that uses this struct.
@mbasmanova Yes, in this PR I only add the basic data structures (PartitionSpec) that will be used later. The partition transform operations will be a huge PR, and I'm thinking of splitting it into several standalone PRs.
Let's include this spec in the first PR.
Force-pushed 58c82c3 to d2c3edf.
@mbasmanova Thanks for your review comments. I have integrated the partition spec with Iceberg.
Force-pushed d2c3edf to 67dba5c.
/// @param serdeParameters Additional serialization/deserialization parameters
/// for the file format.
IcebergInsertTableHandle(
    std::vector<HiveColumnHandlePtr> inputColumns,
IcebergInsertTableHandle takes HiveColumnHandlePtr as columns. HiveColumnHandle identifies which column is a partition key and which is not. It looks like in Iceberg this definition is not sufficient. Should we stop using HiveColumnHandle and introduce a separate IcebergColumnHandle? Otherwise, I assume we would have to document (and check) that kPartitionKey should never be used with Iceberg table columns.
class HiveColumnHandle : public ColumnHandle {
public:
/// NOTE: Make sure to update the mapping in columnTypeNames() when modifying
/// this.
enum class ColumnType {
kPartitionKey,
kRegular,
kSynthesized,
/// A zero-based row number of type BIGINT auto-generated by the connector.
/// Rows numbers are unique within a single file only.
kRowIndex,
kRowId,
};
@mbasmanova Thanks for the insightful comment.
So far, HiveColumnHandlePtr is still OK; we only need to get which columns are partition-key columns from it. But we will need to add IcebergInsertTableHandle in a following PR (when collecting Iceberg data file stats).
// Convert a partition value to its string representation for use in
// partition directory path. The format follows the Iceberg specification
// for partition path encoding.
class IcebergPartitionUtil : public HivePartitionUtil {
perhaps, IcebergPartitionUtil -> IcebergPartitionPath
Thanks.
Force-pushed f6b3eb2 to 96a4c26.
@mbasmanova Thank you for taking the time to review this PR. I've addressed your comments and rebased the branch.
@PingLiuPing Thank you for iterating. GitHub is slow on this PR. Wondering if this is because there are lots of changes or comments or both. Wondering if there is a way to extract smaller PRs from it.

I'm still trying to understand the big picture. I'm reading this doc: https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning Let me know if there are better resources.

My understanding at this point is that the partitioning configuration consists of one or more transforms. Each transform takes a single column as input. Different transforms can use the same or different columns as inputs. Each transform produces a partitioning key (hidden column). The values of all partition keys are used to generate the partition directory.

The inputs to transforms are columns in the table. A CTAS or INSERT INTO query produces a set of output columns, which are mapped to columns in the table schema. The names of query output columns (input to PartitionIdGenerator and the TableWrite operator) do not necessarily match the names in the table schema. Hence, InsertTableHandle (or similar) must provide the mappings: index of input column -> schema column.

Is this about right?
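For a concrete illustration of that flow: with a spec like month(c_date), a row whose c_date is 2024-03-15 produces the hidden partition value 2024-03, and per Iceberg's partition-path encoding the file would land under a directory roughly like c_date_month=2024-03/. The derived field name and value format here come from the Iceberg spec and its Java library defaults, not from this PR.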
if (field.parameter.has_value()) {
  exprArgs.emplace_back(
      std::make_shared<core::ConstantTypedExpr>(
          INTEGER(), variant(field.parameter.value())));
variant -> Variant
Thanks.
    exprSet_.get(), rows, *input, results);

// Verify that all expressions preserved the vector size.
for (auto i = 0; i < numExpressions; i++) {
This loop seems redundant.
Thanks.
/// Provides static methods to build expression trees from Iceberg partition
/// specification. These expressions can be compiled and evaluated using
/// Velox's expression evaluation framework.
class TransformExprBuilder {
Let's move this to its own file. Would it make sense to extract this class into a separate PR?
Thanks.
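For context, a rough sketch of what such a builder might emit for a bucket transform, using Velox typed expressions as in the diff above. The registered function name "bucket" and the helper's shape are assumptions for illustration, not the PR's actual code:

// Assumed headers: velox/core/Expressions.h, velox/type/Variant.h.
// Builds bucket(numBuckets, column) as a call expression that can be
// compiled into an exec::ExprSet and evaluated like any other expression.
core::TypedExprPtr makeBucketExpr(
    const TypePtr& sourceType,
    const std::string& columnName,
    int32_t numBuckets) {
  std::vector<core::TypedExprPtr> args;
  args.emplace_back(std::make_shared<core::ConstantTypedExpr>(
      INTEGER(), Variant(numBuckets)));
  args.emplace_back(std::make_shared<core::FieldAccessTypedExpr>(
      sourceType, columnName));
  // Bucket transforms produce a 32-bit bucket ordinal.
  return std::make_shared<core::CallTypedExpr>(
      INTEGER(), std::move(args), "bucket");
}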
namespace facebook::velox::connector::hive::iceberg {

/// Utility class for converting Iceberg partition specification to Velox
Utility class for converting
replace with "Converts ..."; aim to start comments with active verbs
There is a lot of repetition in the comments. Please, revise.
Thanks. Revised.
///
/// The partition spec defines:
/// - Unique ID for versioning and evolution.
/// - Which source columns to use for partitioning (identified by field name,
source columns
The 'source' here actually means column in the table schema... I was confused about this for quite some time.
Refined a bit.
const auto& name = field.name;
if (!isValidPartitionType(type)) {
  VELOX_USER_FAIL(
      "Type '{}' is not supported as a partition column.", type->name());
Make sure to put runtime information at the end of the messages. This makes it easier to grep for errors that appear in prod. Please, update throughout.
"Type is not supported as a partition column: {}"
Thanks, will check all patterns.
    const override;

 private:
  TransformType transformType_;
const
Thanks.
/// @return RowVector with one column per transformed column, columns in same
/// order as IcebergPartitionSpec::fields. Returns nullptr if no partitions
/// have been created.
RowVectorPtr partitionKeys() const {
use const & for return type
Thanks.
template <TypeKind Kind>
std::pair<std::string, std::string> makePartitionKeyValueString(
    const IcebergPartitionUtilPtr& formatter,
    const BaseVector* partitionVector,
change raw pointer to const &
Thanks.
  TransformType transformType_;
};

using IcebergPartitionUtilPtr = std::shared_ptr<const IcebergPartitionPath>;
IcebergPartitionUtilPtr -> IcebergPartitionPathPtr
Thanks.
/// partition column.
///
/// @return RowVector with one column per transformed column, columns in same
/// order as IcebergPartitionSpec::fields. Returns nullptr if no partitions
Returns nullptr if no partitions
Is this accurate? The implementation suggests that return value is never nullptr.
Oh, you are right. Removed.
rowType_ = ROW(std::move(partitionKeyNames), std::move(partitionKeyTypes));

partitionValues_ = BaseVector::create<RowVector>(
We have a getter that returns partitionValues_ as is. It seems this vector may have more rows than are actually valid / populated. Is this intentional?
Thanks for the comment. Yes, it is intentional, since the maximum number of rows is maxPartitions_.
I refined this to resize dynamically.
@mbasmanova Yes, your understanding is perfect. Only one small thing to highlight: not all transforms can be applied to the same column. There are some subtle restrictions on this; see the comment in PartitionSpec.h.
When implementing the partition transforms, I primarily referenced the Iceberg spec directly: https://iceberg.apache.org/spec/#partitioning
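For reference, the source-type restrictions in that spec are roughly as follows (see the spec for the authoritative table):
- identity: any primitive type.
- bucket[N]: int, long, decimal, date, time, timestamp, string, uuid, fixed, binary.
- truncate[W]: int, long, decimal, string, binary.
- year, month, day: date and timestamp.
- hour: timestamp only.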
Yes, I think so. It is just hard to make each PR integrate with the existing code and provide new functionality on its own. I think I can submit a few separate PRs as preliminary groundwork and then another PR that ties them together and integrates with the existing code. Would that be OK?
mbasmanova left a comment:
@PingLiuPing Do you think it is possible to extract PartitionSpec, TransformEvaluator and IcebergPartitionPath into a separate PR?
That would be great. Thank you for being so accommodating. At this point, I feel that most if not all major questions have been resolved, and I expect we can proceed with final reviews quickly. Having smaller PRs would help quite a bit.
I really appreciate your time and feedback. That's very encouraging; I'll split the PR and submit the first one shortly.
Force-pushed 96a4c26 to 102596e.
This is a continuation of the previous PR #14723.
Implements comprehensive partition transform functionality for Iceberg tables, enabling data to be partitioned using various transform functions, including identity, bucket, truncate, and temporal transforms (year, month, day, hour).
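As an illustration, constructing such a spec for the Presto DDL example earlier in the thread might look roughly like this. The member names (specId, fields, name, transform, parameter) follow the struct discussed in review and are assumptions, not the final API:

// month(c_date) plus bucket(16, c_int).
IcebergPartitionSpec spec;
spec.specId = 0; // Unique ID for versioning and evolution.
spec.fields.push_back(
    {/*name=*/"c_date", TransformType::kMonth, /*parameter=*/std::nullopt});
spec.fields.push_back(
    {/*name=*/"c_int", TransformType::kBucket, /*parameter=*/16});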