@PingLiuPing
Collaborator

@PingLiuPing PingLiuPing commented Nov 7, 2025

Introduce infrastructure for evaluating Iceberg partition transforms.

  • TransformExprBuilder converts Iceberg partition specifications into Velox expressions.
  • TransformEvaluator evaluates multiple transform expressions in a single pass using a compiled ExprSet.
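
To illustrate the builder/evaluator split described above, here is a minimal toy sketch in plain C++. It does not use Velox: `ToyField`, `ToyExpr`, `toToyExpressions`, and `evaluateAll` are hypothetical names, the evaluation is row-wise rather than vectorized, and the bucket transform uses a simple modulo as a placeholder instead of Iceberg's Murmur3-based hash. Only the integer truncate arithmetic is meant to match Iceberg's definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the Velox or Iceberg types.
enum class ToyTransform { kIdentity, kBucket, kTruncate };

struct ToyField {
  std::size_t column;      // Index of the source column in each row.
  ToyTransform transform;  // Which transform to apply.
  int32_t parameter;       // Bucket count or truncate width; ignored for identity.
};

using ToyRow = std::vector<int32_t>;
using ToyExpr = std::function<int32_t(const ToyRow&)>;

// Builder step: turn each partition field into a callable "expression".
std::vector<ToyExpr> toToyExpressions(const std::vector<ToyField>& spec) {
  std::vector<ToyExpr> exprs;
  exprs.reserve(spec.size());
  for (const auto& field : spec) {
    const std::size_t column = field.column;
    const int32_t n = field.parameter;
    switch (field.transform) {
      case ToyTransform::kIdentity:
        exprs.push_back([column](const ToyRow& row) { return row[column]; });
        break;
      case ToyTransform::kBucket:
        assert(n > 0);
        // Placeholder: non-negative modulo, NOT Iceberg's Murmur3 bucket hash.
        exprs.push_back([column, n](const ToyRow& row) {
          return ((row[column] % n) + n) % n;
        });
        break;
      case ToyTransform::kTruncate:
        assert(n > 0);
        // Round down to a multiple of the width, including for negatives.
        exprs.push_back([column, n](const ToyRow& row) {
          const int32_t v = row[column];
          return v - (((v % n) + n) % n);
        });
        break;
    }
  }
  return exprs;
}

// Evaluator step: one pass over the rows, applying every expression per row.
std::vector<ToyRow> evaluateAll(
    const std::vector<ToyExpr>& exprs,
    const std::vector<ToyRow>& rows) {
  std::vector<ToyRow> result;
  result.reserve(rows.size());
  for (const auto& row : rows) {
    ToyRow out;
    out.reserve(exprs.size());
    for (const auto& expr : exprs) {
      out.push_back(expr(row));
    }
    result.push_back(std::move(out));
  }
  return result;
}
```

With a spec of identity on column 0, bucket(4) on column 1, and truncate(10) on column 1, a single call to `evaluateAll` yields all three partition values for each input row.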

@netlify

netlify bot commented Nov 7, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit fe7e347
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6911e45d50d45900096cac03

@meta-cla meta-cla bot added the CLA Signed label Nov 7, 2025
@PingLiuPing PingLiuPing requested review from mbasmanova and removed request for majetideepak November 7, 2025 15:20
@PingLiuPing
Collaborator Author

@mbasmanova This PR continues from #15035 (review).
In the next PR, I plan to include IcebergPartitionIdGenerator and refactor the Hive PartitionIdGenerator; by a rough count, that PR should also be around 700 lines. After that, all the preliminary work should be complete, and I’ll submit another PR to integrate them with IcebergDataSink.

@mbasmanova
Contributor

CI is red.


/// Converts a partition value to its string representation for use in
/// partition directory path. The format follows the Iceberg specification
/// for partition path encoding.
Contributor

Do you have a link to the spec? Would be nice to add here.

Collaborator Author

Thanks for the comment.
I could not find an explicit specification for this, so I followed the Iceberg Java code to implement it.
I will revise the comment.

explicit IcebergPartitionPath(TransformType transformType)
: transformType_(transformType) {}

~IcebergPartitionPath() override = default;
Contributor

not needed

Collaborator Author

Thanks.


using HivePartitionUtil::toPartitionString;

std::string toPartitionString(int32_t value, const TypePtr& type)
Contributor

perhaps, document the format of the result
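
As one possible way to document the result format, the sketch below renders temporal partition values the way I understand the Iceberg Java implementation's human-readable output to look: `yyyy` for year, `yyyy-MM` for month, `yyyy-MM-dd` for day, and `yyyy-MM-dd-HH` for hour, each counted from 1970-01-01. These formats are an assumption taken from that reading, not from a written specification, and the function names (`yearToPath`, `monthToPath`, `dayToPath`, `hourToPath`) are hypothetical, not the PR's API. The day conversion uses the well-known civil-from-days algorithm and assumes years in the range 0 to 9999.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Floor division/modulo so values before 1970 round toward negative infinity.
int64_t floorDiv(int64_t a, int64_t b) {
  const int64_t q = a / b;
  return (a % b != 0 && ((a < 0) != (b < 0))) ? q - 1 : q;
}

int64_t floorMod(int64_t a, int64_t b) {
  return a - floorDiv(a, b) * b;
}

struct CivilDate {
  int64_t year;
  unsigned month;  // 1..12
  unsigned day;    // 1..31
};

// Converts days since 1970-01-01 to a proleptic Gregorian date.
CivilDate civilFromDays(int64_t z) {
  z += 719468;
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
  const unsigned doe = static_cast<unsigned>(z - era * 146097);
  const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
  const int64_t y = static_cast<int64_t>(yoe) + era * 400;
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const unsigned mp = (5 * doy + 2) / 153;
  const unsigned d = doy - (153 * mp + 2) / 5 + 1;
  const unsigned m = mp < 10 ? mp + 3 : mp - 9;
  return {y + (m <= 2 ? 1 : 0), m, d};
}

// Year transform value (years since 1970) -> "yyyy".
std::string yearToPath(int32_t years) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04lld", 1970LL + years);
  return buf;
}

// Month transform value (months since 1970-01) -> "yyyy-MM".
std::string monthToPath(int32_t months) {
  const long long year = 1970 + floorDiv(months, 12);
  const int month = static_cast<int>(floorMod(months, 12)) + 1;
  assert(month >= 1 && month <= 12);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04lld-%02d", year, month);
  return buf;
}

// Day transform value (days since 1970-01-01) -> "yyyy-MM-dd".
std::string dayToPath(int32_t days) {
  const CivilDate c = civilFromDays(days);
  char buf[32];
  std::snprintf(
      buf, sizeof(buf), "%04lld-%02u-%02u",
      static_cast<long long>(c.year), c.month, c.day);
  return buf;
}

// Hour transform value (hours since 1970-01-01T00) -> "yyyy-MM-dd-HH".
std::string hourToPath(int32_t hours) {
  const int hour = static_cast<int>(floorMod(hours, 24));
  char buf[8];
  std::snprintf(buf, sizeof(buf), "-%02d", hour);
  return dayToPath(static_cast<int32_t>(floorDiv(hours, 24))) + buf;
}
```

For example, the timestamp 1609459200 used in the tests falls on day 18628, which this sketch renders as `2021-01-01`.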

/// Converts a partition value to its string representation for use in
/// partition directory path. The format follows the Iceberg specification
/// for partition path encoding.
class IcebergPartitionPath : public HivePartitionUtil {
Contributor

Would it be possible to extract this class into a separate PR and add a test?

Collaborator Author

Thank you for reviewing.

I will move this into a separate PR.

/// type, and optional parameter (e.g., bucket count, truncate width).
/// @param inputFieldName Name of the source column in the input RowVector.
/// @return Typed expression representing the transform.
static core::TypedExprPtr toExpression(
Contributor

Do we need to expose this API?

Collaborator Author

Thanks. I moved it to an anonymous namespace.


namespace {

constexpr char const* kIcebergFunctionPrefix{"iceberg_"};
Contributor

move this next to its only use

Collaborator Author

Thanks.

class TransformTest : public test::IcebergTestBase {
protected:
template <typename T>
VectorPtr createFlatVector(
Contributor

Why not use VectorMaker or helper methods from VectorTestBase?

Collaborator Author

Thanks. Changed to VectorMaker


// Create expected vector and verify results.
auto expectedVector = createFlatVector<OUT>(expectedValues);
ASSERT_EQ(resultVector[0]->size(), expectedValues.size());
Contributor

use assertEqualVectors or similar from VectorTestBase.h

Collaborator Author

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from ff2e9e0 to ae18463 Compare November 7, 2025 19:47
case TransformType::kTruncate:
return type;
}
VELOX_UNREACHABLE("Unknown transform type");
Collaborator Author

@mbasmanova I had to add this to fix the build break on Linux. macOS builds fine without this change.
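
The likely cause is that GCC can report "control reaches end of non-void function" (`-Wreturn-type`) even when a `switch` handles every enumerator, because an enum object may hold a value outside the named set; Clang typically stays quiet in that case. A minimal sketch of the pattern in standard C++ follows. `ToyTransformType` and `resultTypeName` are hypothetical names, the bucket result type shown is an assumption, and a thrown `std::logic_error` stands in for `VELOX_UNREACHABLE`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical enum for illustration; not the PR's TransformType.
enum class ToyTransformType { kIdentity, kBucket, kTruncate };

// Returns the name of the result type for a transform applied to inputType.
std::string resultTypeName(
    ToyTransformType transformType,
    const std::string& inputType) {
  switch (transformType) {
    case ToyTransformType::kIdentity:
      return inputType;
    case ToyTransformType::kBucket:
      // Assumption for this sketch: bucket yields an integer bucket number.
      return "INTEGER";
    case ToyTransformType::kTruncate:
      return inputType;
  }
  // Reached only if transformType holds a value outside the enumerators.
  // Without this statement, GCC may warn (or fail with -Werror) that control
  // reaches the end of a non-void function.
  throw std::logic_error("Unknown transform type");
}
```

Leaving out a `default:` label keeps the compiler's missing-case warning (`-Wswitch`) useful when a new enumerator is added later.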

Comment on lines 32 to 33
const std::vector<std::optional<IN>>& inputValues,
const std::vector<std::optional<OUT>>& expectedValues,
Contributor

@mbasmanova mbasmanova Nov 7, 2025

Can we replace these arguments with RowVectorPtr and use makeFlatVector and similar in the callers?

Collaborator Author

Thanks. Just updated.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from ae18463 to d007247 Compare November 7, 2025 22:15
protected:
void testTransform(
const IcebergPartitionSpec::Field& field,
const RowVectorPtr& inputRowVector,
Contributor

nit: inputRowVector -> input, expectedRowVector -> expected

Collaborator Author

Thanks.

auto transformExprs = TransformExprBuilder::toExpressions(
spec,
std::vector<column_index_t>{0},
asRowType(inputRowVector->type()));
Contributor

input->rowType()

Collaborator Author

Thanks

const RowVectorPtr& inputRowVector,
const RowVectorPtr& expectedRowVector) {
// Build and evaluate transform expressions.
auto spec = std::make_shared<const IcebergPartitionSpec>(
Contributor

Why do we test single field specs only? Why not test multi-field specs?

Collaborator Author

@PingLiuPing PingLiuPing Nov 8, 2025

Thank you for the comment.
I added those tests one at a time as I implemented each transform, type by type.
I will refactor the test to cover multiple transforms.

{makeFlatVector<StringView>(
{StringView("\x01\x02\x03", 3), StringView("", 0)},
VARBINARY())}),
makeRowVector({makeFlatVector<StringView>(
Contributor

drop StringView or use std::string

makeFlatVector<std::string>({"foo", "bar"})

Collaborator Author

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from d007247 to 8047d9b Compare November 8, 2025 20:26

// Register Iceberg functions once for expression evaluation.
static std::once_flag registerFlag;
static constexpr char const* kIcebergFunctionPrefix{"iceberg_"};
Contributor

The iceberg_ prefix is now hard-coded in two places: here and in TransformExprBuilder.cpp.

Also, it seems better to make these functions available unconditionally, not only after running a CTAS or INSERT INTO query.

I suggest moving the logic for registering these functions to the same place where we register the Iceberg connector, and adding a constant for the prefix that can be reused instead of hard-coding it multiple times.

Collaborator Author

Thank you for the comment.
Iceberg connector does not exist at the moment. Can I register these functions when registering Hive connector?

Contributor

> Iceberg connector does not exist at the moment.

Hmm... but then, how can it be used? Should we create one or at least add a function to register what's needed for Iceberg?

Collaborator Author

Currently it depends on HiveConnector::createDataSink to create an IcebergDataSink.

> Should we create one or at least add a function to register what's needed for Iceberg?

How about adding a function first in this PR, and adding IcebergConnector in a separate PR?

Contributor

> How about adding a function first in this PR, and adding IcebergConnector in a separate PR?

Sounds good.
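
The agreed direction (one registration function, one shared prefix constant) could look roughly like the toy sketch below. Everything in the `toy` namespace is hypothetical: the registry is a plain `std::map` rather than Velox's function registry, and a function-local static initializer stands in for the `std::once_flag` approach in the PR, since both make the registration body run once.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

namespace toy {

// Single definition of the prefix, shared by registration and lookups.
constexpr const char* kIcebergFunctionPrefix{"iceberg_"};

using ToyFunction = std::function<int32_t(int32_t, int32_t)>;

// Hypothetical process-wide registry keyed by the prefixed function name.
std::map<std::string, ToyFunction>& registry() {
  static std::map<std::string, ToyFunction> functions;
  return functions;
}

// Counts how many times the registration body actually ran.
int& registrationRuns() {
  static int runs = 0;
  return runs;
}

// Builds the registered name from the shared prefix.
std::string qualifiedName(const std::string& name) {
  return std::string(kIcebergFunctionPrefix) + name;
}

// Safe to call from several places; the body runs once per process because
// initialization of a function-local static is performed exactly once.
void registerIcebergFunctions() {
  static const bool registered = [] {
    ++registrationRuns();
    registry()[qualifiedName("identity")] = [](int32_t v, int32_t) {
      return v;
    };
    registry()[qualifiedName("truncate")] = [](int32_t v, int32_t width) {
      assert(width > 0);
      return v - (((v % width) + width) % width);
    };
    return true;
  }();
  (void)registered;
}

} // namespace toy
```

Callers that build expressions would then use `toy::qualifiedName("truncate")` instead of spelling out the prefix, so the literal appears in only one place.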

makeFlatVector<std::string>({("\x01\x02\x03"), ("")}, VARBINARY()),
makeFlatVector<Timestamp>(
{Timestamp(0, 0), Timestamp(1609459200, 0)})}),
makeRowVector(
Contributor

Let's not copy-paste expected values for identity transform. Let's use 'input'.

Collaborator Author

Thanks.

makeNullableFlatVector<Timestamp>({std::nullopt}),
makeNullableFlatVector<Timestamp>({std::nullopt}),
}),
makeRowVector({
Contributor

ditto

Collaborator Author

Thanks. This case tests all transforms on a null value, and most of the expected vector types differ from the input types.


TEST_F(TransformTest, multipleTransforms) {
const auto& rowType =
ROW({"c_int", "c_date", "c_varchar"}, {INTEGER(), DATE(), VARCHAR()});
Contributor

no need to use custom column names; use c0, c1, c2... and then drop rowType->name() argument from makeRowVector; ditto all other places

Collaborator Author

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from 8047d9b to fe7e347 Compare November 10, 2025 13:10
Contributor

@mbasmanova mbasmanova left a comment

Thanks.

@mbasmanova mbasmanova added the ready-to-merge label Nov 10, 2025
@mbasmanova mbasmanova changed the title from "feat: Add iceberg partition transform evaluator and utility" to "feat: Add support for evaluating Iceberg partition transforms" Nov 10, 2025
@PingLiuPing
Collaborator Author

@peterenescu Could you please help to import this PR? Thank you.

@meta-codesync

meta-codesync bot commented Nov 10, 2025

@peterenescu has imported this pull request. If you are a Meta employee, you can view this in D86686271.

@meta-codesync

meta-codesync bot commented Nov 10, 2025

@peterenescu merged this pull request in bef78ba.

prestodb-ci pushed a commit to IBM/velox that referenced this pull request Nov 11, 2025
…facebookincubator#15440)"

This reverts commit bef78ba.

Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
prestodb-ci pushed a commit to IBM/velox that referenced this pull request Nov 12, 2025
…facebookincubator#15440)"

This reverts commit bef78ba.

Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
@facebook-github-bot
Contributor

This pull request has been reverted by 4056e41.

Labels

CLA Signed, Merged, ready-to-merge, Reverted

3 participants