-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat: Add support for evaluating Iceberg partition transforms #15440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for evaluating Iceberg partition transforms #15440
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
@mbasmanova This PR continues from #15035 (review). |
|
Ci is red |
|
|
||
| /// Converts a partition value to its string representation for use in | ||
| /// partition directory path. The format follows the Iceberg specification | ||
| /// for partition path encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a link to the spec? Would be nice to add here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment.
I cannot find an explicit document about this. I will revise the comment.
I followed the Iceberg java code to implement this.
| explicit IcebergPartitionPath(TransformType transformType) | ||
| : transformType_(transformType) {} | ||
|
|
||
| ~IcebergPartitionPath() override = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
|
||
| using HivePartitionUtil::toPartitionString; | ||
|
|
||
| std::string toPartitionString(int32_t value, const TypePtr& type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, document the format of the result
| /// Converts a partition value to its string representation for use in | ||
| /// partition directory path. The format follows the Iceberg specification | ||
| /// for partition path encoding. | ||
| class IcebergPartitionPath : public HivePartitionUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to extract this class into a separate PR and add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reviewing.
I will separate this to a new PR.
| /// type, and optional parameter (e.g., bucket count, truncate width). | ||
| /// @param inputFieldName Name of the source column in the input RowVector. | ||
| /// @return Typed expression representing the transform. | ||
| static core::TypedExprPtr toExpression( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to expose this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I moved it to anonymous namespace.
|
|
||
| namespace { | ||
|
|
||
| constexpr char const* kIcebergFunctionPrefix{"iceberg_"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this next to its only use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| class TransformTest : public test::IcebergTestBase { | ||
| protected: | ||
| template <typename T> | ||
| VectorPtr createFlatVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use VectorMaker or helper methods from VectorTestBase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Changed to VectorMaker
|
|
||
| // Create expected vector and verify results. | ||
| auto expectedVector = createFlatVector<OUT>(expectedValues); | ||
| ASSERT_EQ(resultVector[0]->size(), expectedValues.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertEqualVectors or similar from VectorTestBase.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
ff2e9e0 to
ae18463
Compare
| case TransformType::kTruncate: | ||
| return type; | ||
| } | ||
| VELOX_UNREACHABLE("Unknown transform type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova Have to add this to fix the building break on Linux. Macos is fine without this change.
| const std::vector<std::optional<IN>>& inputValues, | ||
| const std::vector<std::optional<OUT>>& expectedValues, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace these arguments with RowVectorPtr and use makeFlatVector and similar in the callers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Just updated.
ae18463 to
d007247
Compare
| protected: | ||
| void testTransform( | ||
| const IcebergPartitionSpec::Field& field, | ||
| const RowVectorPtr& inputRowVector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inputRowVector -> input, expectedRowVector -> expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| auto transformExprs = TransformExprBuilder::toExpressions( | ||
| spec, | ||
| std::vector<column_index_t>{0}, | ||
| asRowType(inputRowVector->type())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input->rowType()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
| const RowVectorPtr& inputRowVector, | ||
| const RowVectorPtr& expectedRowVector) { | ||
| // Build and evaluate transform expressions. | ||
| auto spec = std::make_shared<const IcebergPartitionSpec>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we test single field specs only? Why not test multi-field specs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comment.
I added those test one by one when I implement the transformation logic one by one, and type by type.
I will refactor it to support test multiple transforms.
| {makeFlatVector<StringView>( | ||
| {StringView("\x01\x02\x03", 3), StringView("", 0)}, | ||
| VARBINARY())}), | ||
| makeRowVector({makeFlatVector<StringView>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop StringView or use std::string
makeFlatVector<std::string>({"foo", "bar"})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
d007247 to
8047d9b
Compare
|
|
||
| // Register Iceberg functions once for expression evaluation. | ||
| static std::once_flag registerFlag; | ||
| static constexpr char const* kIcebergFunctionPrefix{"iceberg_"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceberg_ prefix is now hard-coded in 2 places: here and in TransformExprBuilder.cpp
Also, It seems better to make these functions available unconditionally, not only after running a CTAS or INSERT INTO query.
I suggest to move the logic for registering these functions to the same place where we register Iceberg connector and add a constant for prefix, which can be reused without hard-coding multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comment.
Iceberg connector does not exist at the moment. Can I register these functions when registering Hive connector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg connector does not exist at the moment.
Hmm... but then, how can it be used? Should we create one or at least add a function to register what's needed for Iceberg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it depends on HiveConnector::createDataSink to create an IcebergDataSink.
Should we create one or at least add a function to register what's needed for Iceberg?
How about adding a function first in this PR, and adding IcebergConnector in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a function first in this PR, and adding IcebergConnector in a separate PR?
Sounds good.
| makeFlatVector<std::string>({("\x01\x02\x03"), ("")}, VARBINARY()), | ||
| makeFlatVector<Timestamp>( | ||
| {Timestamp(0, 0), Timestamp(1609459200, 0)})}), | ||
| makeRowVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not copy-paste expected values for identity transform. Let's use 'input'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| makeNullableFlatVector<Timestamp>({std::nullopt}), | ||
| makeNullableFlatVector<Timestamp>({std::nullopt}), | ||
| }), | ||
| makeRowVector({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this case tested all transforms on null value. Most of the expected vector type is different with input.
|
|
||
| TEST_F(TransformTest, multipleTransforms) { | ||
| const auto& rowType = | ||
| ROW({"c_int", "c_date", "c_varchar"}, {INTEGER(), DATE(), VARCHAR()}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to use custom column names; use c0, c1, c2... and then drop rowType->name() argument from makeRowVector; ditto all other places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
8047d9b to
fe7e347
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
@peterenescu Could you please help to import this PR? Thank you. |
|
@peterenescu has imported this pull request. If you are a Meta employee, you can view this in D86686271. |
|
@peterenescu merged this pull request in bef78ba. |
…facebookincubator#15440)" This reverts commit bef78ba. Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
…facebookincubator#15440)" This reverts commit bef78ba. Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
…facebookincubator#15440)" This reverts commit bef78ba. Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
…facebookincubator#15440)" This reverts commit bef78ba. Alchemy-item: (ID = 854) Iceberg staging hub commit 2/14 - ecf692f
|
This pull request has been reverted by 4056e41. |
Introduce infrastructure for evaluating Iceberg partition transforms.