fix: Remove redundant partition value handling in Iceberg column adaptation #14516

Conversation

✅ Deploy Preview for meta-velox canceled.
@majetideepak Can you help review this PR? Thank you very much.
@zhli1142015 I think PR #12910 (a dependency of this PR) will be merged soon.
Force-pushed from a372d99 to 90c79bb.
@mbasmanova The purpose of this PR is to address certain limitations in the Iceberg reader. When a partition column is defined in an Iceberg table, we cannot reuse the Hive reader to query that table. This is because the Hive writer writes only non-partition columns to data files, and the partition column values are derived from the partition map during the query phase (there is special logic in the Hive reader for this). In contrast, Iceberg requires all columns (both partition and non-partition) to be written to data files, so we do not need that special logic when reading. Moreover, for most Iceberg partition transforms it is not possible to reconstruct the original column value from the partition value.
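To illustrate that last point, here is a minimal, hypothetical C++ sketch (not Velox code; Iceberg's real bucket transform uses a Murmur3 hash, and `std::hash` is only a stand-in) showing why a bucket-partitioned column cannot be reconstructed from its partition value:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>

// Stand-in for Iceberg's bucket transform (the real spec uses Murmur3).
// The mapping is many-to-one, so it cannot be inverted.
int32_t bucketOf(int64_t value, int32_t numBuckets) {
  return static_cast<int32_t>(std::hash<int64_t>{}(value) % numBuckets);
}

int main() {
  // Distinct column values can land in the same bucket, so populating the
  // column from the partition value (as the Hive read path does for
  // identity-partitioned columns) would produce wrong results for Iceberg.
  std::cout << bucketOf(1001, 16) << " " << bucketOf(2042, 16) << "\n";
  return 0;
}
```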
      VectorPtr& output,
      const std::vector<BaseVector::CopyRange>& ranges);

  void setPartitionValue(
Would you document this method?
Thanks. I added the documentation.
velox/connectors/hive/SplitReader.h
Outdated
- std::vector<TypePtr> adaptColumns(
+ virtual std::vector<TypePtr> adaptColumns(
      const RowTypePtr& fileType,
      const std::shared_ptr<const velox::RowType>& tableSchema) const;
nit: RowTypePtr
Thanks.
  uint64_t next(uint64_t size, VectorPtr& output) override;

 private:
  std::vector<TypePtr> adaptColumns(
Would you document this method?
Thanks. I added documentation to this method.
 private:
  std::vector<TypePtr> adaptColumns(
      const RowTypePtr& fileType,
      const std::shared_ptr<const velox::RowType>& tableSchema) const override;
RowTypePtr
Thanks.
  for (auto i = 0; i < childrenSpecs.size(); ++i) {
    auto* childSpec = childrenSpecs[i].get();
Can this be a for-each loop?
for (const auto& child : childrenSpecs)
Yes.
  HiveConnectorTestBase::assertQuery(plan, splits, "SELECT 0, '2018-04-06'");
}

TEST_F(HiveIcebergTest, testReadDecimal) {
Drop the 'test' prefix from test method names (please also update pre-existing method names in this file).
Thanks, updated.
}

TEST_F(HiveIcebergTest, testReadDecimal) {
  RowTypePtr rowType{ROW({"c0", "price"}, {BIGINT(), DECIMAL(10, 2)})};
auto rowType = ROW(...)
Thanks.
  VectorPtr expectedPrice =
      makeFlatVector<int64_t>({12345, 12345, 12345}, DECIMAL(10, 2));
  if (i == 1) {
    expectedPrice =
        makeFlatVector<int64_t>({67890, 67890, 67890}, DECIMAL(10, 2));
  }
this code is confusing; consider rewriting
std::vector<int64_t> unscaledPartitionValues = {12345, 67890};
...
auto expectedPrice =
makeFlatVector<int64_t>(3, [&](auto) {return unscaledPartitionValues[i];}, DECIMAL(10, 2));
Thanks, I re-implemented the test to more accurately cover the code changes.
  std::make_shared<HiveColumnHandle>(
      "c0",
      HiveColumnHandle::ColumnType::kRegular,
      rowType->childAt(0),
      rowType->childAt(0))});
Add a helper method to reduce copy-paste. Update existing code in this file as well.
makeColumnHandle("c0", rowType->childAt(0));
makePartitionKeyHandle("price", rowType->childAt(1));
Thanks. Added makeColumnHandles to create the column handle map.
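For reference, here is a minimal sketch of what such helpers might look like, assuming the HiveColumnHandle constructor shape from the snippet above and that ColumnType has a kPartitionKey member; the PR ultimately added a makeColumnHandles helper instead, so these exact names are hypothetical:

```cpp
// Hypothetical helpers; names follow the reviewer's suggestion, and the
// HiveColumnHandle constructor arguments mirror the snippet above.
std::shared_ptr<HiveColumnHandle> makeColumnHandle(
    const std::string& name,
    const TypePtr& type) {
  return std::make_shared<HiveColumnHandle>(
      name, HiveColumnHandle::ColumnType::kRegular, type, type);
}

std::shared_ptr<HiveColumnHandle> makePartitionKeyHandle(
    const std::string& name,
    const TypePtr& type) {
  return std::make_shared<HiveColumnHandle>(
      name, HiveColumnHandle::ColumnType::kPartitionKey, type, type);
}
```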
  assertQuery(plan, splits, "SELECT * FROM tmp", 0);
}

TEST_F(HiveIcebergTest, testAddNewColumn) {
These tests are quite repetitive. Consider refactoring to reduce boilerplate and copy-paste.
Thanks.
Which error do you meet? Does it only occur for decimal columns? Why?
std::vector<TypePtr> IcebergSplitReader::adaptColumns(
    const RowTypePtr& fileType,
    const std::shared_ptr<const velox::RowType>& tableSchema) const {
  std::vector<TypePtr> columnTypes = fileType->children();
Please respect infoColumns; Iceberg may also have infoColumns such as _delete.

I don't see an error in the Gluten unit tests; do you have a unit test in Presto to reproduce it?

@jinchengchenghh With or without IBM#425?
Thanks, I think I need to change the PR description a little bit. When I opened this PR, #12910 had not been merged, and at that time there were errors when reading a decimal partition column. But the second point in the PR description stands. See #14516 (comment).
In Gluten, we test two modes, with or without IBM#425; both of them pass all the tests.

Ok, without IBM#425, do you have a test case that queries a partitioned Iceberg table?

Yes, this is a feature added long ago; our customers also use it.
  writeToFile(
      dataFilePath->getPath(), dataVectors, config_, flushPolicyFactory_);
This code repeats. Consider adding a helper method to shorten it to
writeToFile(dataFilePath, dataVectors);
Thanks.
  auto icebergSplits = makeIcebergSplits(dataFilePath->getPath());

  // Read with new schema (c0, c1, and c2).
  auto plan = PlanBuilder(pool_.get())
Drop the pool_.get() parameter. It is not needed.
Thanks.
  auto expectedRegion = makeFlatVector<std::string>({"US", "US", "US"});
  auto expectedYear = makeFlatVector<int32_t>({2025, 2025, 2025});
  std::vector<RowVectorPtr> expectedVectors;
  expectedVectors.push_back(makeRowVector(
consider
makeRowVector(tableRowType->names(), {
makeFlatVector<std::string>({"US", "US", "US"}),
makeFlatVector<int32_t>({2025, 2025, 2025}),
})
No need for expectedRegion and expectedYear variables. Also, no need to repeat column names.
Thanks.
Force-pushed from 58a3a11 to 7ffa060.
velox/connectors/hive/SplitReader.h
Outdated
namespace facebook::velox::connector::hive {

/// Creates a constant vector from a string representation of a value.
nit: a constant vector of size 1
Thanks.
velox/connectors/hive/SplitReader.h
Outdated
/// Creates a constant vector from a string representation of a value.
///
/// This function is primarily used to materialize partition column values and
This function is primarily used
Drop "This function is primarily". Start with an active verb: Used to ...
Thanks.
velox/connectors/hive/SplitReader.h
Outdated
/// Creates a constant vector from a string representation of a value.
///
/// This function is primarily used to materialize partition column values and
/// info columns (e.g., $path, $file_size) when reading Hive/Iceberg tables.
Hive/Iceberg -> Hive and Iceberg
Thanks.
velox/connectors/hive/SplitReader.h
Outdated
/// the same way as CAST(x as VARCHAR). Date values must be formatted using ISO
/// 8601 as YYYY-MM-DD. If nullopt, creates a null constant vector.
/// @param pool Memory pool for allocating the constant vector.
/// @param asLocalTime If true and type is TIMESTAMP, interprets the string
Love this documentation. Detailed and clear. Thank you for taking the time to write it.
Cheers.
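For context, here is a usage sketch pieced together from the doc fragments quoted in this thread; the full signature is not shown here, so the parameter order and exact argument set are assumptions rather than the confirmed API:

```cpp
// Hypothetical usage of newConstantFromString, reconstructed from the doc
// fragments above; the parameter order and argument set are assumptions.
auto dateConstant = newConstantFromString(
    DATE(),                                    // target type
    std::optional<std::string>("2018-04-06"),  // ISO 8601 date, per the docs
    pool,                                      // memory pool for the vector
    /*asLocalTime=*/false,                     // only relevant for TIMESTAMP
    /*isPartitionDateDaysSinceEpoch=*/false);  // DATE given as a string here
```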
- HiveConnectorTestBase::assertQuery(
+ auto plan =
+     PlanBuilder(pool_.get()).tableScan(rowType, {}, "", nullptr).planNode();
drop pool_.get() argument; ditto other places
@mbasmanova Thanks.
I removed most of them; there is one place left where the pool cannot be removed:

// Test filter on non-partitioned date column
std::vector<std::string> filters = {"ds = date'2018-04-06'"};
plan = PlanBuilder(pool_.get()).tableScan(rowType, filters).planNode();

When I removed it, the test case crashed at line 392 in 6851fd3:

pool->preferredSize(checkedPlus<size_t>(size, kPaddedSize));

But reading this case carefully, I don't think it makes sense. For Iceberg:
- There is no difference between reading a partitioned table and a non-partitioned table.
- Even if we wanted to test reading a partitioned table, we should first write a partitioned Iceberg table, but this case does not write a partitioned table at all.
So this case can probably be deleted; what do you think?
I'm not sure I follow.
  expectedVectors.push_back(makeRowVector(
      {c0,
       makeNullConstant(TypeKind::INTEGER, 3),
       makeNullConstant(TypeKind::VARCHAR, 3)}));
Add a comma after the last vector to improve readability. Apply the same refactoring to other calls to makeRowVector.
Thanks.
There are places where a vector is used twice, so I initially kept the local variable.
I have now removed all the local vector variables and fetch them from the row vector where they are used a second time.
mbasmanova
left a comment
Looks great. Thank you for iterating.
Force-pushed from 7ffa060 to ced9ba5.
    const std::optional<std::string>& value,
    velox::memory::MemoryPool* pool,
    bool asLocalTime,
    bool isPartitionDateDaysSinceEpoch) {
Let's rename these arguments as well.
  template <TypeKind kind>
- VectorPtr newConstantFromString(
+ VectorPtr newConstantFromStringImpl(
Let's rename arguments and remove default value.
Sure. I missed this.
mbasmanova
left a comment
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this in D85344487.
Force-pushed from ced9ba5 to 333e99e.
}
} // namespace

VectorPtr newConstantFromString(
Can we move this into the anonymous namespace?
@Yuhta This is a public API now.
@mbasmanova merged this pull request in d0f6a24.
fix: Remove redundant partition value handling in Iceberg column adaptation (facebookincubator#14516)

Summary: Implement IcebergSplitReader::adaptColumns. Without overriding this method, the current Iceberg reader uses SplitReader::adaptColumns, which is specific to the Hive implementation. One difference between Hive and Iceberg is that the Iceberg spec requires all columns to be written to data files, while Hive writes only non-partition columns to data files. There is special logic in SplitReader::adaptColumns to handle this during reads, but in Iceberg it is not needed. Also, for Hive, it populates the column data with the partition value directly; this is correct for Hive only. For Iceberg, there are different kinds of transforms other than the identity transform, and we cannot deduce the original column value from the partition value.

Pull Request resolved: facebookincubator#14516
Reviewed By: Yuhta
Differential Revision: D85344487
Pulled By: mbasmanova
fbshipit-source-id: e0de6bdd44257a37f6727840803754639accdfd1
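As a rough illustration of the change described in this summary, here is a minimal sketch of the override; this is not the actual implementation (it omits, for example, the handling of info columns such as _delete raised during review), and the RowTypePtr parameter spelling follows the reviewer's nit:

```cpp
// Sketch only: keep the types of the columns physically present in the
// data file instead of substituting partition values the way the Hive
// base class does.
std::vector<TypePtr> IcebergSplitReader::adaptColumns(
    const RowTypePtr& fileType,
    const RowTypePtr& tableSchema) const {
  // Iceberg data files contain every column, including partition columns.
  std::vector<TypePtr> columnTypes = fileType->children();
  // Unlike SplitReader::adaptColumns, do NOT overwrite partition columns
  // with constants derived from partition values: Iceberg transforms such
  // as bucket(...) or truncate(...) are not invertible.
  return columnTypes;
}
```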