@PingLiuPing
Collaborator

@PingLiuPing PingLiuPing commented Aug 19, 2025

Implement IcebergSplitReader::adaptColumns. Without this override, the Iceberg reader falls back to SplitReader::adaptColumns, which is specific to the Hive implementation. One difference between Hive and Iceberg is that the Iceberg spec requires all columns to be written to data files, whereas Hive writes only non-partition columns to data files. SplitReader::adaptColumns therefore contains special logic to handle this during reads, which Iceberg does not need.
Hive also populates the column data directly from the partition value. This is correct for Hive only: Iceberg supports partition transforms other than the identity transform, and for those we cannot deduce the original column value from the partition value.

@PingLiuPing PingLiuPing self-assigned this Aug 19, 2025
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 19, 2025
@PingLiuPing
Collaborator Author

@majetideepak Could you help review this PR? Thank you very much.

@PingLiuPing PingLiuPing changed the title from "fix: Read iceberg table decimal column error" to "fix: Read error from iceberg table" Sep 9, 2025
@PingLiuPing
Collaborator Author

@zhli1142015 I think PR #12910 (a dependency of this PR) will be merged soon.
Could you please review this PR as well? It fixes some scenarios when reading Iceberg tables.
I have follow-up PRs planned to address issues with querying Iceberg tables that have undergone schema evolution.
Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_fix_iceberg_read_decimal branch 2 times, most recently from a372d99 to 90c79bb Compare October 17, 2025 16:52
@PingLiuPing
Collaborator Author

@mbasmanova
Per our discussion in 15035, I plan to integrate the actual partition transform logic into that PR. To support this, I need to land this PR first so that the unit tests can run.

The purpose of this PR is to address certain limitations in the Iceberg reader. When an Iceberg table defines a partition column, we cannot reuse the Hive reader to query it. This is because the Hive writer only writes non-partition columns to data files, and the partition column values are derived from the partition map at query time (there is special logic for this in the Hive reader).

In contrast, Iceberg requires all columns (both partition and non-partition) to be written to data files, so reading does not need that special logic. Moreover, for most Iceberg partition transforms, it's not possible to reconstruct the original column value from the partition value.
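
The difference described above can be sketched with a toy model. This is only an illustration under stated assumptions: `Column`, `DataFile`, `PartitionMap`, `readHiveStyle` and `readIcebergStyle` are hypothetical stand-ins, not Velox types or functions, and real readers work on vectors rather than strings.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Velox types.
using Column = std::vector<std::string>;
// Column name -> values physically stored in the data file.
using DataFile = std::map<std::string, Column>;
// Partition key name -> partition value taken from the split metadata.
using PartitionMap = std::map<std::string, std::string>;

// Hive-style read: a partition column is not stored in the file, so every row
// is filled with the constant value from the partition map.
Column readHiveStyle(
    const DataFile& file,
    const PartitionMap& partitionKeys,
    const std::string& name,
    std::size_t numRows) {
  auto it = partitionKeys.find(name);
  if (it != partitionKeys.end()) {
    return Column(numRows, it->second);
  }
  return file.at(name);
}

// Iceberg-style read: the column is stored in the file, so it is read from
// there and the partition value is not used to build the result.
Column readIcebergStyle(const DataFile& file, const std::string& name) {
  auto it = file.find(name);
  assert(it != file.end());
  return it->second;
}
```

With a file that stores `id = {17, 42, 99}` and a partition value of `4` for `id` (for example a bucket number), the Hive-style path would return `4` for every row, while the Iceberg-style path returns the stored values.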

VectorPtr& output,
const std::vector<BaseVector::CopyRange>& ranges);

void setPartitionValue(
Contributor

Would you document this method?

Collaborator Author

Thanks.
I added the documentation.

std::vector<TypePtr> adaptColumns(
virtual std::vector<TypePtr> adaptColumns(
const RowTypePtr& fileType,
const std::shared_ptr<const velox::RowType>& tableSchema) const;
Contributor

nit: RowTypePtr

Collaborator Author

Thanks.

uint64_t next(uint64_t size, VectorPtr& output) override;

private:
std::vector<TypePtr> adaptColumns(
Contributor

Would you document this method?

Collaborator Author

Thanks.
I added documentation for this method.

private:
std::vector<TypePtr> adaptColumns(
const RowTypePtr& fileType,
const std::shared_ptr<const velox::RowType>& tableSchema) const override;
Contributor

RowTypePtr

Collaborator Author

Thanks.

Comment on lines 150 to 151
for (auto i = 0; i < childrenSpecs.size(); ++i) {
auto* childSpec = childrenSpecs[i].get();
Contributor

Can this be a for-each loop?

for (const auto& child : childrenSpecs)

Collaborator Author

Yes.
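
A minimal sketch of the for-each form suggested above, assuming the children are held in a vector of owning pointers. `ToySpec` and `collectFieldNames` are hypothetical names used only for this illustration; they are not the Velox scan spec API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a scan spec node; not the Velox class.
struct ToySpec {
  std::string fieldName;
};

// Visits each child by const reference. No index variable is needed, which
// also avoids comparing a signed int counter against an unsigned size().
std::vector<std::string> collectFieldNames(
    const std::vector<std::unique_ptr<ToySpec>>& childrenSpecs) {
  std::vector<std::string> names;
  names.reserve(childrenSpecs.size());
  for (const auto& child : childrenSpecs) {
    assert(child != nullptr);
    names.push_back(child->fieldName);
  }
  return names;
}
```

The index form is still the better fit when the loop body needs the position itself, for example to look up a parallel array.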

HiveConnectorTestBase::assertQuery(plan, splits, "SELECT 0, '2018-04-06'");
}

TEST_F(HiveIcebergTest, testReadDecimal) {
Contributor

drop 'test' prefix from test method names (please, update pre-existing method names in this file)

Collaborator Author

Thanks, updated.

}

TEST_F(HiveIcebergTest, testReadDecimal) {
RowTypePtr rowType{ROW({"c0", "price"}, {BIGINT(), DECIMAL(10, 2)})};
Contributor

auto rowType = ROW(...)

Collaborator Author

Thanks.

Comment on lines 863 to 868
VectorPtr expectedPrice =
makeFlatVector<int64_t>({12345, 12345, 12345}, DECIMAL(10, 2));
if (i == 1) {
expectedPrice =
makeFlatVector<int64_t>({67890, 67890, 67890}, DECIMAL(10, 2));
}
Contributor

this code is confusing; consider rewriting

std::vector<int64_t> unscaledPartitionValues = {12345, 67890};
...
auto expectedPrice =
        makeFlatVector<int64_t>(3, [&](auto) {return unscaledPartitionValues[i];}, DECIMAL(10, 2));

Collaborator Author

Thanks, I re-implemented the test to more accurately cover the code changes.

Comment on lines 876 to 880
std::make_shared<HiveColumnHandle>(
"c0",
HiveColumnHandle::ColumnType::kRegular,
rowType->childAt(0),
rowType->childAt(0))});
Contributor

Add a helper method to reduce copy-paste. Update existing code in this file as well.

makeColumnHandle("c0", rowType->childAt(0));
makePartitionKeyHandle("price", rowType->childAt(1));

Collaborator Author

Thanks. Added makeColumnHandles to create the column handle map.

assertQuery(plan, splits, "SELECT * FROM tmp", 0);
}

TEST_F(HiveIcebergTest, testAddNewColumn) {
Contributor

These tests are quite repetitive. Consider refactoring to reduce boilerplate and copy-paste.

Collaborator Author

Thanks.

@jinchengchenghh
Collaborator

Which error did you hit? Does it only occur with decimal columns? Why?

std::vector<TypePtr> IcebergSplitReader::adaptColumns(
const RowTypePtr& fileType,
const std::shared_ptr<const velox::RowType>& tableSchema) const {
std::vector<TypePtr> columnTypes = fileType->children();
Collaborator

Please respect infoColumns; Iceberg may also have info columns such as _delete.

@jinchengchenghh
Collaborator

I don't see an error in the Gluten unit tests. Do you have a unit test in Presto that reproduces it?

@PingLiuPing
Collaborator Author

I don't see an error in the Gluten unit tests. Do you have a unit test in Presto that reproduces it?

@jinchengchenghh With or without IBM#425?

@PingLiuPing
Collaborator Author

Which error did you hit? Does it only occur with decimal columns? Why?

Thanks, I think I need to change the PR description a little bit. When I opened this PR, #12910 had not been merged, and at that time reading a decimal partition column produced errors.

But the second point in PR description stands. See #14516 (comment).

@jinchengchenghh
Collaborator

In Gluten, we test two modes, with and without IBM#425, and both pass all the tests.

@PingLiuPing
Collaborator Author

In Gluten, we test two modes, with and without IBM#425, and both pass all the tests.

Ok, without IBM#425, do you have a test case that queries a partitioned Iceberg table?
The main purpose of this PR is to override adaptColumns so that the reader avoids using the constant partition value to construct the result.
Imagine there is a bucket transform and the bucket value is 4; there is no way to reconstruct the original column value from 4.
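
To make the bucket argument concrete, here is a small sketch. It assumes a plain modulo as a stand-in for the hash-based bucket transform that Iceberg defines, so `toyBucket` and `valuesInBucket` are hypothetical helpers and do not reproduce real Iceberg bucket numbers. Only the many-to-one shape of the mapping matters here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy bucket function. Assumption: modulo stands in for the real hash-based
// transform; it maps any value to a bucket in [0, numBuckets).
int32_t toyBucket(int64_t value, int32_t numBuckets) {
  assert(numBuckets > 0);
  const int64_t remainder = value % numBuckets;
  return static_cast<int32_t>(remainder < 0 ? remainder + numBuckets : remainder);
}

// Returns every value in [lo, hi] that lands in the given bucket. More than
// one result means the bucket number cannot identify the original value.
std::vector<int64_t> valuesInBucket(
    int64_t lo,
    int64_t hi,
    int32_t numBuckets,
    int32_t bucket) {
  std::vector<int64_t> matches;
  for (int64_t value = lo; value <= hi; ++value) {
    if (toyBucket(value, numBuckets) == bucket) {
      matches.push_back(value);
    }
  }
  return matches;
}
```

With 16 buckets, the values 4, 20 and 36 all fall into bucket 4 under this toy function, so a reader that only knows "bucket = 4" has no way to pick the right one and has to read the column from the data file.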

@jinchengchenghh
Collaborator

Yes, this feature was added a long time ago, and our customers also use it.

Comment on lines 882 to 883
writeToFile(
dataFilePath->getPath(), dataVectors, config_, flushPolicyFactory_);
Contributor

This code repeats. Consider adding a helper method to shorten it to

writeToFile(dataFilePath, dataVectors);

Collaborator Author

Thanks.

auto icebergSplits = makeIcebergSplits(dataFilePath->getPath());

// Read with new schema (c0, c1, and c2).
auto plan = PlanBuilder(pool_.get())
Contributor

Drop pool_.get() parameter. It is not needed.

Collaborator Author

Thanks.

auto expectedRegion = makeFlatVector<std::string>({"US", "US", "US"});
auto expectedYear = makeFlatVector<int32_t>({2025, 2025, 2025});
std::vector<RowVectorPtr> expectedVectors;
expectedVectors.push_back(makeRowVector(
Contributor

consider

makeRowVector(tableRowType->names(), {
   makeFlatVector<std::string>({"US", "US", "US"}),
   makeFlatVector<int32_t>({2025, 2025, 2025}),
})

No need for expectedRegion and expectedYear variables. Also, no need to repeat column names.

Collaborator Author

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_fix_iceberg_read_decimal branch 2 times, most recently from 58a3a11 to 7ffa060 Compare October 23, 2025 10:03

namespace facebook::velox::connector::hive {

/// Creates a constant vector from a string representation of a value.
Contributor

nit: a constant vector of size 1

Collaborator Author

Thanks,


/// Creates a constant vector from a string representation of a value.
///
/// This function is primarily used to materialize partition column values and
Contributor

This function is primarily used

Drop "This function is primarily". Start with an active verb: Used to ...

Collaborator Author

Thanks.

/// Creates a constant vector from a string representation of a value.
///
/// This function is primarily used to materialize partition column values and
/// info columns (e.g., $path, $file_size) when reading Hive/Iceberg tables.
Contributor

Hive/Iceberg -> Hive and Iceberg

Collaborator Author

Thanks.

/// the same way as CAST(x as VARCHAR). Date values must be formatted using ISO
/// 8601 as YYYY-MM-DD. If nullopt, creates a null constant vector.
/// @param pool Memory pool for allocating the constant vector.
/// @param asLocalTime If true and type is TIMESTAMP, interprets the string
Contributor

Love this documentation. Detailed and clear. Thank you for taking the time to write it.

Collaborator Author

Cheers.
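
The documentation excerpt above requires date values in ISO 8601 form (YYYY-MM-DD), and the function also takes an `isPartitionDateDaysSinceEpoch` flag. Assuming, from the flag's name only, that a date partition value may instead arrive as a count of days since 1970-01-01, the relationship between the two forms can be sketched as below. `daysSinceEpochToIsoDate` is a hypothetical helper, not part of Velox, and uses the well-known civil-from-days calendar arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Converts days since 1970-01-01 to "YYYY-MM-DD" in the proleptic Gregorian
// calendar. Hypothetical helper for illustration; supports years 0 to 9999.
std::string daysSinceEpochToIsoDate(int64_t days) {
  // Shift the epoch to 0000-03-01 so that leap days fall at the end of a year.
  const int64_t z = days + 719468;
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
  const int64_t dayOfEra = z - era * 146097;
  const int64_t yearOfEra =
      (dayOfEra - dayOfEra / 1460 + dayOfEra / 36524 - dayOfEra / 146096) / 365;
  int64_t year = yearOfEra + era * 400;
  const int64_t dayOfYear =
      dayOfEra - (365 * yearOfEra + yearOfEra / 4 - yearOfEra / 100);
  const int64_t shiftedMonth = (5 * dayOfYear + 2) / 153;
  const int64_t day = dayOfYear - (153 * shiftedMonth + 2) / 5 + 1;
  const int64_t month = shiftedMonth < 10 ? shiftedMonth + 3 : shiftedMonth - 9;
  if (month <= 2) {
    ++year; // January and February belong to the next civil year.
  }
  assert(year >= 0 && year <= 9999);
  char buffer[32];
  std::snprintf(
      buffer,
      sizeof(buffer),
      "%04d-%02d-%02d",
      static_cast<int>(year),
      static_cast<int>(month),
      static_cast<int>(day));
  return std::string(buffer);
}
```

For example, day 0 corresponds to 1970-01-01 and day 17627 corresponds to 2018-04-06, the date used in the test queries in this thread.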


HiveConnectorTestBase::assertQuery(
auto plan =
PlanBuilder(pool_.get()).tableScan(rowType, {}, "", nullptr).planNode();
Contributor

drop pool_.get() argument; ditto other places

Collaborator Author

@mbasmanova Thanks.
I removed most of them; there is one place left where the pool cannot be removed.

  // Test filter on non-partitioned date column
  std::vector<std::string> filters = {"ds = date'2018-04-06'"};
  plan = PlanBuilder(pool_.get()).tableScan(rowType, filters).planNode();

When I removed it, the test case crashed at

pool->preferredSize(checkedPlus<size_t>(size, kPaddedSize));
because the pool is a null pointer.

But after reading this case carefully, I don't think it makes sense.
For Iceberg:

  1. There is no difference between reading a partitioned table and a non-partitioned table.
  2. Even if we want to test reading a partitioned table, we should first write a partitioned Iceberg table, but this case does not write a partitioned table at all.

So this case can probably be deleted. What do you think?

Contributor

I'm not sure I follow.

expectedVectors.push_back(makeRowVector(
{c0,
makeNullConstant(TypeKind::INTEGER, 3),
makeNullConstant(TypeKind::VARCHAR, 3)}));
Contributor

add a comma after the last vector to improve readability

apply the same refactoring to other calls to makeRowVector

Collaborator Author

Thanks.
There were places where a vector is used twice, so I had kept the local variable.
I have now removed all the local vector variables and get the vectors from the rowVector where they are used a second time.

Contributor

@mbasmanova mbasmanova left a comment

Looks great. Thank you for iterating.

@PingLiuPing PingLiuPing force-pushed the lp_fix_iceberg_read_decimal branch from 7ffa060 to ced9ba5 Compare October 23, 2025 13:49
const std::optional<std::string>& value,
velox::memory::MemoryPool* pool,
bool asLocalTime,
bool isPartitionDateDaysSinceEpoch) {
Contributor

Let's rename these arguments as well.


template <TypeKind kind>
VectorPtr newConstantFromString(
VectorPtr newConstantFromStringImpl(
Contributor

Let's rename arguments and remove default value.

Collaborator Author

Sure. I missed this.

Contributor

@mbasmanova mbasmanova left a comment

Thanks.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Oct 23, 2025
@meta-codesync

meta-codesync bot commented Oct 23, 2025

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this in D85344487.

@PingLiuPing PingLiuPing force-pushed the lp_fix_iceberg_read_decimal branch from ced9ba5 to 333e99e Compare October 23, 2025 15:13
}
} // namespace

VectorPtr newConstantFromString(
Contributor

Can we move this into the anonymous namespace?

Contributor

@Yuhta This is a public API now.

@meta-codesync meta-codesync bot closed this in d0f6a24 Oct 23, 2025
@meta-codesync

meta-codesync bot commented Oct 23, 2025

@mbasmanova merged this pull request in d0f6a24.

mhaseeb123 pushed a commit to mhaseeb123/velox that referenced this pull request Oct 27, 2025
…tation (facebookincubator#14516)

Summary:
Implement IcebergSplitReader::adaptColumns. Without this override, the Iceberg reader falls back to `SplitReader::adaptColumns`, which is specific to the Hive implementation. One difference between Hive and Iceberg is that the Iceberg spec requires all columns to be written to data files, whereas Hive writes only non-partition columns to data files. `SplitReader::adaptColumns` therefore contains special logic to handle this during reads, which Iceberg does not need.
Hive also populates the column data directly from the partition value. This is correct for Hive only: Iceberg supports partition transforms other than the identity transform, and for those we cannot deduce the original column value from the partition value.

Pull Request resolved: facebookincubator#14516

Reviewed By: Yuhta

Differential Revision: D85344487

Pulled By: mbasmanova

fbshipit-source-id: e0de6bdd44257a37f6727840803754639accdfd1

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall


5 participants