fix: Iceberg positional delete compare base64 encoded value with raw value #16592
PingLiuPing wants to merge 1 commit into facebookincubator:main from
@@ -920,7 +922,13 @@ TEST_F(HiveIcebergTest, skipDeleteFileByPositionUpperBound) {
       makeFlatVector<int64_t>({0, 1, 2})})};
   writeToFile(deleteFilePath->getPath(), deleteVectors);

-  // upperBound "2" is the max position in the delete file.
+  // upperBound "2" is the max position in the delete file. Iceberg stores
@PingLiuPing Would you confirm that the test fails without the fix?
Yes, without the fix the test fails with:
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: Non-digit character found: "AgAAAAAAAAA="
Retriable: False
Function: operator()
"AgAAAAAAAAA=" is the Base64 encoding of the 8-byte little-endian value 2, which the old code tried to parse as a decimal string.
  uint64_t upperBound = 2;
  auto upperBoundLE = folly::Endian::little(upperBound);
  auto encodedUpperBound = encoding::Base64::encode(
      folly::StringPiece(
Can we use std::string_view?
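For readers following the test change, the encoding step can be sketched without folly or Velox. This is only an illustration under stated assumptions: `toLittleEndianBytes`, `base64Encode`, and `encodeUpperBound` are hypothetical helper names, not functions from the PR, and the hand-written encoder stands in for `encoding::Base64::encode`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: serialize a 64-bit value as 8 little-endian bytes.
// Shifting makes the result independent of the host byte order.
std::array<unsigned char, 8> toLittleEndianBytes(uint64_t value) {
  std::array<unsigned char, 8> bytes{};
  for (size_t i = 0; i < bytes.size(); ++i) {
    bytes[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
  }
  return bytes;
}

// Hypothetical helper: minimal Base64 encoder using the standard
// RFC 4648 alphabet with '=' padding.
std::string base64Encode(const unsigned char* data, size_t size) {
  static constexpr char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve(((size + 2) / 3) * 4);
  for (size_t i = 0; i < size; i += 3) {
    const size_t remaining = size - i;
    // Pack up to three input bytes into a 24-bit group.
    uint32_t chunk = static_cast<uint32_t>(data[i]) << 16;
    if (remaining > 1) {
      chunk |= static_cast<uint32_t>(data[i + 1]) << 8;
    }
    if (remaining > 2) {
      chunk |= static_cast<uint32_t>(data[i + 2]);
    }
    // Emit four 6-bit symbols, padding the tail of a short group.
    out.push_back(kAlphabet[(chunk >> 18) & 0x3F]);
    out.push_back(kAlphabet[(chunk >> 12) & 0x3F]);
    out.push_back(remaining > 1 ? kAlphabet[(chunk >> 6) & 0x3F] : '=');
    out.push_back(remaining > 2 ? kAlphabet[chunk & 0x3F] : '=');
  }
  return out;
}

// Hypothetical helper: produce the string form of a position bound the
// way the test builds it (little-endian bytes, then Base64).
std::string encodeUpperBound(uint64_t position) {
  const auto bytes = toLittleEndianBytes(position);
  return base64Encode(bytes.data(), bytes.size());
}
```

Calling `encodeUpperBound(2)` yields the same string that appears in the failure message above, which is why the test has to feed the reader an encoded bound rather than the literal "2".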
  static constexpr int32_t kPosId = 2'147'483'545;
  static constexpr int32_t kFilePathId = 2'147'483'546;
Where do these numbers come from?
They are the reserved field IDs defined in https://iceberg.apache.org/spec/#position-delete-files. I added a comment to explain the values.
/claude-review
f4e336c to c9481fe
✅ Claude Code Review
Requested by @mbasmanova

Summary
I've completed a thorough analysis of this PR fixing the Iceberg positional delete Base64 decoding issue.

Issues Found

🟡 Suggestion: Error Handling Enhancement
File:

🟡 Suggestion: Const Correctness
File:
    const uint64_t posDeleteUpperBound = folly::Endian::little(
        *reinterpret_cast<const uint64_t*>(decodedBound.data()));

🟡 Suggestion: Alignment Consideration
File:
    // Use memcpy to avoid potential alignment issues with reinterpret_cast
    std::memcpy(&posDeleteUpperBound, decodedBound.data(), sizeof(uint64_t));

🟢 Nitpick: Include Organization
File:

Positive Observations
✅ Correctness: The fix properly addresses the core issue by decoding Base64 before parsing as uint64_t, which is exactly what the Iceberg specification requires.
✅ Endianness Handling: Proper use of
✅ Memory Safety: Appropriate use of
✅ Size Validation: The
✅ Test Coverage: The test properly demonstrates the fix by encoding a test value using the same process (little-endian + Base64).
✅ Code Organization: Constants moved to header file with clear naming (
✅ Boundary Case Handling: The logic correctly handles the case where upper bound equals split offset (should continue processing).

The fix is technically sound and addresses a real bug where Base64-encoded values were being incorrectly parsed as decimal strings, which would throw exceptions. The implementation follows Velox patterns and properly handles the Iceberg specification for storing statistics as Base64-encoded little-endian binary values.

ℹ️ About this response
This review was generated by Claude Code. It analyzed the PR diff and codebase to provide feedback.
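The alignment suggestion in the review can be illustrated with a small standalone sketch. It is not the code from the PR: `hostIsLittleEndian`, `byteSwap64`, and `loadLittleEndian64` are hypothetical names, and the manual byte swap stands in for what `folly::Endian::little()` does on a big-endian host. The point is that `std::memcpy` reads the 8 bytes safely even when the decoded buffer is not 8-byte aligned, whereas dereferencing a `reinterpret_cast` pointer assumes alignment.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when the host stores the least significant
// byte at the lowest address.
inline bool hostIsLittleEndian() {
  const uint16_t probe = 1;
  unsigned char first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Hypothetical helper: reverse the byte order of a 64-bit value.
inline uint64_t byteSwap64(uint64_t v) {
  uint64_t r = 0;
  for (int i = 0; i < 8; ++i) {
    r = (r << 8) | (v & 0xFF);
    v >>= 8;
  }
  return r;
}

// Hypothetical helper: load 8 little-endian bytes from an address that
// may be unaligned and return the value in host byte order.
inline uint64_t loadLittleEndian64(const char* data) {
  uint64_t raw = 0;
  // memcpy has no alignment requirement on its source pointer.
  std::memcpy(&raw, data, sizeof(raw));
  return hostIsLittleEndian() ? raw : byteSwap64(raw);
}
```

Compilers typically lower a fixed-size `memcpy` like this to a single load, so the safer form usually costs nothing at run time.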
c9481fe to d9ae324
czentgr left a comment
Thanks for the quick fix.
When reading Iceberg positional delete files, the IcebergSplitReader optimizes away irrelevant delete files by comparing the delete file's upper_bounds statistic against the current split's starting row offset.
This PR fixes a bug in that logic. Iceberg stores column statistics (lower_bounds / upper_bounds) in manifests as Base64-encoded, little-endian binary values. For a long column (such as the positional delete pos column), the value is an 8-byte little-endian integer, Base64-encoded.

Previously, the reader passed the raw Base64 string directly to folly::to<uint64_t>(), which attempts to parse it as a decimal string. For example, the value CgAAAAAAAAA= should decode to the integer 10, but folly::to<uint64_t>("CgAAAAAAAAA=") throws an exception.

This PR fixes this by first Base64-decoding the upper bound, then reinterpreting the bytes as uint64_t, and then applying folly::Endian::little() to convert from the little-endian encoding specified by the Iceberg spec to host byte order before comparing against splitOffset_.

This fixes prestodb/presto#27237.
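The decode path described above can be sketched in standalone C++ as follows. This is a minimal illustration, not the Velox implementation: `base64Value`, `base64Decode`, `decodePositionBound`, and `canSkipDeleteFile` are hypothetical names, the hand-written decoder stands in for Velox's Base64 utilities, and the skip condition (strictly less than the split offset) is an assumption based on the review note that an upper bound equal to the split offset must still be processed.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Map one Base64 character to its 6-bit value, or -1 if it is invalid.
int base64Value(char c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// Hypothetical minimal decoder for padded, standard-alphabet Base64.
std::optional<std::string> base64Decode(std::string_view in) {
  if (in.size() % 4 != 0) return std::nullopt;
  std::string out;
  for (size_t i = 0; i < in.size(); i += 4) {
    uint32_t chunk = 0;
    int pad = 0;
    for (size_t j = 0; j < 4; ++j) {
      const char c = in[i + j];
      if (c == '=') {
        // Padding may only occupy the last two slots of the final group.
        if (i + 4 != in.size() || j < 2) return std::nullopt;
        ++pad;
        chunk <<= 6;
      } else {
        if (pad > 0) return std::nullopt;
        const int v = base64Value(c);
        if (v < 0) return std::nullopt;
        chunk = (chunk << 6) | static_cast<uint32_t>(v);
      }
    }
    out.push_back(static_cast<char>((chunk >> 16) & 0xFF));
    if (pad < 2) out.push_back(static_cast<char>((chunk >> 8) & 0xFF));
    if (pad < 1) out.push_back(static_cast<char>(chunk & 0xFF));
  }
  return out;
}

// Decode a Base64 bound into a host-order uint64_t. Returns nullopt when
// the input is not valid Base64 or does not decode to exactly 8 bytes.
std::optional<uint64_t> decodePositionBound(std::string_view encoded) {
  const auto bytes = base64Decode(encoded);
  if (!bytes || bytes->size() != sizeof(uint64_t)) return std::nullopt;
  uint64_t value = 0;
  // Assemble byte by byte: byte i is the i-th least significant byte.
  for (size_t i = 0; i < sizeof(uint64_t); ++i) {
    const auto byte = static_cast<unsigned char>((*bytes)[i]);
    value |= static_cast<uint64_t>(byte) << (8 * i);
  }
  return value;
}

// Assumed skip rule: a delete file is irrelevant only if its largest
// deleted position lies strictly before the split's first row.
bool canSkipDeleteFile(std::string_view encodedUpperBound, uint64_t splitOffset) {
  const auto upperBound = decodePositionBound(encodedUpperBound);
  return upperBound.has_value() && *upperBound < splitOffset;
}
```

In this sketch an undecodable bound never causes a skip, so the delete file is still read; that conservative choice is a design assumption here and may differ from how the PR reports malformed statistics.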