Description
Apache Iceberg version
0.9.1 (latest release)
Please describe the bug 🐞
When adding Parquet files to an Iceberg table with Table.add_files, the operation fails if a column defined as DecimalType in the Iceberg schema is physically stored as FIXED_LEN_BYTE_ARRAY in the Parquet file while the decimal's precision is one that Iceberg's preferred Parquet mapping would store as INT32 or INT64.
The Iceberg spec prescribes this mapping on write, so the writer-side behaviour is correct. However, Table.add_files appears to apply the same restriction to the physical Parquet type of decimals in files that already exist, which seems overly strict. I believe this greatly limits the kinds of Parquet files that can be "added" to an Iceberg table this way.
Steps to Reproduce:
- Define an Iceberg table schema with a DecimalType column, for example, Decimal(10, 2).
  - Iceberg's preferred Parquet physical type for Decimal(10, 2) would be INT64.
- Create a Parquet file where the corresponding column for this Decimal(10, 2) is physically stored as FIXED_LEN_BYTE_ARRAY. The data itself is valid for Decimal(10, 2).
- Attempt to add this Parquet file to the Iceberg table using Table.add_files.
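For reference, the precision-to-physical-type mapping referred to in the steps above can be sketched as follows. This is a minimal illustration based on my reading of the Iceberg spec's Parquet mapping for decimal(P, S); the function name `preferred_parquet_physical_type` is hypothetical and is not part of PyIceberg.

```python
def preferred_parquet_physical_type(precision: int) -> str:
    """Return the Parquet physical type the Iceberg spec prefers when
    writing decimal(P, S) with the given precision P.

    Hypothetical helper for illustration only, not a PyIceberg API.
    """
    if precision < 1:
        raise ValueError(f"Decimal precision must be positive, got {precision}")
    if precision <= 9:
        # Up to 9 decimal digits fit in a signed 32-bit integer.
        return "INT32"
    if precision <= 18:
        # Up to 18 decimal digits fit in a signed 64-bit integer.
        return "INT64"
    # Larger precisions use a fixed-length byte array.
    return "FIXED_LEN_BYTE_ARRAY"


# Decimal(10, 2) from the reproduction steps falls in the INT64 range.
print(preferred_parquet_physical_type(10))
```

A file whose Decimal(10, 2) column is stored as FIXED_LEN_BYTE_ARRAY is therefore still valid Parquet, but it does not match the type this mapping yields.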
Behavior:
The Table.add_files operation fails, with the following error:
ValueError: Unexpected physical type FIXED_LEN_BYTE_ARRAY for DecimalType(10, 2) expected INT32
This indicates a mismatch between the expected physical type (e.g., INT64) and the actual physical type (FIXED_LEN_BYTE_ARRAY) found in the Parquet file for the decimal column.
Expected Behavior:
The Table.add_files operation should succeed and correctly read the decimal values from the FIXED_LEN_BYTE_ARRAY physical storage. The Iceberg reader/writer should be lenient about the physical storage format of decimals; failing that, the Table.add_files documentation should state this limitation.
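To illustrate why lenient reading looks feasible: Parquet stores a FIXED_LEN_BYTE_ARRAY decimal as the big-endian two's-complement unscaled integer, so the value can be recovered regardless of the precision. The sketch below is plain Python with a hypothetical helper name (`decode_fixed_decimal`); it is not how PyIceberg or pyarrow implement decoding, and the 5-byte width is only an assumed example length for a precision-10 column.

```python
from decimal import Decimal


def decode_fixed_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode one FIXED_LEN_BYTE_ARRAY decimal value.

    `raw` holds the unscaled value as a big-endian two's-complement integer;
    `scale` is the S in decimal(P, S). Hypothetical helper for illustration.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits without float rounding.
    return Decimal(unscaled).scaleb(-scale)


# 123.45 as Decimal(10, 2): unscaled value 12345 in an assumed 5-byte array.
raw_positive = (12345).to_bytes(5, byteorder="big", signed=True)
raw_negative = (-12345).to_bytes(5, byteorder="big", signed=True)

print(decode_fixed_decimal(raw_positive, scale=2))
print(decode_fixed_decimal(raw_negative, scale=2))
```

The same unscaled integer is what an INT32 or INT64 column would hold directly, so the two representations carry identical information for a given precision and scale.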
Environment:
- Python version: 3.12.9
- Parquet library and version: pyarrow 20.0.0
P.S. If this is just user error and I shouldn't be trying to do things this way I'd be happy to hear alternatives.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time