feat: Add missing LZ4 and ZSTD compression codec classes #25021
ZacBlanco wants to merge 1 commit into prestodb:master
Conversation
Force-pushed from 4f83cd2 to 6c8b2a9
presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java
hantangwangd left a comment:
Change looks good to me, just one nit.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java
Force-pushed from 6c8b2a9 to 97778fe
Force-pushed from 97778fe to 74a3adf
hantangwangd left a comment:
After checking the code in detail, I found that we don't actually support LZ4 for PARQUET yet (see the code here). So do you think it makes sense to allow the LZ4 configuration only once it's actually supported for PARQUET?
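One way to make a "not yet supported" combination explicit, instead of leaving a codec null, is an Optional-valued mapping that validation can check up front. This is only a hypothetical sketch (the enum and method names below are made up, not the actual Presto `HiveCompressionCodec`):

```java
import java.util.Optional;

// Hypothetical sketch of a codec-to-format mapping: a codec only exposes a
// Parquet codec name when the combination is actually supported, so
// configuration validation can reject unsupported combinations up front.
public class CompressionCodecSketch
{
    enum Codec
    {
        NONE(Optional.of("UNCOMPRESSED")),
        SNAPPY(Optional.of("SNAPPY")),
        GZIP(Optional.of("GZIP")),
        ZSTD(Optional.of("ZSTD")),
        LZ4(Optional.empty()); // not yet supported for Parquet writes

        private final Optional<String> parquetCodecName;

        Codec(Optional<String> parquetCodecName)
        {
            this.parquetCodecName = parquetCodecName;
        }

        Optional<String> getParquetCodecName()
        {
            return parquetCodecName;
        }
    }

    public static void main(String[] args)
    {
        // A supported combination yields a codec name; an unsupported one is empty
        System.out.println(Codec.ZSTD.getParquetCodecName().isPresent()); // true
        System.out.println(Codec.LZ4.getParquetCodecName().isPresent());  // false
    }
}
```

With a shape like this, a session-property validator can fail fast with a clear error message rather than letting a null codec surface later during table creation.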
Current test:

```java
assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name()));
assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001");
```

Suggested change:

```java
assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
assertQuery(session, format("SELECT * FROM %s", tableName), "select * from lineitem");
```
It seems that if we do the insertion using a session that includes a compression codec, the tests fail on PARQUET + LZ4/ZSTD.
For ZSTD, we might be trying to fetch its corresponding CompressionCodecName using a mismatched fully qualified name (see here).
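To illustrate the failure mode being described, here is a self-contained sketch of a lookup keyed by fully qualified class name. The class names below are invented for the example; this is not Parquet's actual `CompressionCodecName` code, just a model of why a mismatched fully qualified name makes such a lookup fail:

```java
import java.util.Map;

// Hypothetical model of a codec lookup keyed by fully qualified class name.
// If the configured class name differs from the one the lookup table expects,
// the lookup fails even though both names refer to "the same" codec.
public class CodecLookupSketch
{
    // Invented class name for illustration only
    private static final Map<String, String> CODEC_BY_CLASS = Map.of(
            "org.example.hadoop.ZstdCodec", "ZSTD");

    static String lookup(String className)
    {
        String codec = CODEC_BY_CLASS.get(className);
        if (codec == null) {
            throw new IllegalArgumentException("Unknown codec class: " + className);
        }
        return codec;
    }

    public static void main(String[] args)
    {
        // Matching fully qualified name: resolves
        System.out.println(lookup("org.example.hadoop.ZstdCodec"));
        try {
            // Mismatched fully qualified name: same codec, different package, lookup fails
            lookup("org.example.other.ZstdCodec");
        }
        catch (IllegalArgumentException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```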
Thanks, this is a good catch! I have fixed the writers by adding support for LZ4 and ZSTD when they are set in the session.
Seems the following statements still fail for LZ4:

```java
assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
assertQuery(format("SELECT * FROM %s", tableName), "select * from lineitem");
```
An error occurs when querying the actual Parquet data compressed with LZ4. There seems to be a problem with the airlift LZ4 compressor, as described in ParquetCompressor.getCompressor:

```java
// When using airlift LZO or LZ4 compressor, decompressing page in reader throws exception.
```
The error information is as follows:

```
java.lang.IllegalArgumentException: Invalid offset or length (8, 16782243) in array of length 60232
    at io.airlift.compress.lz4.Lz4Decompressor.verifyRange(Lz4Decompressor.java:108)
    at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:34)
    at com.facebook.presto.parquet.ParquetCompressionUtils.decompress(ParquetCompressionUtils.java:151)
    ...
```
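The numbers in that message show what the check is complaining about: the length recorded in the page (16782243) far exceeds the output buffer (60232), so the range check throws before decompression even starts. Here is a simplified sketch of that kind of bounds check (modeled on what the stack trace suggests airlift's verifyRange does, not a copy of its implementation):

```java
// Simplified sketch of a decompression-buffer range check, modeled on the
// verifyRange failure in the stack trace above. It rejects any (offset, length)
// pair that does not fit inside the destination array.
public class VerifyRangeSketch
{
    static void verifyRange(byte[] data, int offset, int length)
    {
        if (offset < 0 || length < 0 || (long) offset + length > data.length) {
            throw new IllegalArgumentException(
                    String.format("Invalid offset or length (%s, %s) in array of length %s",
                            offset, length, data.length));
        }
    }

    public static void main(String[] args)
    {
        // Reproduce the shape of the reported failure: a 60232-byte buffer
        // asked to hold 16782243 bytes starting at offset 8.
        byte[] output = new byte[60232];
        try {
            verifyRange(output, 8, 16782243);
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A claimed decompressed size that large relative to the buffer usually means the reader is misinterpreting the page, which is consistent with a codec mismatch between writer and reader rather than a corrupt file.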
Force-pushed from 74a3adf to 7624603
I don't find the
It appears that
Great question! You're correct, and both hive.compression-codec and iceberg.compression-codec are documented in the Configuration Properties topics of the Hive Connector and Iceberg Connector pages. I'm seeing three properties to discuss:

1. For the Hive and Iceberg connector-specific session properties, I agree that it would be appropriate to add them in the corresponding connector docs.
2.
3.

Revising the Hive Connector page to add a new Session Properties topic (which would imply gathering the various references to session properties to populate it) seems too large a piece of work to add to this PR. I will open a doc issue about that reorganization of the Hive Connector page. In the meantime, what do you think of adding a mention of
Doc issue created, see #25110.
I just confirmed that there is no
Sounds great to me.
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Hive and Iceberg
Force-pushed from 7624603 to c364bdc
Closing this in favor of PR #26346.
Description
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Iceberg and Hive
Motivation and Context
When users set the compression_codec session property or the *.compression-codec connector property to LZ4 or ZSTD with Parquet as the default format, table creation would fail because the codec was null inside HiveCompressionCodec in Iceberg. I couldn't find a good reason for keeping these null, so I populated the correct enum variants and added tests to ensure they work. Since this code is shared between the Iceberg and Hive connectors, I added tests for different file type and compression codec variants to ensure compatibility across all of the potential configuration combinations.
Impact
Added support for LZ4 and ZSTD when creating Iceberg tables with Parquet as the default file format.
Test Plan
Contributor checklist
Release Notes