Core: add Jackson serialization util for Trino and Presto#3210
jackye1995 wants to merge 1 commit into apache:main from jackye1995:trino-serde-api
Conversation
```java
public static FileScanTask createFileScanTask(DataFile file, DeleteFile[] deletes, String schemaString,
    String specString, ResidualEvaluator residuals) {
```
The only unknown part is the serialization of `ResidualEvaluator`, which requires serialization of an Iceberg `Expression`. In theory we can convert it to a Trino `TupleDomain` (basically the reverse of https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java), but I have not verified whether that conversion is lossy. Maybe there is a better way to serialize expressions directly in Iceberg to a string format that can be easily deserialized back.
Eventually, we may want to have an expression parser, but I've been trying to avoid the complexity of that for a long time. We could also have a similar util class to convert expressions to/from JSON.
Does Trino actually use the residuals? If not, we could just skip serializing the residual evaluator.
```java
    metrics.lowerBounds(), metrics.upperBounds(), splitOffsets, null, sortOrderId, keyMetadata);
}

GenericDataFile(int specId, FileContent content, String filePath, FileFormat format, PartitionData partition,
```
Shouldn't `content` be hard-coded to `DATA` like in the constructor above?
```java
    long fileSizeInBytes, long recordCount, Map<Integer, Long> columnSizes,
    Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts,
    Map<Integer, Long> nanValueCounts, Map<Integer, ByteBuffer> lowerBounds,
    Map<Integer, ByteBuffer> upperBounds, List<Long> splitOffsets, int[] equalityFieldIds,
```
No need for `equalityFieldIds` for data files.
```java
/**
 * Util methods to help Jackson serialization of Iceberg objects, useful in systems like Presto and Trino.
 */
public class JacksonSerializationUtil {
```
All of the other to/from JSON classes are named `SomethingParser`. Can we use the same pattern?
```java
    return new BaseFileScanTask(file, deletes, schemaString, specString, residuals);
}

public static DataFile createDataFile(int specId, FileContent content, String filePath, FileFormat format,
```
Could this use the builders instead of a direct constructor?
```java
        equalityFieldIds, sortOrderId, keyMetadata);
}

public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {
```
The spec is used in deserialization to convert bytes to values. Why not also use it to convert values to bytes, rather than relying on `instanceof` checks? Then you'd know whether a value matches the assumptions at serialization time rather than potentially deserializing incorrectly. For example, if I have a `long` in the partition that should be an `int`, serializing 8 bytes but deserializing just the first 4 is a silent error.
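To illustrate the failure mode described above, here is a minimal JDK-only sketch (not Iceberg code; the byte layout just assumes the spec's little-endian single-value encoding): a `long` written as 8 bytes, then read back as if the field were an `int`, silently yields the wrong value with no exception.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TruncationDemo {
    public static void main(String[] args) {
        long value = (1L << 32) + 7;  // does not fit in an int

        // instanceof-style serialization: the writer sees a Long, so it emits 8 bytes
        byte[] bytes = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();

        // spec-driven deserialization: the spec says this field is an int, so only 4 bytes are read
        int readBack = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();

        // No exception is thrown; the value is just silently wrong.
        System.out.println(value + " -> " + readBack);  // prints: 4294967303 -> 7
    }
}
```

Checking the value against the spec at serialization time would turn this into a loud error at write time instead.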
```java
    return partitionData;
}

private static byte[] partitionValueToBytes(Object value, int pos) {
```
Why not use the existing `Conversions.toByteBuffer` and `Conversions.fromByteBuffer` instead? Those implement the single-value serializations defined in the spec.
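A simplified sketch of the spec-driven approach being suggested (real code should use Iceberg's `Conversions`; the `TypeId` enum here is only a stand-in for Iceberg's type model): conversion is dispatched on the declared field type, so a value that doesn't match the spec fails fast at serialization time instead of being mis-read later.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SpecDrivenConversion {
    // Stand-in for Iceberg's type model; real code would dispatch on Types via Conversions.
    enum TypeId { INT, LONG }

    // Convert a value to bytes driven by the declared type, failing fast on a mismatch
    // instead of silently trusting the value's runtime class.
    static byte[] toBytes(TypeId type, Object value) {
        switch (type) {
            case INT:
                if (!(value instanceof Integer)) {
                    throw new IllegalArgumentException("Expected Integer for INT field, got " + value.getClass());
                }
                return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt((Integer) value).array();
            case LONG:
                if (!(value instanceof Long)) {
                    throw new IllegalArgumentException("Expected Long for LONG field, got " + value.getClass());
                }
                return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong((Long) value).array();
            default:
                throw new UnsupportedOperationException("Unsupported type: " + type);
        }
    }

    static Object fromBytes(TypeId type, byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        switch (type) {
            case INT:  return buf.getInt();
            case LONG: return buf.getLong();
            default:   throw new UnsupportedOperationException("Unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(TypeId.LONG, 42L);
        System.out.println(fromBytes(TypeId.LONG, bytes));  // prints: 42
        // toBytes(TypeId.INT, 42L) would throw here, at serialization time, not mis-read later.
    }
}
```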
```java
        equalityFieldIds, sortOrderId, keyMetadata);
}

public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {
```
Instead of accepting a `PartitionData`, you may want to consider using `StructLike`.
```java
        .collect(Collectors.toList());
}

public static PartitionData bytesMapToPartitionData(List<byte[]> values, PartitionSpec spec) {
```
This seems unused. Do I understand correctly that it's for Trino's consumption?
We should exercise this code within Iceberg. Let's have a test.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This is related to adding the MoR delete reader in Trino (trinodb/trino#8534), and any potential future support for Presto and Trino.

Currently there is no way to serialize `FileScanTask` without using Java or Kryo serialization, but Presto and Trino use Jackson for all serialization. This PR adds some util methods to make sure we can reconstruct a `FileScanTask` object.

The biggest usage, as the linked PR suggests, is for the `DeleteFilter` that has constructor `DeleteFilter(FileScanTask task, Schema tableSchema, Schema requestedSchema)`. It would be a lot of code duplication for Trino to add another delete filter implementation, but on the other hand there is no way for Trino to use the Iceberg delete filter if it cannot reconstruct the file scan task and the associated delete and data files.

@losipiuk @electrum @findepi @rdblue @ChunxuTang