Core: add Jackson serialization util for Trino and Presto by jackye1995 · Pull Request #3210 · apache/iceberg

jackye1995 · 2021-10-01T04:15:14Z

This is related to adding a MoR delete reader in Trino: trinodb/trino#8534, and to any potential future support for Presto and Trino.

Currently there is no way to serialize FileScanTask without using Java or Kryo serialization, but Presto and Trino use Jackson for all serialization. This PR adds some util methods to make sure we can reconstruct a FileScanTask object.

The biggest usage, as the linked PR suggests, is for the DeleteFilter, which has the constructor DeleteFilter(FileScanTask task, Schema tableSchema, Schema requestedSchema). It would be a lot of code duplication for Trino to add another delete filter implementation, but on the other hand there is no way for Trino to use the Iceberg delete filter if it cannot reconstruct the file scan task and its associated delete and data files.

@losipiuk @electrum @findepi @rdblue @ChunxuTang

jackye1995 · 2021-10-01T05:03:49Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+  }
+
+  public static FileScanTask createFileScanTask(DataFile file, DeleteFile[] deletes, String schemaString,
+                                                String specString, ResidualEvaluator residuals) {


The only unknown part is the serialization of ResidualEvaluator, which requires serializing an Iceberg Expression. In theory we can convert it to a Trino TupleDomain (basically the reverse of https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java), but I have not verified whether that conversion is lossy. Maybe there is a better way to serialize expressions directly in Iceberg to a string format that can be easily deserialized back.

Eventually, we may want to have an expression parser, but I've been trying to avoid the complexity of that for a long time. We could also have a similar util class to convert expressions to/from JSON.

Does Trino actually use the residuals? If not, we could just skip serializing the residual evaluator.

rdblue · 2021-10-01T15:58:04Z

core/src/main/java/org/apache/iceberg/GenericDataFile.java

        metrics.lowerBounds(), metrics.upperBounds(), splitOffsets, null, sortOrderId, keyMetadata);
  }

+  GenericDataFile(int specId, FileContent content, String filePath, FileFormat format, PartitionData partition,


Shouldn't content be hard-coded to DATA like the constructor above?

rdblue · 2021-10-01T15:58:39Z

core/src/main/java/org/apache/iceberg/GenericDataFile.java

+                         long fileSizeInBytes, long recordCount, Map<Integer, Long> columnSizes,
+                         Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts,
+                         Map<Integer, Long> nanValueCounts, Map<Integer, ByteBuffer> lowerBounds,
+                         Map<Integer, ByteBuffer> upperBounds, List<Long> splitOffsets, int[] equalityFieldIds,


No need for equalityFieldIds for data files.

rdblue · 2021-10-01T15:59:37Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+/**
+ * Util methods to help Jackson serialization of Iceberg objects, useful in systems like Presto and Trino.
+ */
+public class JacksonSerializationUtil {


All of the other to/from JSON classes are named SomethingParser. Can we use the same pattern?

rdblue · 2021-10-01T16:00:11Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+    return new BaseFileScanTask(file, deletes, schemaString, specString, residuals);
+  }
+
+  public static DataFile createDataFile(int specId, FileContent content, String filePath, FileFormat format,


Could this use the builders instead of a direct constructor?

rdblue · 2021-10-01T17:36:58Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+        equalityFieldIds, sortOrderId, keyMetadata);
+  }
+
+  public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {


The spec is used in deserialization to convert bytes to values. Why not also use it to convert values to bytes rather than instanceof checks? Then you'd know whether a value matches the assumptions at serialization time rather than potentially deserializing incorrectly. For example, if I have a long in the partition that should be an int, serializing 8 bytes but deserializing just the first 4 is a silent error.
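The failure mode described in this comment can be reproduced with plain `java.nio`, without any Iceberg classes. The sketch below is illustrative only: the class and method names are hypothetical, and the little-endian byte order is an assumption, not something taken from the PR. The writer chooses the width from the value's runtime class, while the reader trusts the declared type.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical demo of the long-vs-int mismatch; not code from the PR. */
final class SilentTruncationDemo {

  // Writer that picks the width from the runtime class, as an instanceof-based serializer would.
  static byte[] writeByRuntimeClass(Object value) {
    if (value instanceof Integer) {
      return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt((Integer) value).array();
    } else if (value instanceof Long) {
      return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong((Long) value).array();
    }
    throw new IllegalArgumentException("Unsupported value: " + value);
  }

  // Reader that trusts the declared partition type (int) and reads only the first 4 bytes.
  static int readAsInt(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  public static void main(String[] args) {
    Object wrongType = 4_294_967_301L; // 2^32 + 5, stored as a long where an int was expected
    byte[] bytes = writeByRuntimeClass(wrongType);
    System.out.println(bytes.length);     // prints 8
    System.out.println(readAsInt(bytes)); // prints 5: no exception, just a wrong value
  }
}
```

Neither side throws, so the mismatch only surfaces later as incorrect partition data. Driving the writer from the declared type as well would turn this into an error at serialization time.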

rdblue · 2021-10-01T17:38:56Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+    return partitionData;
+  }
+
+  private static byte[] partitionValueToBytes(Object value, int pos) {


Why not use the existing Conversions.toByteBuffer and Conversions.fromByteBuffer instead? Those implement the single-value serializations defined in the spec.
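To illustrate the idea of a type-driven converter pair, here is a minimal JDK-only sketch. It is a stand-in, not the Iceberg `Conversions` class: `TypedValueCodec` and its `PrimitiveType` enum are hypothetical names, and only three types are covered. The byte layouts follow the spec's single-value serialization for these types (4-byte little-endian int, 8-byte little-endian long, UTF-8 string bytes).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a type-driven single-value codec; not the Iceberg API. */
final class TypedValueCodec {
  enum PrimitiveType { INT, LONG, STRING }

  // Encode using the declared type; a value of the wrong class fails here with ClassCastException.
  static ByteBuffer toByteBuffer(PrimitiveType type, Object value) {
    switch (type) {
      case INT:
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(0, (Integer) value);
      case LONG:
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(0, (Long) value);
      case STRING:
        return ByteBuffer.wrap(((String) value).getBytes(StandardCharsets.UTF_8));
      default:
        throw new UnsupportedOperationException("Unsupported type: " + type);
    }
  }

  // Decode using the same declared type, reading from a duplicate so the caller's buffer is untouched.
  static Object fromByteBuffer(PrimitiveType type, ByteBuffer buffer) {
    ByteBuffer view = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    switch (type) {
      case INT:
        return view.getInt();
      case LONG:
        return view.getLong();
      case STRING:
        return StandardCharsets.UTF_8.decode(view).toString();
      default:
        throw new UnsupportedOperationException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    ByteBuffer encoded = toByteBuffer(PrimitiveType.INT, 42);
    System.out.println(fromByteBuffer(PrimitiveType.INT, encoded)); // prints 42
    try {
      toByteBuffer(PrimitiveType.INT, 42L); // a long where the declared type is int
    } catch (ClassCastException e) {
      System.out.println("rejected at serialization time");
    }
  }
}
```

Because both directions are keyed on the declared type, a value that does not match that type is rejected when it is written rather than being misread later.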

rdblue · 2021-10-01T17:41:05Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+        equalityFieldIds, sortOrderId, keyMetadata);
+  }
+
+  public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {


Instead of accepting a PartitionData, you may want to consider using StructLike.

findepi · 2021-12-16T12:31:53Z

core/src/main/java/org/apache/iceberg/JacksonSerializationUtil.java

+        .collect(Collectors.toList());
+  }
+
+  public static PartitionData bytesMapToPartitionData(List<byte[]> values, PartitionSpec spec) {


This seems unused. Do I understand correctly that it's for Trino's consumption?
We should exercise this code within Iceberg. Let's have a test.

github-actions · 2024-07-28T00:14:54Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2024-08-05T00:13:52Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Core: add Jackson serialization util for Presto and Trino

f8a76f7

github-actions bot added the core label Oct 1, 2021

jackye1995 commented Oct 1, 2021

rdblue reviewed Oct 1, 2021

beinan mentioned this pull request Oct 25, 2021

presto iceberg connector error prestodb/presto#16858

Open

findepi reviewed Dec 16, 2021

github-actions bot added the stale label Jul 28, 2024

github-actions bot closed this Aug 5, 2024

jackye1995 wants to merge 1 commit into apache:main from jackye1995:trino-serde-api


3 participants
