Skip to content

Core: add Jackson serialization util for Trino and Presto#3210

Closed
jackye1995 wants to merge 1 commit intoapache:mainfrom
jackye1995:trino-serde-api
Closed

Core: add Jackson serialization util for Trino and Presto#3210
jackye1995 wants to merge 1 commit intoapache:mainfrom
jackye1995:trino-serde-api

Conversation

@jackye1995
Copy link
Copy Markdown
Contributor

This is related to adding MoR delete reader in Trino: trinodb/trino#8534, and any potential future support for Presto and Trino.

Currently there is no way to serialize FileScanTask without using Java or Kryo serialization. But Presto and Trino uses Jackson for all serializations. This PR adds some util methods to make sure we can reconstruct a FileScanTask object.

The biggest usage, as the linked PR suggests, is for the DeleteFilter that has constructor DeleteFilter(FileScanTask task, Schema tableSchema, Schema requestedSchema). It would be a lot of code duplication for Trino to add another delete filter implementation, but on the other side there is no way for Trino to use the Iceberg delete filter if it cannot reconstruct the file scan task and associated delete and data files.

@losipiuk @electrum @findepi @rdblue @ChunxuTang

@github-actions github-actions bot added the core label Oct 1, 2021
}

public static FileScanTask createFileScanTask(DataFile file, DeleteFile[] deletes, String schemaString,
String specString, ResidualEvaluator residuals) {
Copy link
Copy Markdown
Contributor Author

@jackye1995 jackye1995 Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only unknown part is the serialization of ResidualEvaluator, which requires serialization of Iceberg Expression. In theory we can convert it to Trino TupleDomain (basically the reverse of https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java), but I have not verified if that is lossy or not, maybe there is a better way to serialize expressions directly in Iceberg to a string format that can be easily deserialized back.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, we may want to have an expression parser, but I've been trying to avoid the complexity of that for a long time. We could also have a similar util class to convert expressions to/from JSON.

Does Trino actually use the residuals? If not, we could just skip serializing the residual evaluator.

metrics.lowerBounds(), metrics.upperBounds(), splitOffsets, null, sortOrderId, keyMetadata);
}

GenericDataFile(int specId, FileContent content, String filePath, FileFormat format, PartitionData partition,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't content be hard-coded to DATA like the constructor above?

long fileSizeInBytes, long recordCount, Map<Integer, Long> columnSizes,
Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts,
Map<Integer, Long> nanValueCounts, Map<Integer, ByteBuffer> lowerBounds,
Map<Integer, ByteBuffer> upperBounds, List<Long> splitOffsets, int[] equalityFieldIds,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for equalityFieldIds for data files.

/**
* Util methods to help Jackson serialization of Iceberg objects, useful in systems like Presto and Trino.
*/
public class JacksonSerializationUtil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the other to/from JSON classes are named SomethingParser. Can we use the same pattern?

return new BaseFileScanTask(file, deletes, schemaString, specString, residuals);
}

public static DataFile createDataFile(int specId, FileContent content, String filePath, FileFormat format,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this use the builders instead of a direct construtor?

equalityFieldIds, sortOrderId, keyMetadata);
}

public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec is used in deserialization to convert bytes to values. Why not also use it to convert values to bytes rather than instanceof checks? Then you'd know whether a value matches the assumptions at serialization time rather than potentially deserializing incorrectly. For example, if I have an long in the partition that should be an int serializing 8 bytes but deserializing just the first 4 is a silent error.

return partitionData;
}

private static byte[] partitionValueToBytes(Object value, int pos) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the existing Conversions.toByteBuffer and Conversions.fromByteBuffer instead? Those implement the single-value serializations defined in the spec.

equalityFieldIds, sortOrderId, keyMetadata);
}

public static List<byte[]> partitionDataToBytesMap(PartitionData partition) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of accepting a PartitionData, you may want to consider using StructLike.

.collect(Collectors.toList());
}

public static PartitionData bytesMapToPartitionData(List<byte[]> values, PartitionSpec spec) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unused. Do i understand correctly, it's for Trino's consumption?
We should exercise this code within Iceberg. Let's have a test.

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 28, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 5, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants