core: add JSON parser for ContentFile and FileScanTask #6934
/**
 * Return the schema for this file scan task.
 */
default Schema schema() {
This is needed so that FileScanTaskParser (added in this PR) can serialize the schema. During deserialization, the schema can then be passed into the constructor of BaseFileScanTask.
Keep it at this level (not the base ContentScanTask interface or lower) to limit the scope of the change.
return file;
}

protected Schema schema() {
Exposed as protected so that BaseFileScanTask can use it to implement the FileScanTask#schema() method.
It's a little odd that we reverse-engineer the schema from the string here, but it seems like the most backwards-compatible thing we can do.
Agree it is a little odd. On the other hand, the partition spec follows the same model in this class. As you said, otherwise we would have to change the constructors of a bunch of classes. The current choice of passing the schema and spec as strings is what makes those scan tasks serializable.
@Override
public PartitionSpec spec() {
  if (spec == null) {
    synchronized (this) {
      if (spec == null) {
        this.spec = PartitionSpecParser.fromJson(schema(), specString);
      }
    }
  }
  return spec;
}
cc @nastra
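The same lazily-parsed, double-checked pattern shown above for the spec applies to the schema string as well. A minimal standalone sketch of the pattern (the parser call is stubbed out so the snippet runs on its own; the names are illustrative, not the actual Iceberg classes):

```java
// Illustrative sketch of lazy double-checked JSON parsing, mirroring spec().
// parseSchema() is a stub standing in for SchemaParser.fromJson(...).
public class LazySchemaHolder {
  private final String schemaString;
  private volatile Object schema = null; // would be org.apache.iceberg.Schema

  public LazySchemaHolder(String schemaString) {
    this.schemaString = schemaString;
  }

  public Object schema() {
    if (schema == null) {
      synchronized (this) {
        if (schema == null) {
          this.schema = parseSchema(schemaString); // parsed at most once
        }
      }
    }
    return schema;
  }

  private static Object parseSchema(String json) {
    // stub: a real implementation would parse the JSON into a Schema object
    return "parsed:" + json;
  }
}
```

Note the volatile field: without it, double-checked locking is not safe under the Java memory model.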
import org.apache.iceberg.util.ArrayUtil;
import org.apache.iceberg.util.JsonUtil;

class ContentFileParser {
Since DataFile and DeleteFile have the same structure, this is called ContentFileParser without any generic type.
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
private Integer sortOrderId = SortOrder.unsorted().orderId();
Relocated this line so it follows the same definition order as the other fields.
private Map<Integer, ByteBuffer> upperBounds = null;
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
Added a setter for equalityFieldIds so that the parser unit test can cover this field too.
private final PartitionSpec spec;

ContentFileParser(PartitionSpec spec) {
Unlike the other JSON parsers, which use a static singleton pattern, ContentFileParser depends on the partition spec, so it is a regular class with a constructor.
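To illustrate the design point with a standalone sketch (all names here are hypothetical stand-ins, not the real Iceberg code): a stateless parser can be a static utility, while a parser that needs the partition spec captures it in a constructor and reuses it for every file.

```java
// Hypothetical contrast between a stateless static-style parser and a
// stateful instance parser; "Spec" stands in for PartitionSpec.
public class ParserStyles {
  static class Spec {
    final int specId;
    Spec(int specId) { this.specId = specId; }
  }

  // Static-utility style: no per-instance state, so no instances are needed.
  static String specIdToJson(int specId) {
    return "{\"spec-id\":" + specId + "}";
  }

  // Instance style: the spec is captured once and reused for every file.
  static class ContentFileParserSketch {
    private final Spec spec;

    ContentFileParserSketch(Spec spec) {
      this.spec = spec;
    }

    String toJson(String filePath) {
      return "{\"spec-id\":" + spec.specId + ",\"file-path\":\"" + filePath + "\"}";
    }
  }
}
```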
nastra left a comment:
Did a high-level pass over the parsers themselves and left a few comments. I haven't had a chance to take a closer look at the tests yet.
stevenzwu left a comment:
@nastra thx a lot for the initial review. I addressed the comments in the latest commit
nastra left a comment:
Sorry for the late re-review @stevenzwu, I've left a few more comments.
nastra left a comment:
I've been mainly focusing on the JSON parsers and left a few comments, but overall this looks almost ready. It would be great to get some additional input from another reviewer.
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

public class TestContentFileParser {
I think it would be good to also add a test with a plain JSON string to see what the full JSON looks like. And then maybe also another test with a plain JSON string where all optional fields (metrics, equality field ids, sort order id, split offsets, ...) are missing.
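The suggested test could look roughly like the following standalone sketch, where the serializer is a stub standing in for the real parser under test; the point is pinning the full JSON down as a literal string so any format change becomes visible in review.

```java
// Sketch of an "expected JSON string" test: serialize a minimal object and
// compare against a hand-written literal, so any format change is visible.
// toJson() is a stub; a real test would call the parser under test.
public class ExpectedJsonSketch {
  static String toJson(String path, long recordCount) {
    return "{\"file-path\":\"" + path + "\",\"record-count\":" + recordCount + "}";
  }
}
```

A second test would do the same with the optional fields absent, asserting that they are omitted from the JSON rather than written as nulls.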
JsonNode pNode = node.get(property);
Preconditions.checkArgument(
    pNode.isTextual(), "Cannot parse from non-text value: %s: %s", property, pNode);
Nit: maybe we should mention that we're trying to parse this from text to a binary representation.
I also fixed a couple of other error messages with the same problem.
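For illustration, a standalone sketch of the kind of message being suggested, assuming (purely for this example) a hex encoding and hypothetical helper names rather than the actual JsonUtil code:

```java
import java.nio.ByteBuffer;

// Hypothetical helper: decode a hex string into a ByteBuffer, with an error
// message that says we are parsing text into a binary value.
public class BinaryFromJson {
  static ByteBuffer hexToByteBuffer(String property, String value) {
    if (value == null || value.length() % 2 != 0 || !value.matches("[0-9a-fA-F]*")) {
      throw new IllegalArgumentException(
          String.format("Cannot parse binary value from non-hex text: %s: %s", property, value));
    }
    byte[] bytes = new byte[value.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      // decode each pair of hex digits into one byte
      bytes[i] = (byte) Integer.parseInt(value.substring(2 * i, 2 * i + 2), 16);
    }
    return ByteBuffer.wrap(bytes);
  }
}
```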
Spark CI build failed with what appears to be an environment problem.
Merging after rebase.
@stevenzwu we are seeing a Trino OOM issue during scan planning, and it might be because this PR introduced the table schema into each file scan task. The issue happens in conjunction with a very wide table schema and ParallelIterable usage. I wonder what your thoughts are on this? One possible approach is to store the schema id instead of the actual schema to save memory. LMK if this is the right place to discuss or we can move somewhere else.
@puchengy please create a new issue to track and discuss this problem. I agree about the overhead of serializing the schema for each scan task. If we were to serialize just the schema id, the deserializer would need to get hold of the schemas, which would require major refactoring of the call stack to pass them in. At the time, we opted for the simpler approach, but we can discuss the alternative. Does Trino use the JSON parser for file scan tasks?
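The trade-off under discussion can be sketched as follows (all names hypothetical, not a proposed API): serializing only the schema id keeps each task's JSON small, but deserialization then needs an id-to-schema lookup that the current parser call sites do not have.

```java
import java.util.Map;

// Hypothetical sketch of schema-id-based (de)serialization: the task JSON
// carries only the id, and resolving it needs a schemas-by-id lookup that
// would have to be threaded through the call stack.
public class SchemaIdSketch {
  static String taskToJson(int schemaId, String filePath) {
    return "{\"schema-id\":" + schemaId + ",\"file-path\":\"" + filePath + "\"}";
  }

  // Deserialization must be handed the id-to-schema map (e.g. from table
  // metadata), which is the refactoring cost mentioned in the comment above.
  static String resolveSchema(int schemaId, Map<Integer, String> schemasById) {
    String schema = schemasById.get(schemaId);
    if (schema == null) {
      throw new IllegalArgumentException("Unknown schema id: " + schemaId);
    }
    return schema;
  }
}
```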
This closes issue #1698, which describes the two motivations for this change.