Conversation

@stevenzwu (Contributor, PR author) commented on Feb 24, 2023:

This closes issue #1698.

There are two motivations, as described in issue #1698 (a brief usage sketch follows the list):

  1. provide a more stable serialization format (than Java serialization) for Flink checkpoints
  2. allow use by the REST catalog for scan planning or committing files
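A minimal sketch of the intended round trip, assuming the parser exposes static toJson(FileScanTask) and fromJson(String, boolean caseSensitive) entry points (the exact signatures are an assumption here):

  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.FileScanTaskParser;

  class FileScanTaskJsonRoundTrip {
    // Serialize a planned task into JSON, e.g. for Flink checkpoint state
    // or for shipping scan-planning results through a REST catalog.
    static String serialize(FileScanTask task) {
      return FileScanTaskParser.toJson(task);
    }

    // Restore the task from JSON; 'true' requests case-sensitive column matching.
    static FileScanTask deserialize(String json) {
      return FileScanTaskParser.fromJson(json, true);
    }
  }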

  /**
   * Return the schema for this file scan task.
   */
  default Schema schema() {
@stevenzwu (author) commented on Feb 24, 2023:

This is needed so that FileScanTaskParser (added in this PR) can serialize the schema. During deserialization, the schema can then be passed into the constructor of BaseFileScanTask.

Keep it at this level (not the base ContentScanTask interface or lower) to limit the scope of the change.
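For illustration, a minimal sketch of that flow (the class name is hypothetical; this is not the parser's actual code):

  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Schema;
  import org.apache.iceberg.SchemaParser;

  class SchemaRoundTripSketch {
    // Serialization side: the new accessor makes the schema reachable from the task.
    static String writeSchema(FileScanTask task) {
      return SchemaParser.toJson(task.schema());
    }

    // Deserialization side: the parsed schema can then be handed to the
    // reconstructed BaseFileScanTask (constructor call omitted here).
    static Schema readSchema(String schemaJson) {
      return SchemaParser.fromJson(schemaJson);
    }
  }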

    return file;
  }

  protected Schema schema() {
@stevenzwu (author):

Exposed as protected so that BaseFileScanTask can use it to implement the FileScanTask#schema() method.

A project member:

It's a little odd that we reverse-engineer the schema from the string here, but it seems like the most backwards-compatible thing we can do.

@stevenzwu (author):

Agreed, it is a little odd. On the other hand, the partition spec follows the same model in this class. As you said, otherwise we would have to change the constructors of a bunch of classes. The current choice of passing the schema and spec as strings is what makes those scan tasks serializable.

  @Override
  public PartitionSpec spec() {
    if (spec == null) {
      synchronized (this) {
        if (spec == null) {
          this.spec = PartitionSpecParser.fromJson(schema(), specString);
        }
      }
    }
    return spec;
  }

cc @nastra
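For reference, a sketch of the analogous lazy schema parsing, mirroring the spec() pattern above (the class and field names here are illustrative, not necessarily the exact ones in this PR):

  import org.apache.iceberg.Schema;
  import org.apache.iceberg.SchemaParser;

  class LazySchemaSketch {
    private final String schemaString;               // carried in serialized form, like specString
    private transient volatile Schema schema = null;

    LazySchemaSketch(String schemaString) {
      this.schemaString = schemaString;
    }

    // Same double-checked-locking shape as spec() above; the volatile field keeps it safe.
    protected Schema schema() {
      if (schema == null) {
        synchronized (this) {
          if (schema == null) {
            this.schema = SchemaParser.fromJson(schemaString);
          }
        }
      }
      return schema;
    }
  }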

import org.apache.iceberg.util.ArrayUtil;
import org.apache.iceberg.util.JsonUtil;

class ContentFileParser {
@stevenzwu (author):

Since DataFile and DeleteFile have the same structure, this is called ContentFileParser, without any generic type parameter.
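A small sketch of why no type parameter is needed: both file types implement ContentFile, so a single wildcard signature covers both (the describe method below is purely illustrative):

  import org.apache.iceberg.ContentFile;
  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.DeleteFile;

  class ContentFileWildcardSketch {
    // One non-generic entry point serves data files and delete files alike.
    static String describe(ContentFile<?> file) {
      return file.content() + ": " + file.path();
    }

    static void example(DataFile dataFile, DeleteFile deleteFile) {
      describe(dataFile);    // works for data files
      describe(deleteFile);  // and for delete files
    }
  }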

private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
private Integer sortOrderId = SortOrder.unsorted().orderId();
@stevenzwu (author):

Relocated this line here to follow the same order as the field definitions.

private Map<Integer, ByteBuffer> upperBounds = null;
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
@stevenzwu (author):

Added a setter for equalityFieldIds so that the parser unit test can cover this field too.


private final PartitionSpec spec;

ContentFileParser(PartitionSpec spec) {
@stevenzwu (author):

Unlike the other JSON parsers, which follow a static singleton pattern, ContentFileParser depends on the partition spec. Hence it is a regular class with a constructor.
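A minimal usage sketch under that design, assuming a toJson instance method (the method name is an assumption for illustration):

  // Sketch placed in the parser's package, since ContentFileParser is package-private.
  package org.apache.iceberg;

  class PerSpecParserSketch {
    // One parser instance per partition spec, instead of a shared static singleton.
    static String serialize(PartitionSpec spec, DataFile dataFile) {
      ContentFileParser parser = new ContentFileParser(spec);
      return parser.toJson(dataFile);    // toJson is an assumed method name
    }
  }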

@nastra (Contributor) left a comment:

I did a high-level pass over the parsers themselves and left a few comments. I haven't had a chance to look more closely at the tests yet.

@stevenzwu (author) left a comment:

@nastra thanks a lot for the initial review. I addressed the comments in the latest commit.

@stevenzwu force-pushed the issue-1698-split-json branch from a8062a7 to 4d57100 on April 5, 2023, 02:41
@nastra (Contributor) left a comment:

Sorry for the late re-review, @stevenzwu. I've left a few more comments.

@nastra (Contributor) left a comment:

I've been mainly focusing on the JSON parsers and left a few comments, but overall this looks almost ready. It would be great to get some additional input from another reviewer.

import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

public class TestContentFileParser {
A reviewer (Contributor):

I think it would be good to also add a test with a plain JSON string, to see what the full JSON looks like. And then maybe also another test with a plain JSON string where all the optional fields (metrics, equality field ids, sort order id, split offsets, ...) are missing.
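A sketch in the spirit of that suggestion, assuming toJson/fromJson methods on the parser (the method names are assumptions; the eventual test would also pin an explicit JSON literal once the wire format is settled):

  // Placed in org.apache.iceberg, next to the package-private parser.
  package org.apache.iceberg;

  import static org.assertj.core.api.Assertions.assertThat;

  import org.junit.jupiter.api.Test;

  public class TestContentFileParserPlainJson {

    @Test
    public void testDataFileWithRequiredFieldsOnly() {
      PartitionSpec spec = PartitionSpec.unpartitioned();
      DataFile dataFile =
          DataFiles.builder(spec)
              .withPath("/path/to/data-file.parquet") // format inferred from the extension
              .withFileSizeInBytes(10L)
              .withRecordCount(1L)
              .build();

      ContentFileParser parser = new ContentFileParser(spec);
      String json = parser.toJson(dataFile);               // assumed method name

      // With only the required fields set, the optional ones (metrics, equality
      // field ids, sort order id, split offsets, ...) should be absent from json.
      DataFile parsed = (DataFile) parser.fromJson(json);  // assumed method name
      assertThat(parsed.path()).isEqualTo(dataFile.path());
      assertThat(parsed.fileSizeInBytes()).isEqualTo(dataFile.fileSizeInBytes());
      assertThat(parsed.recordCount()).isEqualTo(dataFile.recordCount());
    }
  }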


JsonNode pNode = node.get(property);
Preconditions.checkArgument(
    pNode.isTextual(), "Cannot parse from non-text value: %s: %s", property, pNode);
A reviewer (Contributor):

nit: maybe we should mention that we're trying to parse this from text to a binary representation

@stevenzwu (author):

I also fixed a couple of other error messages with the same problem.

@stevenzwu (author) commented:

The Spark CI build failed with what appears to be an environment problem:

        Caused by:
        java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
            at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:724)
            at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:654)
            at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:586)
            at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:548)
            at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:174)
            at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:129)
            at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
            at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:293)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:492)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:352)
            at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:71)
            at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:70)
            at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224)
            at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
            at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)

@stevenzwu force-pushed the issue-1698-split-json branch from 61e40a7 to 0016f36 on June 24, 2023, 03:01
@stevenzwu force-pushed the issue-1698-split-json branch from a465d34 to 8105811 on June 25, 2023, 14:35
@stevenzwu (author) commented:

Merging after rebase.

@stevenzwu merged commit b8db3f0 into apache:master on Jun 26, 2023.
@puchengy (Contributor) commented on Aug 22, 2025:

@stevenzwu we are seeing a Trino OOM issue during scan planning, and it might be because this PR introduced the table schema into each file scan task. The issue happens in conjunction with a very wide table schema and ParallelIterable usage. I wonder what your thoughts are on this? One possible way is to store the schema id instead of the actual schema to save memory.

Let me know if this is the right place to discuss, or if we should move somewhere else.

@stevenzwu (author) commented:

@puchengy please create a new issue to track and discuss this problem.

Agreed on the overhead of serializing the schema for every scan task. If we were to serialize just the schema id, the serializer would need to get hold of the schemas, which would require major refactoring of the call stack to pass them in. At the time, we opted for the simpler approach, but we can discuss the alternative.

Does Trino use the JSON parser for file scan tasks?
