Conversation

@aokolnychyi commented Apr 1, 2021

This PR changes our table serialization to avoid sending extra requests to access frequently needed metadata.

github-actions bot added the core label Apr 1, 2021
@aokolnychyi (author) commented:

@openinx @pvary @yyanyy @rdblue @RussellSpitzer @jackye1995, could you take a look, please?

  TableOperations ops = ((HasTableOperations) table).operations();
  return ops.current().metadataFileLocation();
} else {
  return null;
@aokolnychyi (author), Apr 1, 2021:

Tables that don't implement HasTableOperations will not be able to load full metadata, but they will still be serialized and deserialized correctly.
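
For context, this is roughly the shape of the surrounding helper (the helper name is assumed); only tables that expose TableOperations report a metadata file location:

private String metadataFileLocation(Table table) {
  if (table instanceof HasTableOperations) {
    TableOperations ops = ((HasTableOperations) table).operations();
    return ops.current().metadataFileLocation();
  } else {
    return null;
  }
}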

@RussellSpitzer (Member) left a comment:

LGTM

@aokolnychyi force-pushed the refactor-table-serializiblity branch from 5802094 to 29075c0 on April 2, 2021 at 20:50
}
}

private FileIO fileIO(Table table) {
@aokolnychyi (author):

I've changed this place to handle FileIO instead of delegating it to the caller.
We rely on the SerializedConfiguration class introduced in this PR.
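
A minimal sketch of what that handling might look like, assuming SerializedConfiguration exposes a get() that rebuilds the Hadoop Configuration and that HadoopFileIO can be constructed from a serializable conf supplier (both are assumptions, not a quote of the PR):

private FileIO fileIO(Table table) {
  if (table.io() instanceof HadoopFileIO) {
    // capture the Hadoop conf in a compact serializable form and hand the new IO
    // a supplier that rebuilds the conf after deserialization
    HadoopFileIO hadoopFileIO = (HadoopFileIO) table.io();
    SerializedConfiguration serializedConf = new SerializedConfiguration(hadoopFileIO.getConf());
    return new HadoopFileIO(serializedConf::get);
  } else {
    return table.io();
  }
}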

@aokolnychyi (author), Apr 2, 2021:

This can be made more generic in the future. We could expose a ConfigurableFileIO interface or something similar instead of handling only HadoopFileIO.

Contributor:

We could do something similar to dynamic loading and use if (table.io() instanceof Configurable).

@aokolnychyi (author), Apr 2, 2021:

Yeah, we would need a way to construct a copy of the class with the serialized conf.

If we knew that all FileIO implementations can be dynamically loaded, we could simply persist the class, props, and optional conf and rebuild it on demand. However, we cannot guarantee all FileIO implementations can be dynamically loaded.
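
For illustration, the dynamic-loading alternative described above might look roughly like this, using CatalogUtil.loadFileIO to rebuild the IO from a persisted class name and properties (the variable names are placeholders, and serializedConf.get() rebuilding the conf is an assumption):

// persisted at serialization time: implementation class, properties, optional conf
String ioImpl = ...;                       // the FileIO implementation class name
Map<String, String> ioProperties = ...;    // properties captured from the original IO
Configuration conf = serializedConf.get(); // Hadoop conf rebuilt from the captured map

FileIO rebuiltIO = CatalogUtil.loadFileIO(ioImpl, ioProperties, conf);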

Contributor:

Ah I see, the FileIO has to directly accept the serialized Hadoop config supplier. I thought it would be enough after #2333, but the setConf() there still needs to accept a Hadoop config and would not work with the supplier.

In that case, I think we can potentially have something like a SerializedConfigurable interface that has a method setSerializedConfSupplier(supplier), and check for instances of that interface. HadoopFileIO can then implement that interface instead.

The main point here is that if we do it for HadoopFileIO only, I think we should at least make sure other FileIO implementations that leverage Hadoop configuration can work with this.
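
A rough sketch of that SerializedConfigurable idea (the interface and method names are the ones proposed in this comment, not an existing API; SerializableSupplier is Iceberg's serializable supplier type):

public interface SerializedConfigurable {
  // receive a supplier that can rebuild the Hadoop conf after deserialization
  void setSerializedConfSupplier(SerializableSupplier<Configuration> supplier);
}

// caller side: only IO implementations that opt in receive the compact conf
if (table.io() instanceof SerializedConfigurable) {
  ((SerializedConfigurable) table.io()).setSerializedConfSupplier(serializedConf::get);
}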

@aokolnychyi (author), Apr 2, 2021:

Yeah, we would definitely need a more generic solution for FileIO implementations that depend on Hadoop conf.

I tried to prototype that but it raised a number of questions. The last idea I had was something like this:

interface ConfigurableFileIO<T extends ConfigurableFileIO> extends FileIO, Configurable {
  T toSerializable(SerializedConfiguration serializedConf);

  default Object writeReplace() {
    // default the implementation to call `toSerializable`
  }
}

It was clear this requires a separate PR and more thinking. That's why I propose to address that in a follow-up.

Contributor:

Cool, let's discuss in another PR then. Just to throw out another idea I just had:

public interface KryoSerializable<T> extends Serializable {
  
  // return a serialized version of self
  default T serialized() {
    throw new UnsupportedOperationException("Cannot support kryo serialization");
  }
}
public interface FileIO extends KryoSerializable<FileIO> {
 ...
}
public class HadoopFileIO ... {
  @Override
  public FileIO serialized() {
    SerializedConfiguration serializedConf = new SerializedConfiguration(getConf());
    return new HadoopFileIO(serializedConf::get);
  }
}

By doing this, we enforce that everything dynamically loaded is Kryo serializable.

Contributor:

I don't think that we should change the contract for dynamically loaded classes. If a user chooses to use Kryo and a dynamically loaded component, it is their responsibility to make sure the two are compatible. We just need to make sure that Iceberg-supplied classes work with Kryo.

* Hadoop configuration will be propagated to the captured state once an instance
* of {@link SerializedConfiguration} is constructed.
*/
public class SerializedConfiguration implements Serializable {
@aokolnychyi (author), Apr 2, 2021:

We cannot just replace the internal implementation of the existing SerializableConfiguration class, as a Hadoop conf is not immutable and can change dynamically. For example, getting a FileSystem would trigger a full reload of configs. That's why I added SerializedConfiguration, which is used only in SerializedTable for now.

Later, we can use SerializedConfiguration in writeReplace of SerializableConfiguration.
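
As a purely hypothetical sketch of that follow-up, SerializableConfiguration could delegate its Java serialization to the compact form via writeReplace (this would also need a matching readResolve on the compact class to rebuild a SerializableConfiguration on the other side):

// inside SerializableConfiguration (hypothetical follow-up, not part of this PR)
private Object writeReplace() throws ObjectStreamException {
  return new SerializedConfiguration(get());
}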

@aokolnychyi (author):

This approach reduced the size of the serialized Hadoop conf by 50%, as we don't serialize the source of the conf.

@aokolnychyi (author):

Also, this class is Kryo compatible, so we don't have to wrap it into special classes like we do today in Spark.

@aokolnychyi (author) commented Apr 2, 2021

@RussellSpitzer @jackye1995, could you take one more look? I've updated the approach to make SerializedTable Kryo compatible.

cc @openinx @pvary @yyanyy @rdblue too


public SerializedConfiguration(Configuration conf) {
  this.confAsMap = Maps.newHashMapWithExpectedSize(conf.size());
  conf.forEach(entry -> confAsMap.put(entry.getKey(), entry.getValue()));
Contributor:

I think this should also set conf to the one that is passed in. That way this won't create a new configuration if get() is called before the object is serialized.

@aokolnychyi (author):

If we do this, there will be no guarantee the conf and the map are in sync. This class should be used right before serialization and should not be accessed before it is serialized.
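
For context, a minimal sketch (assumed, not a quote of the PR) of how the captured map can be turned back into a Configuration when it is finally needed:

public Configuration get() {
  // rebuild a fresh Configuration purely from the captured key/value pairs
  Configuration conf = new Configuration(false);
  confAsMap.forEach(conf::set);
  return conf;
}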

@aokolnychyi (author):

I moved this class to SerializedTable to hide it and make sure nobody uses it.

@aokolnychyi (author) commented Apr 3, 2021

Since I redesigned the original approach, we may consider making serialized table classes non-public and offering a factory or a util class to build them instead. We will need to construct serialized tables manually for Kryo.

SerializedTables.newSerializedTable(table)
SerializedTables.createSerializedTable(table)
SerializedTables.forTable(table)

I don't want to call it a util as it has to be in the org.apache.iceberg package.

@rdblue (Contributor) left a comment:

Looks good to me. Thanks for pushing this forward, @aokolnychyi!

@aokolnychyi force-pushed the refactor-table-serializiblity branch from 92e293a to 4d5e014 on April 4, 2021 at 04:41
* @param table the original table to copy the state from
* @return a read-only serializable table reflecting the current state of the original table
*/
public static Table copyOf(Table table) {
@aokolnychyi (author):

Query engines that need Kryo support will call this method manually.
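
For illustration, a hedged usage sketch from an engine's point of view (the host class settles on the SerializableTable name later in this thread; the SerializationUtil round trip stands in for whatever the engine actually does):

// on the driver: build a self-contained, read-only copy of the table state
Table serializableCopy = SerializableTable.copyOf(table);

// ship it to tasks with Java serialization (or Kryo) without catalog round-trips
byte[] bytes = SerializationUtil.serializeToBytes(serializableCopy);
Table tableOnExecutor = SerializationUtil.deserializeFromBytes(bytes);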

/**
* A factory to create serializable tables.
*/
public class SerializableTableFactory {
@aokolnychyi (author):

I went with a factory instead of a util class so that I can have it in org.apache.iceberg and don't need to open up our base metadata table class.

@stevenzwu (Contributor), Apr 5, 2021:

This doesn't look like a traditional factory class. We probably don't need this wrapper and can directly make SerializableTable a public class.

Then we can either have a constructor like SerializableTable(Table t) or have a public static SerializableTable#copyOf(Table t).

@aokolnychyi (author), Apr 5, 2021:

We have two table implementations: one for base tables and one for metadata tables. I don't think making those two classes public is a good idea. We need to do a switch somewhere and expose a method for building serializable tables (a rough sketch of that switch follows the naming options below).

While we could name the class something like SerializableTable and do the switch there, it may be a bit weird that it would not implement Table. Let me think more about this. Maybe we can make the metadata serializable table a nested class.

SerializableTable.copyOf(table)
SerializableTables.copyOf(table)
SerializableTableFactory.copyOf(table)
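
A rough sketch of the switch mentioned above (the class names for the two implementations are assumptions; in the final revision the metadata variant becomes a private nested class):

public static Table copyOf(Table table) {
  if (table instanceof BaseMetadataTable) {
    // metadata tables need their own serializable wrapper
    return new SerializableMetadataTable((BaseMetadataTable) table);
  } else {
    return new SerializableTable(table);
  }
}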

Contributor:

SerializableTables.copyOf(table) looks more accurate than Factory to me.

@aokolnychyi (author):

I've updated to SerializableTable.copyOf(table) and made the metadata table a private nested class.
Could you take one more look, @stevenzwu?

@aokolnychyi (author) commented:

I was not entirely happy with the implementation. I've decided to introduce a factory that should be used by query engines and hide the table implementations. Will need another review round.

@aokolnychyi changed the title from "Core: Add SerializedTable and SerializedMetadataTable" to "Core: Add SerializableTableFactory" on Apr 4, 2021
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

public class KryoHelpers {
Contributor:

Why are this and the other test files in the Spark module?

@aokolnychyi (author):

We depend on Kryo from Spark. Iceberg does not bundle it.
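
For context, a minimal sketch of the kind of round-trip helper such a test class provides (the helper name and body are assumptions), built on the KryoSerializer that ships with Spark:

import java.nio.ByteBuffer;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.serializer.SerializerInstance;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class KryoRoundTrip {

  private KryoRoundTrip() {
  }

  // serialize and immediately deserialize an object with Spark's Kryo serializer,
  // which is what the serializability tests need to exercise
  @SuppressWarnings("unchecked")
  public static <T> T roundTrip(T obj) {
    SerializerInstance kryo = new KryoSerializer(new SparkConf()).newInstance();
    ClassTag<T> tag = ClassTag$.MODULE$.apply((Class<T>) obj.getClass());
    ByteBuffer buffer = kryo.serialize(obj, tag);
    return kryo.deserialize(buffer, tag);
  }
}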

Contributor:

Got it. But those tests don't seem tied to the Spark engine. Ideally, they should probably live in the iceberg-core module. It is probably fine to add Kryo as a test compile dependency in the iceberg-core module?

@aokolnychyi (author):

I am afraid each query engine has its own specifics. For example, Spark adds custom serializers for handling unmodifiable Java collections while Flink does not, which led to exceptions on the Flink side.

There are a number of Kryo-related suites in the query engine modules. I'd be up for refactoring, but probably in a separate PR.

Contributor:

Sounds good. Thanks a lot for the context.

@aokolnychyi changed the title from "Core: Add SerializableTableFactory" to "Core: Add SerializableTable" on Apr 5, 2021
@aokolnychyi merged commit 2843db8 into apache:master on Apr 5, 2021
@aokolnychyi (author):

Thanks for reviewing, @stevenzwu @rdblue @RussellSpitzer @jackye1995!
