FlinkTypeVisitor: Use LogicalTypeVisitor and supports MultisetType #1173
Conversation
import org.apache.flink.table.types.logical.ZonedTimestampType;

- public class FlinkTypeVisitor<T> {
+ public abstract class FlinkTypeVisitor<T> implements LogicalTypeVisitor<T> {
@JingsongLi I'm curious what the difference is between the Flink-style LogicalTypeVisitor and the Iceberg-style visitor... Currently, all of the visitors are Iceberg style, and I'm not quite sure what the benefit of converting to a Flink-style visitor is...
Update: OK, I read the background in this issue here (#1173 (comment)), sounds reasonable.
BTW, it seems this FlinkTypeVisitor can be package-private (I forgot to check the access before).
- public Type map(KeyValueDataType map, Type keyType, Type valueType) {
+ public Type visit(MultisetType multisetType) {
+   Type elementType = multisetType.getElementType().accept(this);
+   return Types.MapType.ofRequired(getNextId(), getNextId(), elementType, Types.IntegerType.get());
Sounds good that we've extended support to the Flink multiset data type.
openinx left a comment
The patch looks good to me overall; I left a few comments. @rdblue you may want to take a final check. Thanks.
List<Types.NestedField> newFields = Lists.newArrayListWithExpectedSize(rowType.getFieldCount());
boolean isRoot = root == rowType;

List<Type> types = rowType.getFields().stream()
It seems we don't need to loop twice here (the first loop to get List<Type> and the second loop to get List<Types.NestedField>). It could be simplified like the following:
@Override
public Type visit(RowType rowType) {
  List<Types.NestedField> newFields = Lists.newArrayListWithExpectedSize(rowType.getFieldCount());
  boolean isRoot = root == rowType;
  for (int i = 0; i < rowType.getFieldCount(); i++) {
    int id = isRoot ? i : getNextId();
    RowType.RowField field = rowType.getFields().get(i);
    String name = field.getName();
    String comment = field.getDescription().orElse(null);
    Type type = field.getType().accept(this);
    if (field.getType().isNullable()) {
      newFields.add(Types.NestedField.optional(id, name, type, comment));
    } else {
      newFields.add(Types.NestedField.required(id, name, type, comment));
    }
  }
  return Types.StructType.of(newFields);
}
One thing: we may need to adjust where the field IDs for nested types are generated, and then we may need to adjust the unit test.
I'd prefer to keep the two loops. If we need to change the ID generation for nested types, I think it is better to change Spark too.
I'm OK with keeping the current two loops here now; let's just keep the ID generation consistent with Spark.
String comment = field.getDescription().orElse(null);

if (field.getType().isNullable()) {
  newFields.add(Types.NestedField.optional(id, name, types.get(i), comment));
There is also a factory method that accepts a nullability boolean, NestedField.of.
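For illustration, the branch could collapse into one call (a sketch assuming the loop variables id, name, types, comment, and field shown above; NestedField.of takes the optionality as a boolean):

// Nullable Flink fields become optional Iceberg fields, non-nullable fields become required.
newFields.add(Types.NestedField.of(id, field.getType().isNullable(), name, types.get(i), comment));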
| .field("decimal", DataTypes.DECIMAL(2, 2)) | ||
| .field("decimal2", DataTypes.DECIMAL(38, 2)) | ||
| .field("decimal3", DataTypes.DECIMAL(10, 1)) | ||
| .field("multiset", DataTypes.MULTISET(DataTypes.STRING().notNull())) |
What happens for a multiset of nullable items?
Just like a nullable key in a Map: because the default behavior in Flink is a nullable key, we support its conversion (see the sketch after this list):
- in the conversion from Flink type to Iceberg type, just ignore the nullability of the key.
- in the conversion from Iceberg type to Flink type, the nullability of the key becomes false.
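As a small illustration of the second point, here is a minimal sketch using Flink's generic LogicalType API (this is not the exact conversion code in this PR; the class name is made up for the example):

import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;

public class KeyNullabilityExample {
  public static void main(String[] args) {
    // Flink types are nullable by default.
    LogicalType flinkKey = new IntType();            // INT (nullable)

    // When an Iceberg map key is converted back to a Flink type, it is forced
    // to NOT NULL, because Iceberg map keys are always required.
    LogicalType requiredKey = flinkKey.copy(false);  // INT NOT NULL
    System.out.println(requiredKey.isNullable());    // prints false
  }
}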
Okay, for rows that are passed to Iceberg that have null map keys or null values in a multiset, what should happen?
Null values are OK; the problem is null keys.
For null key support, the file formats look OK; the only constraint is that Avro only supports string keys for map types.
But the question is whether we have any special optimizations for not-null. The answer is yes, see ParquetValueWriters.option. If a null key reaches the Parquet writer, I think there would be a NullPointerException. That is not very elegant.
Another choice is what I said in https://github.com/apache/iceberg/pull/1096/files/8891cd5438306f0b4b226706058beff7c3cd4080#diff-12a375418217cdc6be26c73e02d56065R102
We can throw an UnsupportedOperationException here to tell users, even though Flink's map keys are nullable by default.
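For example, such a guard could be added to the visit(MultisetType) method shown in the diff above (a sketch of the idea, not the exact code in this PR):

@Override
public Type visit(MultisetType multisetType) {
  // Reject nullable multiset elements (the future Iceberg map keys) at conversion
  // time, instead of failing later with a NullPointerException in the Parquet writer.
  if (multisetType.getElementType().isNullable()) {
    throw new UnsupportedOperationException(
        "Iceberg map keys cannot be nullable: " + multisetType.asSummaryString());
  }
  Type elementType = multisetType.getElementType().accept(this);
  return Types.MapType.ofRequired(getNextId(), getNextId(), elementType, Types.IntegerType.get());
}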
Overall, looks good to me. I'll merge this. Thanks @JingsongLi, I think the logical type visitor looks clean. And thanks to @openinx for reviewing!
This PR wants to improve #1096

Use LogicalTypeVisitor

Flink has LogicalTypeVisitor and DataTypeVisitor, which are very useful for visiting types. We don't need to implement a custom visitor based on instanceof; that approach is error prone and not very elegant. And FieldsDataType does not have a good design in 1.9 and 1.10, so in Flink 1.11 it has been refactored and getFieldDataTypes has been removed. So I think a LogicalTypeVisitor is enough, since we never touch the physical information in the DataTypes.
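For readers who haven't used the Flink-style visitor: accept() dispatches to the matching visit(...) overload, so there is no hand-written instanceof chain. The minimal sketch below uses Flink's LogicalTypeDefaultVisitor helper to keep it short (the FlinkTypeVisitor in this PR implements LogicalTypeVisitor directly; the class and its output here are only illustrative):

import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.MultisetType;
import org.apache.flink.table.types.logical.utils.LogicalTypeDefaultVisitor;

// A toy visitor: only the types we care about are overridden,
// everything else falls back to defaultMethod.
class TypeDescriber extends LogicalTypeDefaultVisitor<String> {
  @Override
  public String visit(IntType intType) {
    return "int";
  }

  @Override
  public String visit(MultisetType multisetType) {
    // Recurse into the element type with the same visitor.
    return "multiset<" + multisetType.getElementType().accept(this) + ">";
  }

  @Override
  protected String defaultMethod(LogicalType logicalType) {
    return logicalType.asSummaryString();
  }

  public static void main(String[] args) {
    // Prints "multiset<int>"
    System.out.println(new MultisetType(new IntType()).accept(new TypeDescriber()));
  }
}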
Support MultisetType
A CollectionDataType may be a MultisetType too. We can map it to Map<T, Integer>.
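For example, a MULTISET&lt;STRING NOT NULL&gt; column would end up as the following Iceberg type, where the integer value stores the multiplicity of each element (a minimal sketch; the field ids are only illustrative):

import org.apache.iceberg.types.Types;

public class MultisetMappingExample {
  public static void main(String[] args) {
    // MULTISET<STRING NOT NULL> in Flink maps to map<string, int> in Iceberg.
    Types.MapType multisetAsMap = Types.MapType.ofRequired(
        1, 2,                     // key id and value id (illustrative)
        Types.StringType.get(),   // the multiset element type becomes the map key
        Types.IntegerType.get()); // the value is the per-element count
    System.out.println(multisetAsMap); // map<string, int>
  }
}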