Parquet: Add type visitor with partner type #1391
Conversation
5328205 to 38a1ee2 (compare)
@JingsongLi, this is similar to the visitor you wrote before; would you mind taking a look?

Thanks @chenjunjiedada for the contribution, I'll take a look in the next couple of days.
JingsongLi left a comment:
Thanks @chenjunjiedada, sorry for the late review. Please rebase onto the latest master.
public ParquetValueReader<RowData> message(org.apache.iceberg.types.Type expected, MessageType message,
    List<ParquetValueReader<?>> fieldReaders) {
  return struct(expected, message.asGroupType(), fieldReaders);
  if (expected == null) {
NIT: return struct(expected == null ? null : expected.asStructType(), message.asGroupType(), fieldReaders);
Done.
    List<ParquetValueWriter<?>> fieldWriters) {
  List<Type> fields = struct.getFields();
  List<RowField> flinkFields = sStruct.getFields();
  List<RowField> flinkFields = ((RowType) fStruct).getFields();
Can we change to use LogicalType.getChildren?
OK, updated.
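For reference, a minimal sketch (not the PR's exact code; the class and method names are made up) of what the getChildren-based version looks like:

import java.util.List;
import org.apache.flink.table.types.logical.LogicalType;

class FlinkPartnerAccessors {
  // LogicalType.getChildren() exposes the nested types of any composite type,
  // so the partner struct no longer needs to be cast to RowType just to list
  // its field types.
  static List<LogicalType> fieldTypes(LogicalType fStruct) {
    return fStruct.getChildren();
  }
}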
private final Deque<String> fieldNames = Lists.newLinkedList();

public static <P, T> T visit(P partnerType, Type type, ParquetTypeWithPartnerVisitor<P, T> visitor) {
  if (type instanceof MessageType) {
Preconditions.checkNotNull(partnerType, "Invalid partnerType: null");
Looks like we have to allow the partner type to be null in the current logic, since we call visitFields where the partner could be null. Also, when visiting a message/struct we allow the expected type to be null. Let me investigate whether we can avoid this.
@JingsongLi, visit is also called by visitList and visitMap, where we cannot guarantee that the inner type is not null. So I think we should allow a null partner here. Does that make sense to you?
If it's possible for the type to be null, won't the if {} / else if {} / else {} throw an NPE? There is no null check on the type.
OK, I misread it. Please ignore my comment above.
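For context, a rough sketch of the dispatch being discussed (simplified; visitFields appears in the PR, while visitGroup stands in for the list/map/struct handling): only the Parquet type, which is never null, is inspected, so a null partner does not by itself cause an NPE.

public static <P, T> T visit(P partner, Type type, ParquetTypeWithPartnerVisitor<P, T> visitor) {
  // dispatch is driven entirely by the Parquet type; the partner may be null
  // when the partner schema has no matching field
  if (type instanceof MessageType) {
    return visitor.message(partner, (MessageType) type,
        visitFields(partner, type.asGroupType(), visitor));
  } else if (type.isPrimitive()) {
    return visitor.primitive(partner, type.asPrimitiveType());
  } else {
    // lists and maps are detected from the Parquet annotation and recurse with
    // arrayElementType(partner) / mapKeyType(partner) / mapValueType(partner)
    return visitGroup(partner, type.asGroupType(), visitor);
  }
}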
}

public void beforeField(Type type) {
I don't quite understand the purpose of the following methods. Will they be overridden?
These are used to customize the stack while visiting the type; for example, the ApplyNameMapping visitor overrides some of them to generate the correct mapping.
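For instance, a sketch of the default hooks, assuming the fieldNames deque shown earlier: they just maintain the current field path, and subclasses can override them.

public void beforeField(Type type) {
  // track the path to the field being visited so the visitor can report
  // fully qualified column names
  fieldNames.push(type.getName());
}

public void afterField(Type type) {
  fieldNames.pop();
}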
protected abstract P arrayElementType(P arrayType);
protected abstract P mapKeyType(P mapType);
protected abstract P mapValueType(P mapType);
protected abstract Pair<String, P> fieldNameAndType(P structType, int pos, Integer fieldId);
fieldId -> parquetFieldId
Done.
}
Type valueType = repeatedKeyValue.getType(1);
visitor.beforeValueField(valueType);
try {
Maybe we can have a method runWithStack(Runnable, Type)?
ParquetTypeVisitor has the same visiting logic. How about refactoring both in a separate PR?
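The suggested helper might look roughly like this (a sketch only; the name comes from the comment above, and a parameterized variant would be needed for the key/value/element hooks):

private static void runWithStack(Runnable step, Type type, ParquetTypeWithPartnerVisitor<?, ?> visitor) {
  // push the field onto the name stack, run the visit step, and always pop
  visitor.beforeField(type);
  try {
    step.run();
  } finally {
    visitor.afterField(type);
  }
}

A call site would then read something like runWithStack(() -> results.add(visit(partnerFieldType, field, visitor)), field, visitor), with the names illustrative only.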
return list.toArray(new String[0]);
Type type = struct.field(fieldId).type();
String name = struct.field(fieldId).name();
return Pair.of(name, type);
NIT:
Types.NestedField field = struct.field(fieldId);
return field == null ? null : Pair.of(field.name(), field.type());
Done.
protected abstract P arrayElementType(P arrayType);
protected abstract P mapKeyType(P mapType);
protected abstract P mapValueType(P mapType);
protected abstract Pair<String, P> fieldNameAndType(P structType, int pos, Integer fieldId);
Why do we need fieldNameAndType? Looks like just the type is enough?
Yes, you are right.
newOption(repeatedKeyValue.getType(0), keyWriter),
newOption(repeatedKeyValue.getType(1), valueWriter),
sMap.keyType(), sMap.valueType());
((MapType) sMap).keyType(), ((MapType) sMap).valueType());
Can we use arrayElementType, mapKeyType, mapValueType, etc. to avoid the casting?
Done.
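For illustration, a sketch of how the Flink subclass can centralize those casts in the accessor overrides (the class name is made up; the types are Flink's org.apache.flink.table.types.logical, and null partners pass through, matching the earlier discussion):

import org.apache.flink.table.types.logical.ArrayType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.MapType;

abstract class FlinkPartnerTypeSketch {
  // the casts live in one place, so call sites use mapKeyType(sMap),
  // mapValueType(sMap), and arrayElementType(sArray) without casting
  protected LogicalType arrayElementType(LogicalType arrayType) {
    return arrayType == null ? null : ((ArrayType) arrayType).getElementType();
  }

  protected LogicalType mapKeyType(LogicalType mapType) {
    return mapType == null ? null : ((MapType) mapType).getKeyType();
  }

  protected LogicalType mapValueType(LogicalType mapType) {
    return mapType == null ? null : ((MapType) mapType).getValueType();
  }
}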
Thanks a lot @JingsongLi! I will update this tomorrow.

38a1ee2 to d5db93d (compare)

@JingsongLi, I think this is ready for another review. Could you please take a look?
visitor.beforeField(field);
Integer fieldId = field.getId() == null ? null : field.getId().intValue();
results.add(visit(visitor.fieldType(struct, i, fieldId), field, visitor));
visitor.afterField(field);
Q: should we use a try-finally block here for beforeField and afterField?
Yes, I think so. Nice catch!
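A minimal sketch of the agreed change to the loop above:

visitor.beforeField(field);
try {
  Integer fieldId = field.getId() == null ? null : field.getId().intValue();
  results.add(visit(visitor.fieldType(struct, i, fieldId), field, visitor));
} finally {
  // pop the field even if visiting it throws, so the name stack stays balanced
  visitor.afterField(field);
}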
List<String> list = Lists.newArrayList(fieldNames.descendingIterator());
list.add(name);
return list.toArray(new String[0]);
Types.NestedField field = structType.asStructType().field(fieldId);
I think there is a bug here: in the method definition we have a parquetFieldId, but here we use that ID to access the Iceberg nested field. That sounds unreasonable.
protected abstract P fieldType(P structType, int pos, Integer parquetFieldId);
BTW, we might need a unit test to cover this case?
List<String> list = Lists.newArrayList(fieldNames.descendingIterator());
list.add(name);
return list.toArray(new String[0]);
return ((RowType) structType).getTypeAt(pos);
I'm also curious whether the inner fields keep the same order as the Parquet fields. If not, we would get the wrong data type for the position provided by Parquet.
if (field.getId() != null) {
  id = field.getId().intValue();
}
Types.NestedField iField = (struct != null && id >= 0) ? struct.field(id) : null;
@rdblue, is this a bug in the master branch? We use the Parquet field ID to access the Iceberg nested field; that sounds unreasonable...
There are two ways to traverse the schemas together: by name and by ID. When we are building a reader, we use the ID method because names don't necessarily match between the schema the file was written with and the current table schema. The columns are identified by ID, so this is correct in that case.
Traversing two schemas by name is only done when we know that the names between the two match. For example, when Spark runs a CTAS operation, we convert the Spark schema to an Iceberg schema and we know that the two have the same structure and field names, but the Spark schema has no IDs. So when we build a writer, we have to match by name but we know that this is safe. (The two schemas are needed to build writers that convert, like short -> int.)
The columns are identified by ID, so this is correct in that case.
OK, I see: the ParquetWriter persists the Parquet schema converted from the Iceberg schema by TypeToMessageType (which uses the same field ID for each converted Parquet field), so the Parquet schema seen by TypeWithSchemaVisitor should have the same field IDs as the Iceberg table's schema.
I just wanted to make sure this was designed intentionally. Thanks for the context and confirmation.
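A small illustration of that guarantee (the schema here is made up; ParquetSchemaUtil.convert is Iceberg's schema conversion, which is backed by TypeToMessageType):

import org.apache.iceberg.Schema;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.iceberg.types.Types;
import org.apache.parquet.schema.MessageType;

class FieldIdExample {
  public static void main(String[] args) {
    Schema icebergSchema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()));

    // the converted Parquet schema carries the same field IDs, which is why a
    // reader can safely look up Iceberg fields by the Parquet field ID
    MessageType parquetSchema = ParquetSchemaUtil.convert(icebergSchema, "table");
    System.out.println(parquetSchema.getType("data").getId().intValue());  // prints 2
  }
}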
I'm not sure we want to move forward with these changes. This is a lot of code churn and I don't see much of a benefit. Abstracting this requires using more generic types and calling

@rdblue, this was motivated when
This is a refactor of the Parquet schema visitor so that it accepts a partner type, such as an Iceberg Type, a Spark DataType, or a Flink LogicalType.
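For readers skimming the PR, the shape of the new visitor is roughly the following (a simplified sketch based on the diff excerpts above, not the final class):

import java.util.List;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;

// P is the partner type (e.g. Iceberg Type, Spark DataType, Flink LogicalType);
// T is the result produced for each visited Parquet type.
public abstract class ParquetWithPartnerVisitorSketch<P, T> {
  public T message(P partner, MessageType message, List<T> fieldResults) {
    return null;
  }

  public T struct(P partner, GroupType struct, List<T> fieldResults) {
    return null;
  }

  public T list(P partner, GroupType array, T elementResult) {
    return null;
  }

  public T map(P partner, GroupType map, T keyResult, T valueResult) {
    return null;
  }

  public T primitive(P partner, PrimitiveType primitive) {
    return null;
  }

  // accessors that each binding (Iceberg, Spark, Flink) implements to walk the
  // partner schema in step with the Parquet schema
  protected abstract P arrayElementType(P arrayType);
  protected abstract P mapKeyType(P mapType);
  protected abstract P mapValueType(P mapType);
  protected abstract P fieldType(P structType, int pos, Integer fieldId);
}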