Flink: Using RowData to avro reader and writer #1232
Conversation
Looks like this one needs to be rebased after the others from #1231 are merged.
bf08cf2 to 50d3b88 (force-pushed)
 * @param <P> Partner type.
 * @param <T> Return T.
 */
public abstract class AvroWithPartnerByStructureVisitor<P, T> {
I think this PR needs to be rebased now that #1235 is in, right?
}

static ValueReader<MapData> arrayMap(ValueReader<?> keyReader,
                                     ValueReader<?> valueReader) {
Nit: indentation is off.
@Override
public TimestampData read(Decoder decoder, Object reuse) throws IOException {
  // TODO Do we need to consider time zones.
Time zones are left to the processing engine. It is up to the engine to convert times to concrete values for storage and from concrete values for display. Iceberg's responsibility is to return the value without modification.
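For illustration, a minimal sketch of a reader that follows this policy (the class name is hypothetical; the conversion mirrors the diff context above): it reads an Avro timestamp-micros long and hands it back as Flink's TimestampData with no time-zone adjustment.

```java
import java.io.IOException;
import org.apache.avro.io.Decoder;
import org.apache.flink.table.data.TimestampData;

class MicrosToTimestampReader {
  // Splitting micros into epoch millis plus nanos-of-milli is a pure
  // representation change; no zone conversion happens here.
  TimestampData read(Decoder decoder, Object reuse) throws IOException {
    long micros = decoder.readLong();
    long mills = micros / 1000;
    int nanos = ((int) (micros % 1000)) * 1000;
    if (nanos < 0) {
      // correct truncation toward zero for negative timestamps
      nanos += 1_000_000;
      mills -= 1;
    }
    return TimestampData.fromEpochMillis(mills, nanos);
  }
}
```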
BigDecimal decimal = d.toBigDecimal();

byte fillByte = (byte) (decimal.signum() < 0 ? 0xFF : 0x00);
Can we move this logic into a common DecimalUtil method? I think we have quite a few copies of it.
Created #1265 for this.
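For context, a hedged sketch of what such a shared helper might look like (the method actually added by #1265 may differ in name and signature): it sign-extends the unscaled bytes of a BigDecimal into a fixed-length big-endian buffer, which is what Avro's fixed decimal encoding needs.

```java
import java.math.BigDecimal;
import java.util.Arrays;

public class DecimalUtil {
  private DecimalUtil() {
  }

  // Write the two's-complement unscaled value of `decimal` into `reuse`,
  // padding the leading bytes with 0xFF for negative values and 0x00 otherwise.
  // Validation that the value fits the target precision/scale is omitted in this sketch.
  public static byte[] toFixLengthBytes(BigDecimal decimal, byte[] reuse) {
    byte[] unscaled = decimal.unscaledValue().toByteArray();
    byte fillByte = (byte) (decimal.signum() < 0 ? 0xFF : 0x00);
    int offset = reuse.length - unscaled.length;
    Arrays.fill(reuse, 0, offset, fillByte);
    System.arraycopy(unscaled, 0, reuse, offset, unscaled.length);
    return reuse;
  }
}
```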
public TemporaryFolder temp = new TemporaryFolder();

@Override
protected void writeAndValidate(Schema schema) throws IOException {
  List<RowData> inputs = generateDataFromAvroFile(schema);
I see you generate the List<Record> first, then write it to the file appender, and finally read it back into a List<RowData>. Could we just use RandomData#generateRowData to produce those RowData?
- First, RandomData is currently incorrect for some types, such as arrays and timestamp with time zone.
- Second, using the Iceberg Avro writer tests format compatibility better.
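As illustration, a sketch of that write-with-generics / read-as-RowData flow. The reader class name and package (org.apache.iceberg.flink.data.FlinkAvroReader) are assumptions about what this PR adds; the builder calls follow the existing Iceberg Avro API.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.avro.Avro;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.avro.DataWriter;
import org.apache.iceberg.flink.data.FlinkAvroReader;  // assumed class added by this PR
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;

class AvroRoundTrip {
  static List<RowData> writeRecordsReadRowData(Schema schema, Iterable<Record> records, File file)
      throws IOException {
    // write generic Records with the Iceberg Avro writer
    try (FileAppender<Record> writer = Avro.write(Files.localOutput(file))
        .schema(schema)
        .createWriterFunc(DataWriter::create)
        .build()) {
      writer.addAll(records);
    }

    // read the same file back through the Flink object model
    List<RowData> rows = new ArrayList<>();
    try (CloseableIterable<RowData> reader = Avro.read(Files.localInput(file))
        .project(schema)
        .createReaderFunc(FlinkAvroReader::new)
        .build()) {
      reader.forEach(rows::add);
    }
    return rows;
  }
}
```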
private static Iterable<RowData> generateRowData(Schema schema, int numRecords,
    Supplier<RandomRowGenerator> supplier) {
  DataStructureConverter<Object, Object> converter =
      DataStructureConverters.getConverter(TypeConversions.fromLogicalToDataType(FlinkSchemaUtil.convert(schema)));
Here we may need to call converter.open(RandomData.class.getClassLoader()) to initialize the converter?
Yes, we could, but only StructuredObjectConverter implements open, and Flink does not support structured types yet (a structured type is not a RowType). I'll revert this method in RandomData since it is not used.
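For reference, a minimal sketch of what opening the converter before use would look like, mirroring the snippet quoted above rather than the final PR (which reverted this method); treating org.apache.flink.types.Row as the external type is an assumption.

```java
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.conversion.DataStructureConverter;
import org.apache.flink.table.data.conversion.DataStructureConverters;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.utils.TypeConversions;
import org.apache.flink.types.Row;
import org.apache.iceberg.Schema;
import org.apache.iceberg.flink.FlinkSchemaUtil;

class RowDataConversionSketch {
  static RowData toRowData(Schema schema, Row externalRow) {
    DataType dataType = TypeConversions.fromLogicalToDataType(FlinkSchemaUtil.convert(schema));
    DataStructureConverter<Object, Object> converter = DataStructureConverters.getConverter(dataType);
    // only StructuredObjectConverter actually requires open(), but calling it is harmless
    converter.open(RowDataConversionSketch.class.getClassLoader());
    return (RowData) converter.toInternal(externalRow);
  }
}
```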
8792a28 to 977a2ce (force-pushed)
flink/src/main/java/org/apache/iceberg/flink/TaskWriterFactory.java (outdated comment thread, resolved)
int nanos = ((int) (micros % 1000)) * 1000;
if (nanos < 0) {
  nanos += 1_000_000;
  mills -= 1;
}
Here it's simpler to use floorDiv and floorMod:

long mills = Math.floorDiv(micros, 1000);
int nanos = Math.floorMod(micros, 1000) * 1000;

I wrote a simple benchmark; the Math.floor* versions are about 10% slower.
@openinx, that might influence fixing the timestamp types in ORC!
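For reference, a small self-contained comparison of the two conversions discussed in this thread; both split a micros value into epoch millis plus nanos-of-milli and agree for negative inputs.

```java
public class MicrosSplitComparison {
  public static void main(String[] args) {
    long micros = -1_999_123L;  // sample negative timestamp in microseconds

    // branch-based version from the PR
    long mills = micros / 1000;
    int nanos = ((int) (micros % 1000)) * 1000;
    if (nanos < 0) {
      nanos += 1_000_000;
      mills -= 1;
    }

    // floorDiv/floorMod version suggested in the review
    long floorMills = Math.floorDiv(micros, 1000);
    int floorNanos = (int) Math.floorMod(micros, 1000) * 1000;

    // both print mills=-2000, nanos=877000
    System.out.printf("branch:   mills=%d, nanos=%d%n", mills, nanos);
    System.out.printf("floorDiv: mills=%d, nanos=%d%n", floorMills, floorNanos);
  }
}
```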
flink/src/main/java/org/apache/iceberg/flink/data/FlinkValueReaders.java (outdated comment thread, resolved)
import org.apache.flink.table.types.logical.ZonedTimestampType;

-abstract class FlinkTypeVisitor<T> implements LogicalTypeVisitor<T> {
+public abstract class FlinkTypeVisitor<T> implements LogicalTypeVisitor<T> {
Does this need to be public? The only reference to FlinkTypeVisitor that I see in this PR is here, so I'm not sure why this is needed.
No need. I originally thought the reading and writing paths would rely on FlinkTypeVisitor.
  }
}

private static class ArrayWriter<T> implements ValueWriter<ArrayData> {
Eventually, we should refactor this into a base class for array data, so that the encoder parts are shared between Flink and Spark. Not something we should do right now, though.
@Test
public void testNormalData() throws IOException {
  testCorrectness(COMPLEX_SCHEMA, NUM_RECORDS, RandomData.generate(COMPLEX_SCHEMA, NUM_RECORDS, 19982));

private List<RowData> generateDataFromAvroFile(Schema schema) throws IOException {
I think it would be better to validate Flink RowData against generic Record. That's what we do in Spark tests, where we first write using generics (or Avro in older tests) and then validate that the records we read using the Spark object model are equivalent. By doing that, you not only test that RowData to disk and back to RowData works, but that the records are actually equivalent to another read format.
You are right, we should have an asserter for RowData and Record.
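A hedged sketch of such an asserter, handling only a flat schema of a few primitive types (the real helper in the follow-up PR would also cover nested and logical types):

```java
import java.util.List;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types;
import org.junit.Assert;

public class RowDataAsserts {
  private RowDataAsserts() {
  }

  // Compare a generic Record (written by the Iceberg writer) against a RowData
  // (read back through the Flink object model), field by field.
  public static void assertEquals(Schema schema, Record expected, RowData actual) {
    List<Types.NestedField> fields = schema.columns();
    Assert.assertEquals("arity should match", fields.size(), actual.getArity());

    for (int i = 0; i < fields.size(); i++) {
      Object value = expected.get(i);
      switch (fields.get(i).type().typeId()) {
        case INTEGER:
          Assert.assertEquals("int should match", value, actual.getInt(i));
          break;
        case LONG:
          Assert.assertEquals("long should match", value, actual.getLong(i));
          break;
        case STRING:
          Assert.assertEquals("string should match", value, actual.getString(i).toString());
          break;
        default:
          throw new UnsupportedOperationException("not handled in this sketch: " + fields.get(i).type());
      }
    }
  }
}
```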
@JingsongLi, this looks ready to go so I merged it. I think we can still improve some of the tests by validating the read and write paths separately and comparing records against Iceberg generics. But I believe that @chenjunjiedada is working on the validations or assert methods in another PR so we can get that done later. Thanks for working on this, it looks great.
Thanks @rdblue for your patient review; I will continue to follow and take part in the follow-up improvements.
Fixes #1231