Flink: Add ChangeLog DataStream end-to-end unit tests. #1974
Conversation
 * @param columns defines the iceberg table's key.
 * @return {@link Builder} to connect the iceberg table.
 */
public Builder equalityFieldColumns(List<String> columns) {
Do you think that we should consider adding primary key columns to the spec?
In the next PR https://github.com/openinx/incubator-iceberg/commit/a863c66eb3d72dd975ea64c75ed2ac35984c17fe, the Flink table SQL's primary key will act as the equality field columns. The semantics of Iceberg's equality columns are almost the same as a primary key; the one difference I can think of is that the uniqueness of the key is not enforced. In this discussion, we don't guarantee uniqueness when writing a key that was also written in a previously committed txn. That means if:
Txn-1: INSERT key1, txn commit;
Txn-2: INSERT key1, txn commit;
then the table will have two records with the same key.
If people really need Iceberg to maintain the key's uniqueness, they will need to transform every INSERT into an UPSERT, which means DELETE first and then INSERT the new value.
That introduces another issue: each INSERT is regarded as an UPSERT, so it writes a DELETE and an INSERT. In the end, the size of the delete files will be almost the same as the size of the data files, and merging on read will be quite inefficient because there are too many useless DELETEs to JOIN.
The direct way to reduce the useless DELETEs is a bloom filter: we would generate a bloom filter binary for each committed data file. When bootstrapping the Flink/Spark job, we would prefetch all the bloom filter binaries from the Parquet/Avro data files' metadata. Before writing an equality delete, we would check the bloom filters, and if they indicate that none of the committed data files contain the given key, we could skip appending that equality delete. That would avoid a lot of useless DELETEs in the delete files. Of course, bloom filters have false positives, but the probability is less than 1%, which means we may append a small amount of deletes whose keys don't exist in the current table. In my view, that should be OK.
In summary, I think it's reasonable to regard those equality fields as the primary key of an Iceberg table. People could choose between UNIQUENESS ENFORCED and UNIQUENESS NOT-ENFORCED, so they could trade off between strong semantics and performance.
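To make the UPSERT path concrete, here is a minimal, hypothetical sketch (not code from this PR) of treating every INSERT as a DELETE followed by an INSERT, with a per-data-file bloom filter used to skip equality deletes for keys that no committed file can contain. The class and interface names are assumptions for illustration only.

import com.google.common.hash.BloomFilter;
import java.util.List;

// Illustration only: every INSERT is treated as an UPSERT (delete-by-key, then insert),
// and the delete is skipped when the bloom filters say the key cannot exist yet.
class UpsertSketch {
  private final List<BloomFilter<CharSequence>> committedFileFilters; // one filter per committed data file
  private final EqualityDeleteSink deleteSink;  // hypothetical writer for equality deletes
  private final DataSink dataSink;              // hypothetical writer for data files

  UpsertSketch(List<BloomFilter<CharSequence>> filters, EqualityDeleteSink deleteSink, DataSink dataSink) {
    this.committedFileFilters = filters;
    this.deleteSink = deleteSink;
    this.dataSink = dataSink;
  }

  void upsert(String key, Object value) {
    // A false positive (< ~1%) only costs one redundant equality delete, never correctness.
    boolean mightExist = committedFileFilters.stream().anyMatch(filter -> filter.mightContain(key));
    if (mightExist) {
      deleteSink.deleteByKey(key);
    }
    dataSink.insert(key, value);
  }

  interface EqualityDeleteSink { void deleteByKey(String key); }
  interface DataSink { void insert(String key, Object value); }
}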
For the bloom filter idea, @wangmiao1981 has been working on a proposal for secondary indexes. I think that could be used for the check you're suggesting here.
people could choose to use UNIQUENESS ENFORCED or UNIQUENESS NOT-ENFORCED, in this way they could trade off between strong semantic and performance.
Are you saying that if uniqueness is enforced, each insert becomes an upsert, but if uniqueness is not enforced, the sink would assume that whatever is emitting records will correctly delete before inserting? That sounds reasonable to me.
Finally the size of delete files will be almost same as the size of data files. The process of merging on read will be quite inefficient because there are too many useless DELETE to JOIN.
I think that even if uniqueness is not enforced, tables will quickly require compaction to rewrite the equality deletes. I think we should spend some time making sure that we have good ways to maintain tables and compact equality deletes into position deletes, and position deletes into data files.
Are you saying that if uniqueness is enforced, each insert becomes an upsert, but if uniqueness is not enforced, the sink would assume that whatever is emitting records will correctly delete before inserting?
Yes. If someone is exporting a relational database's change log events to an Apache Iceberg table and can guarantee exactly-once semantics (for example, the flink-cdc-connector can guarantee that), then uniqueness is always correct when we just write the INSERT/DELETE/UPDATE_BEFORE/UPDATE_AFTER events to Iceberg. In some other cases, for example a Flink aggregate job refreshing a metrics count value, we will write the same key several times without deleting first, so we should regard every INSERT as an UPSERT.
even if uniqueness is not enforced, tables will quickly require compaction to rewrite the equality deletes.
That is planned for the second phase, which includes:
- Use a bloom filter to reduce the large number of useless deletes;
- Minor compaction to convert part of the equality deletes to pos-deletes;
- Major compaction to eliminate all the deletes;
- Make the whole read path & write path more stable. For example, a cache policy to reduce duplicated delete-file loading when merging on read in the same tasks; spill to disk if the insertedRowMap exceeds the task's memory threshold, etc. I will evaluate the read & write & compaction paths on a large dataset, making this a stable solution for production.
It would be good to have a document collecting all of those things for review.
I’d vote for not ensuring uniqueness, as it is really hard at scale. If we are to ensure this at write time, we have to join the incoming data with the target table, making it really expensive. Doing this at read time would require sorting the data not only by the sort key but also by the sequence number.
    column, table.schema());
  equalityFieldIds.add(field.fieldId());
  }
}
Why not do this conversion in equalityFieldColumns and keep the column ids in the builder instead of the source column names?
Because FlinkSink is an API exposed to Flink's DataStream users, the concept of an equality field id is harder for those users to understand. Equality field column names are more user-friendly.
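For comparison, a minimal sketch (an assumption about how it could look, not this PR's code) of the suggested alternative: the builder keeps the user-facing column names, and they are resolved to field ids against the table schema only when the sink is built. The helper name toEqualityFieldIds is hypothetical.

import java.util.List;
import com.google.common.base.Preconditions;
import com.google.common.collect.Lists;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Resolve user-facing equality field column names to Iceberg field ids once the
// table Schema is known (e.g. inside build()), so the public builder API can keep
// accepting names.
private static List<Integer> toEqualityFieldIds(Schema schema, List<String> columns) {
  List<Integer> ids = Lists.newArrayList();
  for (String column : columns) {
    Types.NestedField field = schema.findField(column);
    Preconditions.checkNotNull(field,
        "Missing required equality field column '%s' in table schema %s", column, schema);
    ids.add(field.fieldId());
  }
  return ids;
}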
DataStream<Row> dataStream = env.addSource(new BoundedTestSource<>(elementsPerCheckpoint), ROW_TYPE_INFO);

// Shuffle by the equality key, so that different operations from the same key could be wrote in order when
// executing tasks in parallelism.
Nit: I think you mean "executing tasks in parallel" rather than "parallelism".
Thanks for pointing it out, I will address it in the next update.
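For readers following the quoted snippet above, a hedged sketch of what that comment describes: keying the changelog stream by the equality field so that all operations on the same key land on the same parallel subtask in arrival order. The field position used as the key is an assumption here, and KeySelector/KeyedStream are the standard Flink DataStream API types.

// A possible key selector (assuming the id sits at field position 0): it routes all
// changelog events for the same id to the same parallel subtask, preserving per-key order.
KeyedStream<Row, Integer> keyedStream = dataStream.keyBy(new KeySelector<Row, Integer>() {
  @Override
  public Integer getKey(Row row) {
    return (Integer) row.getField(0);
  }
});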
| "+I", RowKind.INSERT, | ||
| "-D", RowKind.DELETE, | ||
| "-U", RowKind.UPDATE_BEFORE, | ||
| "+U", RowKind.UPDATE_AFTER); |
Could this be a private static map instead of defining it each time a row is created?
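A small sketch of that suggestion, assuming the test's row(...) helper currently rebuilds this map on every call; the constant name below is hypothetical.

import java.util.Map;
import com.google.common.collect.ImmutableMap;
import org.apache.flink.types.RowKind;

// Build the short-name-to-RowKind lookup once, instead of once per created row.
private static final Map<String, RowKind> ROW_KIND_BY_SHORT_NAME = ImmutableMap.of(
    "+I", RowKind.INSERT,
    "-D", RowKind.DELETE,
    "-U", RowKind.UPDATE_BEFORE,
    "+U", RowKind.UPDATE_AFTER);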
@Test
public void testChangeLogOnIdKey() throws Exception {
  List<String> equalityFieldIds = ImmutableList.of("id");
Should this be equalityFieldNames?
Yes.
List<String> equalityFieldIds = ImmutableList.of("id");
List<List<Row>> elementsPerCheckpoint = ImmutableList.of(
    ImmutableList.of(
        row("+I", 1, "aaa"),
Minor: This makes it look like the row has an operation as its first column, but that doesn't align with the key selector below that uses row.getField(0) to get the ID. I think it would make tests easier to read if row passed the row kind at the end. That way the fields align.
I'm not sure if it is worth changing all of the rows. Up to you.
The current way is correct because the row kind is maintained in a separate field (rather than in the shared fields array), see here.
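A tiny sketch of that point, assuming Flink's Row API: the row kind is stored in the Row's dedicated kind slot, so it never occupies a positional field and getField(0) still returns the id.

Row row = Row.of(1, "aaa");    // positional fields: [id, data]
row.setKind(RowKind.INSERT);   // the kind lives outside the fields array
// so a key selector using row.getField(0) still keys on the id, not on the row kind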
rdblue left a comment
I left a few minor comments, but nothing is a blocker.
All checks passed. I've merged this patch to the repo so that I can create the next PR for the Flink table CDC e2e unit tests. Thanks @rdblue for reviewing.
Add unit tests to prove that a Flink DataStream job can write CDC events correctly. Will open a separate PR to address the Flink SQL unit tests.