
Conversation

@openinx
Member

@openinx openinx commented Dec 23, 2020

To verify that flink SQL can write CDC/upsert events to an iceberg table correctly, I made a few changes to make it work:

  1. Map the flink CREATE TABLE DDL's primary key to iceberg's equality field columns: when creating a flink table with a primary key, it will create an iceberg table whose equality.field.columns property indicates which columns are equality columns. Apache Flink does not support altering the primary key in DDL, so we don't support altering equality field columns in SQL; schema evolution will need the iceberg Java API (see the sketch after this list).

  2. I exposed iceberg format v2 to end users in this patch because I found that the iceberg catalog has no way to share TableTestBase. It's not the right time to expose format v2, so I will try to find another way to make the unit tests work.

  3. Provide a basic unit test to verify that the SQL job works. More unit tests are needed; this is still in progress.
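
A minimal sketch of what item 1 looks like from the SQL side, using the sql() helper from the tests; the table name placeholder, the WITH options, and the toWithClause helper are illustrative assumptions, not part of this patch:

  // Creating a flink table with a primary key through the iceberg catalog ...
  sql("CREATE TABLE %s (id INT, data VARCHAR, PRIMARY KEY(id) NOT ENFORCED) WITH %s",
      tableName, toWithClause(tableProps));
  // ... is expected to leave this property on the underlying iceberg table:
  //   equality.field.columns = id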

@openinx openinx self-assigned this Dec 23, 2020
@openinx openinx marked this pull request as draft December 23, 2020 14:56
@openinx openinx marked this pull request as ready for review December 25, 2020 13:48
}

private static List<String> toEqualityColumns(TableSchema schema) {
  List<String> equalityColumns = Lists.newArrayList();
  schema.getPrimaryKey().ifPresent(uniqueConstraint -> equalityColumns.addAll(uniqueConstraint.getColumns()));
  return equalityColumns;
}
Contributor

It seems like we should track primary key columns in the table format rather than in properties. If we are going to add primary key columns, then we should add them before v2 so that v2 writers are required not to drop them.

}

@Test
public void testSqlChangeLogOnIdKey() throws Exception {
Contributor

These tests look the same as the non-SQL tests. Is there a way to set up the test and expected data for both?

Contributor

@zhangjun0x01 zhangjun0x01 Feb 8, 2021

I read the source code of the flink CDC test class and found that it uses Docker to start a MySQL service to produce the CDC data. It seems there really is no good way to get CDC data in flink tests.

    List<List<Row>> inputRowsPerCheckpoint,
    List<List<Record>> expectedRecordsPerCheckpoint) throws Exception {
  String partitionByCause = partitioned ? "PARTITIONED BY (data)" : "";
  sql("CREATE TABLE %s (id INT, data VARCHAR, PRIMARY KEY(%s) NOT ENFORCED) %s WITH %s",
Contributor

Instead of creating v2 tables using the table property, I think it would be better to load the table just after creation and update it using the API. That way you can test with a v2 table, but we don't need to change any of the core classes or expose the format before it is finalized.

Member Author

OK, I see there's an upgradeToFormatVersion API used to do the upgrade in TestSparkReaderDeletes.java; I can use a similar approach.
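
For reference, a sketch of that approach, mirroring the upgrade in TestSparkReaderDeletes (assuming the test already holds the freshly created Table instance):

  // Bump the new table to format v2 through TableOperations, so the tests can
  // exercise v2 behavior without exposing a format-version table property.
  TableOperations ops = ((BaseTable) table).operations();
  TableMetadata meta = ops.current();
  ops.commit(meta, meta.upgradeToFormatVersion(2));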

if (schema.getPrimaryKey().isPresent()) {
  throw new UnsupportedOperationException("Creating table with primary key is not supported yet.");
}

private static List<String> toEqualityColumns(TableSchema schema) {
Contributor

In the format, I think that equality columns should be tracked by ID rather than by name so that renaming columns doesn't break the primary key metadata.

Member Author

Yes, it should track IDs here. I was thinking this could be a temporary solution before introducing the primary key specification. Considering the renaming issue, it doesn't seem like a good idea to track column names in iceberg properties.

I will create another PR to introduce the primary key specification based on the discussion from here.
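
As a rough illustration of tracking by ID rather than by name (a hypothetical sketch, not what this patch does; icebergSchema is assumed to be the converted iceberg Schema), the names could be resolved once and the field IDs stored instead:

  // Resolve equality column names to field IDs so later renames don't break the metadata.
  Set<Integer> equalityFieldIds = Sets.newHashSet();
  for (String column : equalityColumns) {
    Types.NestedField field = icebergSchema.findField(column);
    Preconditions.checkNotNull(field, "Cannot find column '%s' in schema", column);
    equalityFieldIds.add(field.fieldId());
  }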

List<String> equalityFieldColumns = toEqualityColumns(table.getSchema());
if (!equalityFieldColumns.isEmpty()) {
  properties.put(TableProperties.EQUALITY_FIELD_COLUMNS,
      org.apache.commons.lang.StringUtils.join(equalityFieldColumns, ","));
}
Contributor

We try to avoid using StringUtils and other commons classes because we don't want to leak the dependency. Could you use Guava's Joiner instead?

Member Author

Sounds great. I did not realize that there's a Joiner in Guava. Thanks.
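
The write path with Guava's Joiner would then look roughly like this:

  // Join the equality column names with Guava's Joiner instead of commons-lang.
  properties.put(TableProperties.EQUALITY_FIELD_COLUMNS,
      Joiner.on(',').join(equalityFieldColumns));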

String concatColumns = PropertyUtil.propertyAsString(table.properties(),
    TableProperties.EQUALITY_FIELD_COLUMNS,
    TableProperties.DEFAULT_EQUALITY_FIELD_COLUMNS);
String[] columns = org.apache.commons.lang3.StringUtils.split(concatColumns, ",");
Contributor

Same here. Could you use Splitter instead?
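
A sketch of the read path with Guava's Splitter:

  // Split the comma-separated property value with Guava's Splitter instead of commons-lang3.
  List<String> columns = Splitter.on(',')
      .trimResults()
      .omitEmptyStrings()
      .splitToList(concatColumns);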

@zhangjun0x01
Contributor

Currently, IcebergTableSink only implements AppendStreamTableSink, not UpsertStreamTableSink, but in Flink 1.12 both interfaces have been deprecated. If we want to use Flink SQL to write CDC data to iceberg, I think we need to refactor IcebergTableSink onto the new DynamicTableSink interface first. I have done part of this refactoring work, and after it is completed I will open a PR.
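
A minimal sketch of what a DynamicTableSink-based sink could declare so the planner passes CDC row kinds through to it; the class name and method bodies are illustrative assumptions, not the actual refactor:

  import org.apache.flink.table.connector.ChangelogMode;
  import org.apache.flink.table.connector.sink.DynamicTableSink;
  import org.apache.flink.types.RowKind;

  public class IcebergDynamicSinkSketch implements DynamicTableSink {

    @Override
    public ChangelogMode getChangelogMode(ChangelogMode requestedMode) {
      // Accept inserts, updates and deletes so upsert/CDC streams are allowed.
      return ChangelogMode.newBuilder()
          .addContainedKind(RowKind.INSERT)
          .addContainedKind(RowKind.UPDATE_BEFORE)
          .addContainedKind(RowKind.UPDATE_AFTER)
          .addContainedKind(RowKind.DELETE)
          .build();
    }

    @Override
    public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
      // The real refactor would bridge to the existing FlinkSink builder here.
      throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public DynamicTableSink copy() {
      return new IcebergDynamicSinkSketch();
    }

    @Override
    public String asSummaryString() {
      return "iceberg-dynamic-sink-sketch";
    }
  }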

@openinx
Member Author

openinx commented Sep 7, 2021

Closed via #2410.

@openinx openinx closed this Sep 7, 2021
@openinx openinx deleted the flink-sql-pk branch September 7, 2021 08:35