
Conversation

@openinx (Member) commented on Aug 11, 2020:

This patch addresses the third point in issue #1305:

  1. We will need an extra patch to do the refactor that replaces the Row type with RowData (I have implemented one in my own branch 2af37c5) and makes sure all unit tests pass. From that point on, all Flink development and unit tests will use RowData.

After this patch, only FlinkParquetReaders, FlinkParquetWriters, RandomData, and TestFlinkParquetReaderWriter will still reference the Row data type.


Once the pull requests for the RowData parquet readers (#1266) and writers (#1272) are merged, there should be no other core classes that use the Row data type.

this.targetFileSizeBytes = targetFileSizeBytes;
this.format = format;
this.appenderFactory = new FlinkFileAppenderFactory(schema, tableProperties);
this.flinkSchema = FlinkSchemaUtil.convert(schema);
Contributor:

In Spark, we had a bug where Spark may produce a row with a short, which is stored as an int in Iceberg. In a CTAS query, data would actually get passed to Iceberg with the short and we would end up with a ClassCastException. That's why we now pass the dataset schema when creating writers. You might want to watch out for a similar bug.

Member Author:

Thanks for the great reminder, I think the bug you mentioned is #999. Flink doesn't support CTAS, but it does support INSERT INTO iceberg_table SELECT * FROM table_2. If table_2 has a TINYINT or SMALLINT column, the values in the BinaryRowData produced by the SELECT will be byte or short, so we also need the raw Flink schema (rather than the Flink schema converted from the Iceberg schema) to read those values from the BinaryRowData and write the byte or short values as integers. Let me consider how to fix this.
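To illustrate the failure mode, here is a minimal hypothetical sketch (not code from this PR, and using a GenericRowData for simplicity): when the Flink schema converted from the Iceberg schema says INT but the row actually carries a SMALLINT value, reading it with the int accessor fails at runtime.

import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;

public class SmallIntWideningDemo {
  public static void main(String[] args) {
    // The SELECT produces a SMALLINT column, so the row holds a Short.
    RowData row = GenericRowData.of((short) 42);
    // Reading with the accessor implied by the Iceberg-converted schema (INT)
    // throws ClassCastException: Short cannot be cast to Integer.
    int value = row.getInt(0);
    System.out.println(value);
  }
}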

Contributor:

This only appears in Spark with a CTAS query because that's the only time that Spark doesn't get a schema back from the table. When Spark has a table schema, it will automatically insert casts to the appropriate types so this problem doesn't happen. I'm not sure if Flink does that, but if it does then you wouldn't need to worry about that bug.

@chenjunjiedada (Collaborator):

@openinx, I rebased #1272 and found failing unit tests. It turns out #1272 depends on this, so I think this should be merged before #1272. #1266 could be merged first, though.

private static PositionalGetter<?> buildGetter(LogicalType logicalType, Type type) {
switch (type.typeId()) {
case STRING:
switch (logicalType.getTypeRoot()) {
Member Author:

I changed the parameter here from the Iceberg type to Flink's logical type because the values of TINYINT and SMALLINT columns are byte and short, and casting a byte or short to Integer here throws a cast failure exception. Using the logical type lets us read the value with the right accessor and cast it to an integer correctly.
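As a minimal sketch of the idea (hypothetical helper name; it assumes PositionalGetter is a (RowData, int) -> value interface like the one in this PR): dispatch on the Flink logical type root so TINYINT and SMALLINT values are read with the matching accessor and widened to int, instead of calling getInt() on a byte or short field.

// Hypothetical sketch, not the PR's exact buildGetter implementation.
private static PositionalGetter<Integer> integerGetter(LogicalType logicalType) {
  switch (logicalType.getTypeRoot()) {
    case TINYINT:
      // The row stores a byte; widen it to int explicitly.
      return (row, pos) -> (int) row.getByte(pos);
    case SMALLINT:
      // The row stores a short; widen it to int explicitly.
      return (row, pos) -> (int) row.getShort(pos);
    default:
      return (row, pos) -> row.getInt(pos);
  }
}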

return new UUID(mostSigBits, leastSigBits);
};
case VARBINARY:
if (Type.TypeID.UUID.equals(type.typeId())) {
Contributor:

I think an identity check would be okay since this is an enum symbol, but either way is fine.
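For reference, the identity form being suggested would look like this; enum constants are singletons, so it behaves the same as equals():

// Reference equality is safe for enum constants such as Type.TypeID.UUID.
boolean isUuid = type.typeId() == Type.TypeID.UUID;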

ByteBuffer bb = ByteBuffer.wrap(row.getBinary(pos));
long mostSigBits = bb.getLong();
long leastSigBits = bb.getLong();
return new UUID(mostSigBits, leastSigBits);
Contributor:

Looks like another area where we should have a util method to convert (though it shouldn't block this commit).

Member Author:

That's true; we could do that in a separate pull request.
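A rough sketch of the shared helper being discussed (hypothetical class and method names, not the eventual follow-up):

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidConversions {
  // Converts a 16-byte big-endian representation into a java.util.UUID.
  public static UUID uuidFromBytes(byte[] bytes) {
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    long mostSigBits = bb.getLong();
    long leastSigBits = bb.getLong();
    return new UUID(mostSigBits, leastSigBits);
  }
}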

TimestampType timestampType = (TimestampType) logicalType;
return (row, pos) -> {
LocalDateTime localDateTime = row.getTimestamp(pos, timestampType.getPrecision()).toLocalDateTime();
return DateTimeUtil.microsFromTimestamp(localDateTime);
Contributor:

Why not use the same logic here that is used for the other timestamp type? Both values are the TimestampData returned by getTimestamp. It seems better to convert directly to a microsecond value than to go through LocalDateTime here.
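For example, the direct conversion could look roughly like this (hypothetical helper; TimestampData exposes the milliseconds and nanos-of-millisecond it stores):

import org.apache.flink.table.data.TimestampData;

public class TimestampConversions {
  // Converts Flink's TimestampData straight to microseconds since epoch,
  // without materializing a LocalDateTime.
  public static long microsFromTimestampData(TimestampData ts) {
    return ts.getMillisecond() * 1000 + ts.getNanoOfMillisecond() / 1000;
  }
}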

Member Author:

Yeah, converting TimestampData to a long could be the same for both. I separated them because the timestamp types are different, and we depend on TimestampType.getPrecision() or LocalZonedTimestampType.getPrecision() to get the precision (we could use the constant 6 here, but it's better to use the type's precision getter).

for (RowData row : rows) {
Integer id = row.isNullAt(0) ? null : row.getInt(0);
String data = row.isNullAt(1) ? null : row.getString(1).toString();
records.add(createRecord(id, data));
Contributor:

This could also use the assertEquals implementation that @chenjunjiedada has added in #1266. That would be better than converting a specific record type.

Member Author:

Makes sense.

@rdblue (Contributor) commented on Aug 12, 2020:

I have a few minor comments, but I don't think any of them should block this going in. I think it is correct. We can make the tests better by using assertEquals between generic records and Flink RowData in follow-ups once #1266 is in.

@rdblue merged commit a6a511a into apache:master on Aug 12, 2020
@rdblue (Contributor) commented on Aug 12, 2020:

Merged. Thanks for working on this, @openinx!
