Conversation

@openinx (Member) commented Jul 7, 2020

This patch abstracts the common code of PartitionKey into the newly introduced class BasePartitionKey; both the Spark PartitionKey and the Flink PartitionKey extend this base class. I also added unit tests for the Flink PartitionKey.

@rdblue (Contributor) commented Jul 10, 2020

@openinx, now that the RC for 0.9.0 is out, I should be able to pick back up on Flink reviews tomorrow. I'll probably start with this one since we need to clean this up. Thanks!

@rdblue (Contributor) commented Jul 10, 2020

@openinx, I opened an alternative to this PR, #1195. Please take a look.

This solution looks fairly clean for producing a PartitionKey for a specific format, but it requires building a subclass of PartitionKey for every row representation as well as new Accessor classes. I'd like to make it possible to reuse the existing PartitionKey class as well as the existing Accessor implementations (produced by Schema.accessorForField(id)) that are currently used for expression evaluation.

The approach I took in the other PR is to reuse the existing accessors, which accept a StructLike. To make that work, I just needed to add a wrapper class that adapts Spark's InternalRow to StructLike, and that converts Spark objects to Iceberg's internal representation. I think that's going to be a better long-term approach than multiple PartitionKey classes.
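
A minimal sketch of that wrapper idea, for illustration only: the class name InternalRowWrapper and the conversion details here are assumptions, not necessarily what #1195 actually does.

import java.util.Arrays;

import org.apache.iceberg.StructLike;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Adapts Spark's InternalRow to Iceberg's StructLike, so the existing
// PartitionKey and Accessor implementations can read Spark rows directly.
class InternalRowWrapper implements StructLike {
  private final DataType[] types;
  private InternalRow row = null;

  InternalRowWrapper(StructType sparkType) {
    this.types = Arrays.stream(sparkType.fields())
        .map(StructField::dataType)
        .toArray(DataType[]::new);
  }

  // Reuse one wrapper across rows to avoid a per-row allocation.
  InternalRowWrapper wrap(InternalRow newRow) {
    this.row = newRow;
    return this;
  }

  @Override
  public int size() {
    return types.length;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    // A full implementation would also convert Spark-internal values
    // (UTF8String, Decimal, byte[]) to Iceberg's internal representation here.
    return javaClass.cast(row.get(pos, types[pos]));
  }

  @Override
  public <T> void set(int pos, T value) {
    throw new UnsupportedOperationException("Wrapper does not support set()");
  }
}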

@openinx openinx changed the title Generilize the BasePartitionkey to abstract the common codes for spark and flink. Flink: add flink row PartitionKey. Jul 14, 2020
@openinx
Copy link
Member Author

openinx commented Jul 14, 2020

Ping @rdblue for review.

}

private static PositionalGetter buildGetter(Type type) {
  if (type instanceof Types.StructType) {

@rdblue (Contributor) commented on this code:

The objects returned by this wrapper need to be Iceberg's internal representation:

  • int for DateType: number of days from epoch
  • long for TimeType: number of microseconds from midnight
  • long for both timestamp types (with and without zone): number of microseconds from epoch
  • ByteBuffer for both fixed(L) and binary types
  • BigDecimal for decimal(P, S)

Because Flink uses the same in-memory representation as Iceberg generics, this should use the same conversions that we use for Record.
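
For illustration, a minimal sketch of the conversions listed above, written against java.time directly; Iceberg's own DateTimeUtil provides equivalent helpers, and the class and method names below are assumptions.

import java.nio.ByteBuffer;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

class InternalValueConversions {
  // date -> int: days from 1970-01-01
  static int daysFromDate(LocalDate date) {
    return (int) date.toEpochDay();
  }

  // time -> long: microseconds from midnight
  static long microsFromTime(LocalTime time) {
    return time.toNanoOfDay() / 1_000;
  }

  // timestamp -> long: microseconds from epoch
  static long microsFromTimestamp(LocalDateTime timestamp) {
    long epochSecond = timestamp.toEpochSecond(ZoneOffset.UTC);
    return epochSecond * 1_000_000 + timestamp.getNano() / 1_000;
  }

  // fixed(L) / binary -> ByteBuffer
  static ByteBuffer bufferFromBytes(byte[] bytes) {
    return ByteBuffer.wrap(bytes);
  }
}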

@openinx (Member, author) replied:

Thanks for the details. We discussed this here; maybe you want to take a look :-)

@rdblue (Contributor) replied:

Yes, this needs to convert to the representation that internal classes use.

Iceberg's generic data model is intended for passing data to and from Java applications, which is why it uses friendlier classes. It is up to each data model, whether Iceberg generics or Flink's, to convert to the internal representation; Iceberg core should modify data as little as possible.

import org.junit.Assert;
import org.junit.Test;

public class TestPartitionKey {

@rdblue (Contributor) commented on this code:

Can you add a test based on Spark's TestPartitionValues? That tests every supported type, null values, and different column orders.

partitionKey2.partition(rowWrapper.wrap(row));
Assert.assertEquals(1, partitionKey2.size());
Assert.assertEquals(200, (int) partitionKey2.get(0, Integer.class));
Assert.assertEquals("structType.innerIntegerType=200", partitionKey2.toPath());

@rdblue (Contributor) commented on this code:

Can you split each of these blocks into a separate test case? This method mixes many different cases together, which makes it harder to see what is broken when tests fail: you don't get a picture of what is common across the failed cases, because many of them never run.
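
A sketch of the suggested one-case-per-test structure, reusing the assertions quoted above; the schema, spec, and rowOf helper here are assumptions for illustration, not the PR's actual fixture.

import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.types.Types;
import org.junit.Assert;
import org.junit.Test;

public class TestPartitionKeySplit {
  // One nested struct column, partitioned by identity on the inner field.
  private static final Schema SCHEMA = new Schema(
      Types.NestedField.required(1, "structType", Types.StructType.of(
          Types.NestedField.optional(2, "innerIntegerType", Types.IntegerType.get()))));

  private static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
      .identity("structType.innerIntegerType")
      .build();

  // Tiny StructLike backed by an Object[], standing in for a wrapped row.
  private static StructLike rowOf(Object... values) {
    return new StructLike() {
      @Override
      public int size() {
        return values.length;
      }

      @Override
      public <T> T get(int pos, Class<T> javaClass) {
        return javaClass.cast(values[pos]);
      }

      @Override
      public <T> void set(int pos, T value) {
        throw new UnsupportedOperationException();
      }
    };
  }

  // Each block from the combined method becomes its own test, so a failure
  // pinpoints exactly one case.
  @Test
  public void testNestedIntegerPartition() {
    PartitionKey partitionKey = new PartitionKey(SPEC, SCHEMA);
    partitionKey.partition(rowOf(rowOf(200)));

    Assert.assertEquals(1, partitionKey.size());
    Assert.assertEquals(200, (int) partitionKey.get(0, Integer.class));
    Assert.assertEquals("structType.innerIntegerType=200", partitionKey.toPath());
  }
}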

@rdblue rdblue changed the title Flink: add flink row PartitionKey. Flink: Add wrapper to adapt Row to StructLike Jul 14, 2020
@openinx (Member, author) commented Jul 15, 2020

Addressed all the comments. Please take another look, thanks @rdblue.

@rdblue rdblue closed this Jul 15, 2020
@rdblue rdblue reopened this Jul 15, 2020
@rdblue rdblue merged commit d1b6d16 into apache:master Jul 15, 2020
@rdblue (Contributor) commented Jul 15, 2020

I reopened this to run tests against master with the CI fix. It's passing tests so I'll merge it. Thanks, @openinx!

HotSushi pushed a commit to HotSushi/iceberg that referenced this pull request Jul 23, 2020
@openinx openinx deleted the generalize-partition-key branch August 1, 2020 13:08
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020