
Conversation


@rdblue rdblue commented Dec 12, 2020

This adds a new API for partition spec evolution and an implementation that produces updated specs. It does not yet include adding the new API to Table and Transaction, adding a last assigned partition ID to metadata, or documentation for the new API. This PR is primarily the implementation and tests, to be followed by further integration and docs.

return add(sourceId, fieldId, name, Transforms.fromString(column.type(), transform));
}

Builder add(int sourceId, int fieldId, String name, Transform<?, ?> transform) {
rdblue (author) commented:
Needed to add fields using a Transform rather than a string.
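
For illustration, a hedged sketch of what the new overload enables; the field IDs and the Transforms.bucket arguments are made up for the example, not taken from this PR:

// build a partition field directly from a Transform instance instead of re-parsing its string form
builder.add(sourceId, fieldId, "id_bucket", Transforms.bucket(Types.IntegerType.get(), 16));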

public Builder alwaysNull(String sourceName, String targetName) {
checkAndAddPartitionName(targetName);
Types.NestedField sourceColumn = findSourceColumn(sourceName);
checkAndAddPartitionName(targetName, sourceColumn.fieldId()); // can duplicate a source column name
rdblue (author) commented:
Needed to allow identity partitions that have been deleted.
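
For context, a hedged sketch of the case this enables: a v1 spec where an identity partition field has been deleted and is carried forward as an always-null (void) field that reuses the source column's name. The builder calls follow the snippet above, but the overall spec layout here is an assumption:

// the deleted identity partition on "category" keeps its source-duplicating name,
// but its transform becomes void, so it always produces null
PartitionSpec evolved = PartitionSpec.builderFor(schema)
    .alwaysNull("category", "category")
    .bucket("id", 16, "shard")
    .build();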

* will not be resolved and will result in a {@link CommitFailedException}.
*/
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
UpdatePartitionSpec addField(String sourceName);
rdblue (author) commented:
Will add Javadoc for these methods after there is agreement on this API.
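
As a hedged sketch of how the API might read once it is wired into Table (which this PR intentionally leaves out), assuming an updateSpec() entry point and the Expressions transform helpers:

table.updateSpec()
    .addField("category")                             // identity partition on an existing column
    .addField("shard", Expressions.bucket("id", 16))  // named bucket partition field
    .removeField("legacy_field")
    .renameField("ts_day", "day")
    .commit();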

return results;
}

static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
rdblue (author) commented:
Refactored this out of the method above so that individual fields can be visited when setting default partition names based on the transform.

@rdblue rdblue force-pushed the add-partition-spec-evolution branch from 9d39ee1 to b141533 on December 12, 2020 02:01

rdblue commented Dec 12, 2020

@jun-he, could you look at this? It implements what we discussed for partition spec evolution.

@rdblue rdblue force-pushed the add-partition-spec-evolution branch from b141533 to d1bd036 on December 12, 2020 21:11
import static org.apache.iceberg.expressions.Expressions.truncate;
import static org.apache.iceberg.expressions.Expressions.year;

public class TestUpdatePartitionSpecV1 {

A reviewer commented:

Maybe add a test for .removeField("myField").addField("myField")? What's the expected outcome for that?

rdblue (author) replied:

The expected result of any combination should be the same as if the two were done as separate changes, except for cases that signal some problem. For example, renaming a field and deleting the original field shows that there are two conflicting changes, even though we could apply one and then fail the second.

I agree that this should have a test for that case, as well as possibly some additional checks for consistency with rename.
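
For instance, a hedged sketch of the rename-plus-delete conflict mentioned above; the updateSpec() entry point, the exception type, and the error message are assumptions for illustration:

AssertHelpers.assertThrows("Should reject renaming a field that is also being deleted",
    IllegalArgumentException.class, "Cannot rename and delete partition field",
    () -> table.updateSpec()
        .renameField("shard", "partition_shard")
        .removeField("shard")
        .apply());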


@johnclara johnclara Dec 14, 2020


Cool, so would order matter?
Is .removeField("myField").addField("myField") different from .addField("myField").removeField("myField")?


@rdblue rdblue Dec 14, 2020


Yes. Imagine the table has the partition shard=bucket(id, 16) and I run .removeField("shard").addField("shard", bucket(id, 32)). That replaces the 16-bin bucket scheme with a 32-bin bucket scheme. The opposite order would add a bucket partition and then immediately remove it, which looks like a mistake. So the second ordering would be rejected, while the first would be allowed.
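
Roughly, and assuming the updateSpec() entry point and the Expressions.bucket helper, the two orderings look like this:

// allowed: replaces the 16-bucket scheme with a 32-bucket scheme
table.updateSpec()
    .removeField("shard")
    .addField("shard", Expressions.bucket("id", 32))
    .commit();

// rejected: adds a bucket field and then removes it in the same change, which looks like a mistake
table.updateSpec()
    .addField("shard", Expressions.bucket("id", 32))
    .removeField("shard")
    .commit();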

rdblue (author) commented:

I added quite a few more test cases for situations like this and improved the error messages. Now, if you attempt to rename a field that was added, you get an error that clearly says you can't rename an added field, instead of an error that says the field is unknown. The same applies to combinations of adds and deletes.

I also added a check that disallows adding redundant time partitions, like days("ts") and hours("ts"). That's allowed if the partition already existed, but not when adding new partition fields.
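
A hedged example of that check; the Expressions.day/hour helpers and the failure mode are assumptions for illustration:

// rejected: two time-based partitions over the same source column in a single update
table.updateSpec()
    .addField(Expressions.day("ts"))
    .addField(Expressions.hour("ts"))
    .commit();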


@danielcweeks danielcweeks left a comment


API looks fine to me. The only open question I have is what about type promotion of partition fields. We allow for add/remove/rename, but what if you want to promote int to String (etc.)?


rdblue commented Dec 16, 2020

> The only open question I have is what about type promotion of partition fields.

Partition fields store the type output by the partition transform. In a lot of cases, that won't change. For example, bucket always produces ints no matter what the input type is. Some transforms, like identity, produce values of the input type. The only way to change the output type for those transforms is to change the input type through column promotion. When an identity partition column is promoted from int to long or from float to double, the partition column is promoted automatically as well and the existing values are promoted on read.

In case you're wondering, we were also careful to make sure type promotion doesn't change the result of any partition function. If promoting a value after the transformation would not give the same result as transforming the promoted value, then the promotion must happen before the transformation: bucket(int_col, width) is actually computed as bucket((long) int_col, width). (This is also one reason why promotion to String is dangerous and not supported.)
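
A minimal, self-contained sketch of that invariant; the hash below is a stand-in, not Iceberg's Murmur3-based bucket hash, and the point is only that the int path promotes to long before hashing:

final class BucketPromotionSketch {
  // stand-in for the real bucket hash function
  static int hashOfLong(long value) {
    return Long.hashCode(value);
  }

  // bucket(int_col, width) promotes to long before hashing...
  static int bucketOfInt(int value, int numBuckets) {
    return Math.floorMod(hashOfLong((long) value), numBuckets);
  }

  // ...so it always matches bucket(long_col, width) for the same value, and files
  // written before an int -> long promotion stay in the correct buckets afterward
  static int bucketOfLong(long value, int numBuckets) {
    return Math.floorMod(hashOfLong(value), numBuckets);
  }
}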


rdblue commented Dec 16, 2020

Thanks for reviewing, @danielcweeks and @johnclara! I'll merge this.
