Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Apr 15, 2020

This adds a new transform function, void, that always produces a null value. Because void and null are Java keywords, the PartitionSpecBuilder is configured using alwaysNull.

The purpose of this transform is to be a stand-in for partition transforms that are removed from a spec. In the v1 table format, IDs for partition fields are not tracked by PartitionSpec. Instead, they are assigned starting at 1000 for each spec. Because tables may have more than one spec, manifest files could have incompatible partition field structs. This is not a problem for job planning because each manifest is read independently, but it can break metadata tables that show a union of all manifest data files or entries.

The void transform can be used to avoid a problem with ID assignment. If a table has two partition fields, 1000: categorical string, 1001: ts_day int, then removing the categorical partition will create a new partition spec with 1000: ts_day int. That would create a problem in the metadata tables. Instead of deleting the categorical partition, it should be replaced with a void partition to keep the IDs aligned: 1000: always_null string, 1001: ts_day int.

@rdblue rdblue requested a review from danielcweeks April 15, 2020 19:09
@rdblue
Copy link
Contributor Author

rdblue commented Apr 15, 2020

@jun-he, FYI. This is related to #922. Instead of deleting partitions in v1, we can replace them with the void transform.

Comment on lines +64 to +71
public String toHumanString(Void value) {
return "null";
}

@Override
public String toString() {
return "void";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit here to think about . . . we seem to be using both void and null. Would it make more sense to just consistently use void as it seems to better indicate there there is no expected value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null is the human-readable string for the value produced by the transform. Void is the name of the transform. I considered naming it something like always_null but void seemed shorter and less error prone (was that alwaysNull or always-null?)

@danielcweeks
Copy link
Contributor

One minor comment, but +1 (pending checks)

@rdblue rdblue force-pushed the add-null-constant-transform branch from 659fcd4 to 688a37f Compare April 15, 2020 20:22
@rdblue rdblue merged commit e0a3304 into apache:master Apr 15, 2020
Fokko pushed a commit to Fokko/iceberg that referenced this pull request Apr 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants