Core: Add position and equality delta writer interfaces #3176
Conversation
import org.apache.iceberg.deletes.PositionDelete;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

public class BasePositionDeltaWriter<T> implements PositionDeltaWriter<T> {
This is the writer to use for Spark merge-on-read.
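For context, a minimal sketch of how a merge-on-read task could drive this writer, assuming the insert/delete/result shapes shown in the diff (the Row type, the pre-built writer, and the path/position values are placeholders, not part of this PR):

```java
import java.io.IOException;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;
import org.apache.iceberg.io.WriteResult;

// Hypothetical driver; not part of this PR.
class MergeOnReadTaskSketch<Row> {
  WriteResult writeDelta(PositionDeltaWriter<Row> writer, Row updatedRow,
                         CharSequence dataFilePath, long matchedPos,
                         PartitionSpec spec, StructLike partition) throws IOException {
    // mark the matched row as deleted in its original data file ...
    writer.delete(dataFilePath, matchedPos, spec, partition);
    // ... and write the updated copy as a new insert
    writer.insert(updatedRow, spec, partition);

    writer.close();          // flush data and position delete files
    return writer.result();  // data/delete files to attach to the row-delta commit
  }
}
```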
}

@Override
public WriteResult result() {
I am using the existing WriteResult that @openinx created. It has a builder and already takes care of converting values to arrays for serialization.
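For reference, a rough sketch of how a writer might assemble that result, assuming the addDataFiles/addDeleteFiles builder methods; the collected file lists are stand-ins for whatever the writer tracks internally:

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.io.WriteResult;

// Sketch only: dataFiles/deleteFiles stand in for whatever the writer has collected.
class WriteResultSketch {
  WriteResult toResult(List<DataFile> dataFiles, List<DeleteFile> deleteFiles) {
    // the builder converts the collections into the arrays that WriteResult serializes
    return WriteResult.builder()
        .addDataFiles(dataFiles)
        .addDeleteFiles(deleteFiles)
        .build();
  }
}
```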
 * @param spec a partition spec
 * @param partition a partition or null if the spec is unpartitioned
 */
void deleteKey(T key, PartitionSpec spec, StructLike partition);
@rdblue, this is the directDelete you mentioned. I also added docs about the schema expectations.
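To illustrate the schema expectation, here is a rough sketch that uses GenericRecord as an example key type; the field name, key schema, and writer construction are hypothetical, not part of this PR:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.EqualityDeltaWriter;
import org.apache.iceberg.types.Types;

// Illustrative only: the writer construction and field names are hypothetical.
class DeleteKeySketch {
  // the key schema contains exactly the equality fields, here a single "id" column
  private static final Schema KEY_SCHEMA =
      new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));

  void deleteById(EqualityDeltaWriter<Record> writer, long id,
                  PartitionSpec spec, StructLike partition) {
    Record key = GenericRecord.create(KEY_SCHEMA);
    key.setField("id", id);
    // the key's schema must match the writer's configured equality field IDs
    writer.deleteKey(key, spec, partition);
  }
}
```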
 *
 * @param <T> the row type
 */
public interface EqualityDeltaWriter<T> extends Closeable {
This one will be implemented by the CDC writer that I will submit in a separate PR. It is large.
 * @param spec a partition spec
 * @param partition a partition or null if the spec is unpartitioned
 */
void delete(CharSequence path, long pos, T row, PartitionSpec spec, StructLike partition);
@rdblue, I kept the optional row before spec and partition. In most new APIs, spec and partition are the last arguments, so even though row is optional, keeping spec and partition last seems more consistent.
/**
 * Deletes a key from the provided spec/partition.
 * <p>
 * This method assumes the delete key schema matches the equality field IDs.
I don't know what this means :)
I'll try to rephrase it then :)
Changed.
void insert(T row, PartitionSpec spec, StructLike partition);

/**
 * Deletes a position in the provided spec/partition without persisting the old row.
I'm not sure I understand what this one means either. Is the old row the original row at the position recorded in this delete file? Why would I be persisting it?
The spec allows us to persist the deleted row in positional delete files. This may be helpful to reconstruct CDC records or to persist the sort key for min/max filtering.
That being said, I don't plan to persist it from Spark.
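In code, that's the difference between the two delete overloads on the interface; a small sketch with placeholder values:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;

// Sketch only: writer, dataFilePath, pos, deletedRow, spec, and partition are placeholders.
class PositionDeleteOverloadsSketch<Row> {
  void deleteExamples(PositionDeltaWriter<Row> writer, CharSequence dataFilePath, long pos,
                      Row deletedRow, PartitionSpec spec, StructLike partition) {
    // delete a position only: nothing about the old row is written to the delete file
    writer.delete(dataFilePath, pos, spec, partition);

    // delete a position and also record the deleted row in the position delete file,
    // e.g. to reconstruct CDC records or to keep the sort key for min/max filtering
    writer.delete(dataFilePath, pos, deletedRow, spec, partition);
  }
}
```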
Ah, so it's "Delete a position and record the deleted row in the delete file" vs. "Delete a position".
Yeah, that's a good way to put it. I'll update.
Updated.
}

/**
 * Deletes a position in the provided spec/partition and persists the old row.
Same question as for the previous Javadoc.
RussellSpitzer left a comment
Other than the Javadoc comments, I think this is good to go. I would maybe add tests for "all delete" and "all insert" operations just to cover those edge cases, but the API looks good to me now that @aokolnychyi explained the purposes :)
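Something along these lines is what I had in mind for the delete-only and insert-only cases; this is only a rough sketch, and the newWriter/dataFilePath/spec/partition fixtures are placeholders, not the actual test code:

```java
import java.io.IOException;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;
import org.apache.iceberg.io.WriteResult;
import org.junit.Assert;
import org.junit.Test;

// Rough sketch of the edge cases; newWriter(), dataFilePath(), spec(), and partition() are placeholders.
public class TestDeltaWriterEdgeCasesSketch {

  @Test
  public void testDeleteOnlyDelta() throws IOException {
    PositionDeltaWriter<Object> writer = newWriter();
    writer.delete(dataFilePath(), 0L, spec(), partition());
    writer.close();

    WriteResult result = writer.result();
    Assert.assertEquals("Should produce no data files", 0, result.dataFiles().length);
    Assert.assertEquals("Should produce one delete file", 1, result.deleteFiles().length);
  }

  @Test
  public void testInsertOnlyDelta() throws IOException {
    PositionDeltaWriter<Object> writer = newWriter();
    writer.insert(new Object() /* placeholder row */, spec(), partition());
    writer.close();

    WriteResult result = writer.result();
    Assert.assertEquals("Should produce one data file", 1, result.dataFiles().length);
    Assert.assertEquals("Should produce no delete files", 0, result.deleteFiles().length);
  }

  // placeholders for the real test fixtures
  private PositionDeltaWriter<Object> newWriter() { throw new UnsupportedOperationException("fixture"); }
  private CharSequence dataFilePath() { throw new UnsupportedOperationException("fixture"); }
  private PartitionSpec spec() { throw new UnsupportedOperationException("fixture"); }
  private StructLike partition() { throw new UnsupportedOperationException("fixture"); }
}
```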
Updated the Javadoc and also added tests for delete-only and insert-only cases.

Looks great. Thanks for getting these in, @aokolnychyi!

Thanks for reviewing, @rdblue @RussellSpitzer!
This PR adds position and equality delta writer interfaces and contains a subset of changes in PR #2945.