Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

This PR adds position and equality delta writer interfaces and contains a subset of changes in PR #2945.

import org.apache.iceberg.deletes.PositionDelete;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

public class BasePositionDeltaWriter<T> implements PositionDeltaWriter<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the writer to use for Spark merge-on-read.

}

@Override
public WriteResult result() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the existing WriteResult that @openinx created. It has a builder and already takes care converting values to arrays for serialization.

* @param spec a partition spec
* @param partition a partition or null if the spec is unpartitioned
*/
void deleteKey(T key, PartitionSpec spec, StructLike partition);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, this is directDelete you mentioned. Also added docs about the schema expectations.

*
* @param <T> the row type
*/
public interface EqualityDeltaWriter<T> extends Closeable {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one will be implemented by the CDC writer that I will submit in a separate PR. It is large.

* @param spec a partition spec
* @param partition a partition or null if the spec is unpartitioned
*/
void delete(CharSequence path, long pos, T row, PartitionSpec spec, StructLike partition);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, I kept the optional row before spec and partition. In most new APIs, spec and partition are the last arguments so even though row is an optional argument, having spec and partition as last seems more consistent.

@aokolnychyi
Copy link
Contributor Author

/**
* Deletes a key from the provided spec/partition.
* <p>
* This method assumes the delete key schema matches the equality field IDs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what this means :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to rephrase it then :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

void insert(T row, PartitionSpec spec, StructLike partition);

/**
* Deletes a position in the provided spec/partition without persisting the old row.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this one means either, the old row is the original row matching the position of this delete file? Why would I be persisting it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec allows us to persist the deleted row in positional delete files. This may be helpful to reconstruct CDC records or to persist the sort key for min/max filtering.

That being said, I don't plan to persist it from Spark.

Copy link
Member

@RussellSpitzer RussellSpitzer Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah so "Delete a position and record the deleted row in the delete file" vs "Delete a position"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good way to put it. I'll update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

}

/**
* Deletes a position in the provided spec/partition and persists the old row.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as last javadoc

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than Java Doc comments I think this is good to go, I would maybe add tests for an "all delete" and "all insert" operation just to cover those edge cases but the api looks good to me now that @aokolnychyi explained the purposes :)

@aokolnychyi
Copy link
Contributor Author

Updated the Javadoc and also added tests for delete and insert only cases.

@rdblue rdblue merged commit ec200e7 into apache:master Sep 26, 2021
@rdblue
Copy link
Contributor

rdblue commented Sep 26, 2021

Looks great. Thanks for getting these in, @aokolnychyi!

@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @rdblue @RussellSpitzer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants