Conversation

@xinbinhuang
Contributor

@xinbinhuang xinbinhuang commented Sep 1, 2021

closes: #3014

@github-actions github-actions bot added the API label Sep 1, 2021
@rdblue
Contributor

rdblue commented Sep 20, 2021

I agree with Russell. It would be great to add some tests to this. The overall change looks good.

@github-actions github-actions bot added the core label Sep 26, 2021
@xinbinhuang xinbinhuang changed the title from "Returns UnpartitionedSpec for VoidTransform on all fields" to "Returns isUnpartitioned=true for VoidTransform on all fields" Sep 27, 2021
@xinbinhuang
Contributor Author

@RussellSpitzer @rdblue I added the tests, but I'm not sure I put them in the right place/file. PTAL :)

Member

@RussellSpitzer RussellSpitzer left a comment

Looks good to me. I only had a minor question about whether we should test v1 tables that have had spec columns added and then removed, but I think this isn't strictly necessary.

@xinbinhuang
Contributor Author

> Looks good to me. I only had a minor question about whether we should test v1 tables that have had spec columns added and then removed, but I think this isn't strictly necessary.

@RussellSpitzer Sorry for taking so long. I've added a test that checks for an unpartitioned spec after all columns are removed. PTAL.

@xinbinhuang
Contributor Author

@RussellSpitzer Spark is failing, but I don't think it's related to this PR. I'm going to rebase and force-push.

@RussellSpitzer
Member

@xinbinhuang Let's double-check that test. I'm worried that some part of the code assumes "isUnpartitioned" means an empty partition spec, when we now define it as "empty or all void". Let's make sure the error isn't caused by that.
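
To make the two definitions concrete, here is a minimal, self-contained sketch. It is not Iceberg's `PartitionSpec`; `VoidSpecSketch`, `of`, and `hasNoFields` are hypothetical names, and a spec is reduced to one transform name per field.

```java
import java.util.Arrays;

/** Simplified stand-in for a partition spec; not Iceberg's real PartitionSpec. */
final class VoidSpecSketch {
  // One transform name per partition field, e.g. "identity", "bucket", "void".
  private final String[] transforms;

  private VoidSpecSketch(String[] transforms) {
    this.transforms = transforms.clone();
  }

  /** Hypothetical factory: pass each field's transform name, or nothing for an empty spec. */
  static VoidSpecSketch of(String... transforms) {
    return new VoidSpecSketch(transforms);
  }

  /** Narrow definition: unpartitioned only when the spec has no fields at all. */
  boolean hasNoFields() {
    return transforms.length == 0;
  }

  /** Broad definition, "empty or all void": allMatch is true for an empty stream too. */
  boolean isUnpartitioned() {
    return Arrays.stream(transforms).allMatch("void"::equals);
  }

  public static void main(String[] args) {
    // A V1 table whose partition fields were all dropped keeps them as void transforms.
    VoidSpecSketch droppedV1 = VoidSpecSketch.of("void", "void");
    System.out.println(droppedV1.hasNoFields());     // false
    System.out.println(droppedV1.isUnpartitioned()); // true
  }
}
```

Any caller that treats the two checks as interchangeable will behave differently for the all-void case, which is the situation to look for in the failing test.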

@RussellSpitzer
Member

RussellSpitzer commented Nov 11, 2021

OK, so the test failure is real and is caused by the check here:

if (spec.isUnpartitioned()) {
  return new UnpartitionedDataWriter(writerFactory, fileFactory, io, spec, format, targetFileSize);

This check uses "isUnpartitioned" to decide whether to use UnpartitionedDataWriter. UnpartitionedDataWriter passes "null" through as the partition value for the spec, which causes an NPE here:

https://github.com/apache/iceberg/blame/9f07b83725e05206219368c020c3c772771a63d0/core/src/main/java/org/apache/iceberg/DataFiles.java#L50

Where we attempt to copy through the values into the spec (which is all void transforms).

Before this patch V1 Tables with only void transforms would be considered "partitioned" by this code and deal with this correctly (but perhaps more confusingly.)
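
The shape of that failure can be sketched in a few lines. This is an illustrative model only, not the code in DataFiles.java; `NullPartitionCopySketch`, `copyPartition`, and `failsOnNull` are hypothetical names, and the spec is reduced to a field count.

```java
/** Illustrative model of the failure; the names here are hypothetical, not Iceberg's. */
final class NullPartitionCopySketch {
  /** Copies one value per spec field out of the source partition tuple. */
  static Object[] copyPartition(int fieldCount, Object[] source) {
    Object[] copy = new Object[fieldCount];
    for (int pos = 0; pos < fieldCount; pos++) {
      copy[pos] = source[pos]; // throws NullPointerException when source is null
    }
    return copy;
  }

  /** Returns true if copying a null partition fails for a spec with this many fields. */
  static boolean failsOnNull(int fieldCount) {
    try {
      copyPartition(fieldCount, null);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(failsOnNull(0)); // false: empty spec, the loop never runs
    System.out.println(failsOnNull(2)); // true: two void fields, null source
  }
}
```

With an empty spec the null is never dereferenced, which is presumably why the unpartitioned writer could pass it safely until all-void specs started taking the same path.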

As far as I can tell we have a few ways to go forward here.

  1. The unpartitioned writer can be passed an empty version of the partition as the value, basically a StructLike whose values are all empty.
  2. The if statement that chooses between the unpartitioned and partitioned writers can check whether the spec has no fields, rather than whether the spec is unpartitioned.
  3. We change the copy method to always return early if the spec is unpartitioned, rather than trying to reset partition values.

I think I am leaning towards 1 or 3. I feel like we are dealing with an unpartitioned spec and because of V1 this means we have to be a bit careful about how we deal with that.
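
Options 2 and 3 can be sketched side by side. This is a rough model under stated assumptions, not a proposed patch: `WriterChoiceSketch` and its methods are hypothetical, a spec is reduced to an array of transform names, and "return early" is assumed to mean handing back the reusable buffer untouched.

```java
/** Hypothetical sketch of options 2 and 3; none of these names come from Iceberg. */
final class WriterChoiceSketch {
  /** "Empty or all void": true when every field (if any) uses the void transform. */
  static boolean isUnpartitioned(String[] transforms) {
    for (String transform : transforms) {
      if (!"void".equals(transform)) {
        return false;
      }
    }
    return true;
  }

  /** Option 2: pick the unpartitioned writer only when the spec has no fields at all. */
  static String chooseWriter(String[] transforms) {
    return transforms.length == 0 ? "unpartitioned" : "partitioned";
  }

  /** Option 3: return early for an unpartitioned spec so a null source is never read. */
  static Object[] copyPartition(String[] transforms, Object[] source, Object[] reuse) {
    if (isUnpartitioned(transforms)) {
      return reuse; // assumption: leave existing values alone instead of resetting them
    }
    Object[] target = reuse != null ? reuse : new Object[transforms.length];
    System.arraycopy(source, 0, target, 0, transforms.length);
    return target;
  }
}
```

Under option 2 an all-void V1 spec keeps going through the partitioned writer, as it did before the patch. Under option 3 it uses the unpartitioned writer and the copy step simply tolerates the null.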

Contributor Author

@xinbinhuang xinbinhuang left a comment

@RussellSpitzer @rdblue I did it slightly differently from what Russell suggested, as I think this is slightly cleaner and easier to reason about. WDYT?

Comment on lines +222 to +226
if (isPartitioned) {
  this.partitionData = copyPartitionData(spec, newPartition, partitionData);
}
Contributor Author

Only copy if the spec is partitionable.

Member

@RussellSpitzer RussellSpitzer Nov 17, 2021

I would rather we fix this in one of the other ways we discussed. This ends up feeling a little strange to me, since you can call "withPartition" and the argument is ignored. I don't like it when functions take arguments but then don't do anything with them.

Contributor Author

@aokolnychyi Do you mind taking a look at this change?

I had a chat with @RussellSpitzer offline and would love to hear your thoughts on this.

The current implementation

Alternatively, we are thinking:

GenericRecord emptyRecord = GenericRecord.create(spec.schema());
delegate = writerFactory.newDataWriter(outputFile, spec, emptyRecord);

at around here:

delegate = writerFactory.newDataWriter(outputFile, spec, null);

this.spec = spec;
this.specId = spec.specId();
- this.isPartitioned = spec.fields().size() > 0;
+ this.isPartitioned = !spec.isUnpartitioned();
Contributor Author

A more inclusive check for isPartitioned, so that it also covers V1 tables.

}

static PartitionData copyPartitionData(PartitionSpec spec, StructLike partitionData, PartitionData reuse) {
  Preconditions.checkArgument(!spec.isUnpartitioned(), "Can't copy partition data to a unpartitioned table");
Contributor Author

The underlying method explicitly checks isUnpartitioned() in case it is called from outside the builder (FileMetadata.java).
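
How the builder guard and the precondition fit together can be sketched as below. This is a simplified model, not the merged code: `BuilderGuardSketch` is a hypothetical class, a spec is reduced to transform names, and a plain `IllegalArgumentException` stands in for what `Preconditions.checkArgument` throws.

```java
import java.util.Arrays;

/** Simplified model of the guard plus precondition; names are hypothetical, not Iceberg's. */
final class BuilderGuardSketch {
  private final String[] transforms;
  private final boolean isPartitioned;
  private Object[] partitionData;

  BuilderGuardSketch(String... transforms) {
    this.transforms = transforms.clone();
    // Inclusive check: an all-void V1 spec counts as unpartitioned.
    this.isPartitioned = !isUnpartitioned(this.transforms);
  }

  static boolean isUnpartitioned(String[] transforms) {
    return Arrays.stream(transforms).allMatch("void"::equals);
  }

  /** Builder guard: the argument is ignored when the spec is unpartitioned. */
  BuilderGuardSketch withPartition(Object[] newPartition) {
    if (isPartitioned) {
      this.partitionData = copyPartitionData(transforms, newPartition);
    }
    return this;
  }

  /** Underlying copy: rejects unpartitioned specs in case it is called directly. */
  static Object[] copyPartitionData(String[] transforms, Object[] source) {
    if (isUnpartitioned(transforms)) {
      throw new IllegalArgumentException("Can't copy partition data to an unpartitioned table");
    }
    return Arrays.copyOf(source, transforms.length);
  }

  Object[] partitionData() {
    return partitionData;
  }
}
```

For example, `new BuilderGuardSketch("void").withPartition(null)` returns normally and leaves the partition data unset, while calling `copyPartitionData` directly with the same all-void spec throws. The silently ignored argument in the first case is the behavior the review comment above objects to.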

this.spec = spec;
this.specId = spec.specId();
- this.isPartitioned = spec.fields().size() > 0;
+ this.isPartitioned = !spec.isUnpartitioned();
Contributor Author

Same change as in FileMetadata.java.

if (isPartitioned) {
  this.partitionData = DataFiles.copyPartitionData(spec, newPartition, partitionData);
}
Contributor Author

Same change as in FileMetadata.java.

@xinbinhuang
Contributor Author

xinbinhuang commented Nov 30, 2021

@rdblue @aokolnychyi @kbendick Could you take a look when you have time? I would love to get this merged before it goes stale. Thank you :)

@RussellSpitzer
Member

Sorry it took me so long to get to this. I think this is good to go once the PR has been rebased.

@xinbinhuang
Contributor Author

(@RussellSpitzer Sorry, I didn't see your last message.)
@RussellSpitzer @rdblue Just rebased. PTAL.

@RussellSpitzer
Member

@xinbinhuang Looks good to me. Once tests pass, I think we are good to go.

@RussellSpitzer RussellSpitzer merged commit d350c9b into apache:master Dec 9, 2022
@RussellSpitzer
Member

My commit title was inverted, mea culpa. For anyone looking this up in the future, I meant that "all void transforms should be false".

Development

Successfully merging this pull request may close these issues.

PartitionSpec isUnpartitioned returns true for tables which previously had Partitions but no longer do

4 participants