Skip to content

Conversation

@xloya
Copy link
Contributor

@xloya xloya commented Sep 17, 2021

1. What this MR does:
When the equality delete set in DeleteFilter saves a piece of data read from an equal delete parquet file in memory, it will call the get() method of the GenericRecord class for data comparison.If the table schema definition of the current position of GenericRecord is DATE/TIMESTAMP/TIME type, it will be converted to LocalDate/LocalDateTime/LocalTime type in memory. When the get() method of GenericRecord is called, the java class Integer/Long/Long defined in the Types enum of the DATE/TIMESTAMP/TIME type is passed in. Because the type data of LocalDate/LocalDateTime/LocalTime is not an instance of Integer/Long/Long type, an IllegalStateException with Not an intance of xxx is thrown.

The pr first converts the data to the InternalRecordWrapper before saving the data in the equality delete set, and calls its get() method during comparison. If it is a type that needs to be converted such as DATE/TIMESTAMP/TIME, use the transforms parameter to convert, otherwise directly Output raw data type.
2. Which issue this PR fixes:
Fix issue #3119.

@rdblue @chenjunjiedada @openinx Could you help to take a look at this PR? Thanks in advance!

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2021

Running CI


Iterables.addAll(actualSet, actual.stream()
.map(record -> new InternalRecordWrapper(table2.schema().asStruct()).wrap(record))
.collect(Collectors.toList()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using InternalRecordWrapper here and for the expected set? I think that StructLikeSet should work without it, right? If not, then we need to update the set that is produced by rowSet to do the wrapping since future tests may not know to wrap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xloya, can you briefly describe what you did to address this?

Copy link
Contributor Author

@xloya xloya Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue OK. The original rowSetWithoutIds()method of DeleteReadTests and the original rowSet() method of TestGenericReaderDeletes returns both the GenericRecord class's data. In TestFlinkInputFormatReaderDeletes the rowSet() method returns RowDataWrapper class's data, and in TestSparkReaderDeletes the rowSet() method returns SparkStructLike class's data.Calling the Assert.assertEquals() method will compare the two StructLikeSets. If it is DATE/TIMESTAMP type of the data of GenericRecord, it will compare(GenericRecord, GenericRecord) and throw an exception of Not a instance of Integer/Long. If actual data class is RowDataWrapper or SparkStructLike, it will compare(GenericRecord, RowDataWrapper/SparkStructLike) and their class mismatch. so they both need to be converted to the InternalRecordWrapper class for comparison

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either the conversion at the assertion or the conversion at rowSetWithoutIds() and rowSet() is fine, but it must be converted.

);

DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
table2, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this delete file would be applied. The partition you're writing it to is Row.of(0), which ends up being 1970-01-01. That should prevent the delete file from being found when planning, so none of the deletes would be applied.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xloya, what was the resolution to this? Were the deletes not applied? Was the test incorrect and not working?

Copy link
Contributor Author

@xloya xloya Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue Previously, the data file was written using Row.of(0), which means that all data is written to the 1970-01-01 partition. So all the equality delete data before is also written to Row.of(0). Now it is changed that the data is written to the corresponding date time partition, and then the equality delete data is also written to the corresponding partition.
If keep the data file written to the partition as 1970-01-01, but the partition where the equality delete data is actually written is changed to 2021-09-01, it will not work.
Found spec on official website:
http://iceberg.apache.org/spec/
Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see Scan Planning for details. Column metrics can be used to determine whether a delete file’s rows overlap the contents of a data file or a scan range.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks!

@rdblue
Copy link
Contributor

rdblue commented Sep 20, 2021

Thanks, @xloya! The implementation looks correct to me, but there are a few things to fix in the tests.

@xloya
Copy link
Contributor Author

xloya commented Sep 21, 2021

Thanks @rdblue for your particular review! I will fix them according to your suggestion these days

@github-actions github-actions bot added the MR label Sep 22, 2021
xiaojiebao added 4 commits September 22, 2021 17:02
…f github.com:xloya/iceberg into fix-equality-delete-set-save-date/timestamp/etc-data
@xloya xloya requested a review from rdblue September 22, 2021 11:54
return set;
}

private StructLikeSet rowSetWithoutIds(Set<Integer> idSet, Table iTable, List<Record> recordList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the name iTable is odd. Why not just table?

Copy link
Contributor Author

@xloya xloya Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally used table, but code style check would report a hidden field error.Maybe there is a conflict with a property called table. iTable means iceberg table.

.map(record -> new InternalRecordWrapper(iTable.schema().asStruct()).wrap(record))
.forEach(set::add);
return set;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other rowSetWithoutIds method should be rewritten to call this one since it is more generic. That will also avoid the wrapper problem in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i'll rewrite the origin rowSetWithoutIds() method

}

private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
private StructLikeSet rowSetWithoutIds(Table iTable, List<Record> recordList, int... idsToRemove) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're passing the table in, can you make this a static method? You may also be able to use the name table instead of iTable if it is static.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helpful suggestion, it works, thanks!

@rdblue
Copy link
Contributor

rdblue commented Sep 28, 2021

Looks good to me. Running CI.

@rdblue rdblue added this to the Java 0.12.1 Release milestone Sep 28, 2021
@rdblue rdblue merged commit e52b681 into apache:master Sep 28, 2021
@rdblue
Copy link
Contributor

rdblue commented Sep 28, 2021

Thanks, @xloya! Merged.

@xloya
Copy link
Contributor Author

xloya commented Sep 29, 2021

Thanks for your time, @rdblue !

kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 26, 2021
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 27, 2021
rdblue pushed a commit that referenced this pull request Oct 29, 2021
izchen pushed a commit to izchen/iceberg that referenced this pull request Dec 7, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 13, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants