Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Apr 24, 2020

This includes commits from #952. Once that PR is merged, I'll remove them from this one.

This merges the fields from ManifestEntry into DataFile for v2 manifests. Now, v2 manifests store DataFile that has status, snapshot id, and sequence number. This should make v2 metadata easier to work with.

Other notable changes:

  • ManifestEntryWrapper has been combined with the v1 IndexedManifestEntry because both are wrapper classes specific to v1
  • For v2, an IndexedDataFile is added that is used for the same purpose as the v1 IndexedManifestEntry wrapper
  • To maintain compatibility, ManifestEntry.Status is unchanged. Instead, this adds FileStatus to iceberg-api
  • This refactor exposed an existing bug in DataFile, where get(int pos, Class<?> javaType) was returning values based on the position in an Avro projection instead of the position in the schema expressions bind to (see getInternal changes)
  • ManifestReader detects whether a manifest is v1 or v2 based on a metadata property in the Avro header; we will probably want to change this and track the version in the manifest list

@rdblue rdblue marked this pull request as draft April 24, 2020 16:43
@rdblue rdblue requested a review from aokolnychyi April 24, 2020 16:43

void checkEntry(ManifestEntry entry, Long expectedSequenceNumber) {
Assert.assertEquals("Status", ManifestEntry.Status.ADDED, entry.status());
void checkEntry(ManifestEntry entry, ManifestEntry.Status status, Long expectedSequenceNumber) {
Copy link
Contributor Author

@rdblue rdblue Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes how checkEntry and checkDataFile are used. This updates checkEntry to now validate the v1 view of the data by inspecting the ManifestEntry and the DataFile it wraps. Similarly, checkDataFile validates the v2 view where all fields are part of DataFile.

All of the test cases now use both.

@rdblue rdblue force-pushed the v2-manifests branch 2 times, most recently from f29cd94 to c72b334 Compare April 26, 2020 22:53
@aokolnychyi
Copy link
Contributor

Let me go through this now.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. A couple of questions.

pos = fromProjectionPos[i];
}

if (!(skipEntryFields && pos < 3)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel !skipEntryFields || pos >= 3 would be easier to understand

@Override
public int size() {
return 13;
return 15;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added 3 more fields but removed block size? The Indexed wrappers will still handle the block correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is what the v1 indexed wrapper is for. It can translate to the v1 format.

.rename("r102", PartitionData.class.getName())
.rename("data_file", GenericDataFile.class.getName())
.rename("r2", GenericDataFile.class.getName())
.classLoader(GenericManifestFile.class.getClassLoader())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just out of curiosity: do we use the class loader from GenericManifestFile on purpose?

.reuseContainers()
.build();
} else {
AvroIterable<GenericDataFile> files = Avro.read(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

External systems always write v2 manifests right now. This block ensures they will be read correctly even if the table metadata is still v1, correct? If somebody consumed the recent changes already, we could have v2 manifests of different format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This ensures that the correct schema is used to read a v2 manifest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very likely that we will also track this in the manifest list as well, but this is a good work-around to keep the changes separate. And it will help us detect if other metadata is ever wrong.

return CloseableIterable.transform(
ManifestFiles.read(manifest, io).project(fileSchema).allEntries(),
file -> (GenericManifestEntry) file);
GenericDataFile.AsManifestEntry.class::cast);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps the old schema because we are using ManifestEntry.getSchema(partitionType) in AsManifestEntry, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This table continues to use ManifestEntry as a view of the data.

@aokolnychyi
Copy link
Contributor

@rdblue, seems like this breaks our metadata tests. Could you take a look?

@rdblue
Copy link
Contributor Author

rdblue commented May 7, 2020

I'm closing this because we plan to keep manifest_entry and data_file as separate structs.

@rdblue rdblue closed this May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants