rdblue commented Apr 9, 2020

This extends #903 with a v2 manifest list. The v2 schema is not final and may change. This is mainly to implement a separate write path and a framework for compatibility tests.

Specific changes:

  • Split ManifestFile schema into v1 and v2 (with sequence numbers)
  • Update GenericManifestFile to v2 schema
  • Update v1 manifest list writer to use the v1 schema
  • Add a v2 manifest list writer
  • Add tests for v1 and v2 manifest list formats

public TemporaryFolder temp = new TemporaryFolder();

@Test
public void testManifestsWithoutRowStats() throws IOException {
Contributor Author

Moved into TestManifestFileVersions.

@chenjunjiedada
Collaborator

+1

@rdblue
Contributor Author

rdblue commented Apr 10, 2020

Instead of making spec changes in each v2 commit, I'm using a separate v2-spec PR, #912. That way, updates to the spec don't trigger CI runs in PRs that are waiting to be merged.

@rdblue rdblue force-pushed the v2-manifest-lists branch from 456f2ca to 220c6a5 April 10, 2020 23:30
@rdblue rdblue mentioned this pull request Apr 11, 2020
rdblue added 3 commits April 11, 2020 13:37
* Update GenericManifestFile to v2 schema
* Update v1 manifest list writer to use the v1 schema
* Add a v2 manifest list writer
* Add tests for v1 and v2 manifest list formats
optional(512, "added_rows_count", Types.LongType.get()),
optional(513, "existing_rows_count", Types.LongType.get()),
optional(514, "deleted_rows_count", Types.LongType.get()));
PATH, LENGTH, SPEC_ID,
Contributor

I remember we had issues with reordering fields in ManifestFile because GenericAvroWriter was using ordinal positions instead of field IDs. Did we solve that?

Contributor

I see this is handled in V1Metadata.

Contributor Author

Yeah, the IndexedRecord field order needs to match the schema order. This is why I added a test for this as well.
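
To illustrate the point about field order, here is a minimal sketch of a position-based wrapper. It does not use the real Avro or Iceberg classes: `ManifestInfo`, `V1PositionalView`, and the column list are hypothetical names introduced for this example, and the v1 column order shown is an assumption based on the `PATH, LENGTH, SPEC_ID` prefix visible in the diff. The idea is that a writer asking for "position 1" must always get the value for the second column of the schema it writes, regardless of how the in-memory class orders its fields.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory entry; NOT the real ManifestFile interface.
final class ManifestInfo {
  final String path;
  final long length;
  final int specId;
  final long sequenceNumber; // present only in the (assumed) v2 layout
  final Long snapshotId;

  ManifestInfo(String path, long length, int specId, long sequenceNumber, Long snapshotId) {
    this.path = path;
    this.length = length;
    this.specId = specId;
    this.sequenceNumber = sequenceNumber;
    this.snapshotId = snapshotId;
  }
}

// Hypothetical wrapper that exposes an entry by position in an assumed v1 column order.
final class V1PositionalView {
  // Assumed v1 column order; the names are illustrative.
  static final List<String> COLUMNS =
      Arrays.asList("manifest_path", "manifest_length", "partition_spec_id", "added_snapshot_id");

  private ManifestInfo wrapped;

  V1PositionalView wrap(ManifestInfo info) {
    this.wrapped = info;
    return this;
  }

  // Each position maps to the column at the same index in COLUMNS.
  // The v2-only sequence number is deliberately not reachable here.
  Object get(int pos) {
    switch (pos) {
      case 0: return wrapped.path;
      case 1: return wrapped.length;
      case 2: return wrapped.specId;
      case 3: return wrapped.snapshotId;
      default: throw new IllegalArgumentException("Unknown v1 position: " + pos);
    }
  }

  // Materialize one row in schema order, as a position-based writer would read it.
  Object[] toRow() {
    Object[] row = new Object[COLUMNS.size()];
    for (int i = 0; i < row.length; i++) {
      row[i] = get(i);
    }
    return row;
  }
}

final class V1PositionalViewDemo {
  public static void main(String[] args) {
    ManifestInfo info = new ManifestInfo("s3://bucket/m1.avro", 1024L, 0, 7L, 42L);
    System.out.println(Arrays.toString(new V1PositionalView().wrap(info).toRow()));
  }
}
```

A test along the lines mentioned above would check each position against the expected column, so that reordering fields in the in-memory class cannot silently change what gets written.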

this.fromProjectionPos = null;
}

public GenericManifestFile(String path, long length, int specId, Long snapshotId,
Contributor

Even though it was public, I don't think anyone is using it. Seems OK to change.

Contributor Author

Yeah, this should be fine. Classes in core are only semi-public and not part of the API; the API module is the one with stronger guarantees.

private Integer existingFilesCount = null;
private Long existingRowsCount = null;
private Integer deletedFilesCount = null;
private Long addedRowsCount = null;
Contributor

Looks like we are trying to match the new ordering of fields in ManifestFile. Earlier, we co-located ...FilesCount with ...RowsCount to match the ordering of methods in ManifestFile and args in constructors. Is this change intentional?

Contributor Author

Yes. I think it's likely that these are going to be null in some cases, like manifests that contain equality delete files. Instead of mixing null, non-null, null, non-null, etc., I think it's better to keep the probably-null columns colocated for compression.

static class V1Writer extends ManifestListWriter {
private V1Writer(OutputFile snapshotFile, long snapshotId, Long parentSnapshotId) {
super(snapshotFile, snapshotId, parentSnapshotId);
private final V1Metadata.IndexedManifestFile wrapper = new V1Metadata.IndexedManifestFile();
Contributor

@aokolnychyi aokolnychyi Apr 14, 2020

This seems like the place where we could pass a snapshot ID, similar to how we pass a sequence number to V2. That would let us get rid of the logic for inheriting metadata for ManifestEntry via setSnapshotId and for iterating through manifests during commit.

Am I understanding this correctly, @rdblue?

Contributor Author

Yes. I'm going to rework inheritance for snapshot ID in a separate commit.

Contributor

Great!
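
The idea discussed in this thread can be sketched roughly as follows. This is not the code from the PR or from the follow-up commit: `PendingManifest`, `InheritingListWriter`, the `UNASSIGNED` sentinel, and the comma-separated row format are all hypothetical, chosen only to show a writer that is constructed with commit-level values and fills them in at write time, instead of a separate pass that mutates every entry during commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical manifest list entry. A null snapshot ID or an UNASSIGNED
// sequence number means "inherit the value from the commit being written".
final class PendingManifest {
  static final long UNASSIGNED = -1L; // illustrative sentinel, an assumption

  final String path;
  final Long snapshotId;
  final long sequenceNumber;

  PendingManifest(String path, Long snapshotId, long sequenceNumber) {
    this.path = path;
    this.snapshotId = snapshotId;
    this.sequenceNumber = sequenceNumber;
  }
}

// Hypothetical writer that receives the commit's snapshot ID and sequence
// number once, in its constructor, and applies them to entries that lack them.
final class InheritingListWriter {
  private final long commitSnapshotId;
  private final long commitSequenceNumber;
  private final List<String> rows = new ArrayList<>();

  InheritingListWriter(long commitSnapshotId, long commitSequenceNumber) {
    this.commitSnapshotId = commitSnapshotId;
    this.commitSequenceNumber = commitSequenceNumber;
  }

  void add(PendingManifest manifest) {
    // Entries that already carry values keep them; only gaps are filled.
    long snapshotId = manifest.snapshotId != null ? manifest.snapshotId : commitSnapshotId;
    long sequenceNumber = manifest.sequenceNumber != PendingManifest.UNASSIGNED
        ? manifest.sequenceNumber
        : commitSequenceNumber;
    rows.add(manifest.path + "," + snapshotId + "," + sequenceNumber);
  }

  List<String> rows() {
    return rows;
  }
}

final class InheritingListWriterDemo {
  public static void main(String[] args) {
    InheritingListWriter writer = new InheritingListWriter(42L, 7L);
    writer.add(new PendingManifest("new.avro", null, PendingManifest.UNASSIGNED));
    writer.add(new PendingManifest("old.avro", 40L, 5L));
    writer.rows().forEach(System.out::println);
  }
}
```

With this shape, the in-memory entries stay untouched and the inherited values exist only in what the writer emits, which is the simplification the comment above is asking about.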

@rdblue rdblue requested a review from aokolnychyi April 14, 2020 22:37
Contributor

@aokolnychyi aokolnychyi left a comment

+1

@aokolnychyi aokolnychyi merged commit 487fc1c into apache:master Apr 14, 2020
@aokolnychyi
Contributor

I've merged this. Thanks, @rdblue!

@rdblue rdblue added this to the Row-level Delete milestone May 8, 2020