@rdblue rdblue commented Dec 1, 2018

This adds a separate file, a manifest list, to track the manifests for a snapshot. The manifest list is an Avro file with one row per manifest. The columns in each row let scan planning skip manifests that cannot contain matching data files, without having to read those manifests.

Columns include:

  • manifest_path: path of the manifest file
  • partition_spec_id: ID of the partition spec used to write the manifest (depends on #3, "Store multiple partition specs in table metadata")
  • added_snapshot_id: snapshot ID when the manifest was added to the table
  • added_data_files_count, existing_data_files_count, deleted_data_files_count: counts of added, existing, and deleted data files in the manifest, used to track operations
  • partitions: a summary (min, max, and containsNull for each field) of the partitions in the manifest file
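
As a rough illustration of the row layout listed above, here is a minimal Python sketch. The field names follow the columns in this description; the Python types, the integer bounds, and the example path are assumptions made for illustration and are not the Avro schema added by this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartitionFieldSummary:
    """Summary of one partition field across the files in a manifest."""
    contains_null: bool
    # Assumption: bounds are plain ints here; None means no non-null values.
    lower_bound: Optional[int]
    upper_bound: Optional[int]


@dataclass
class ManifestListEntry:
    """One row of the manifest list, one per manifest file."""
    manifest_path: str
    partition_spec_id: int
    added_snapshot_id: int
    added_data_files_count: int
    existing_data_files_count: int
    deleted_data_files_count: int
    partitions: List[PartitionFieldSummary] = field(default_factory=list)


# Hypothetical example row; the path and values are made up.
entry = ManifestListEntry(
    manifest_path="/hypothetical/table/metadata/manifest-0.avro",
    partition_spec_id=0,
    added_snapshot_id=1001,
    added_data_files_count=3,
    existing_data_files_count=0,
    deleted_data_files_count=0,
    partitions=[PartitionFieldSummary(contains_null=False, lower_bound=10, upper_bound=20)],
)
```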

Manifest lists are written when the table property write.manifest-lists.enabled is set to true.

When a manifest list is written, its location is stored in the metadata file in place of the list of manifest locations. The snapshot object includes a "manifest-list" key instead of the "manifests" key.
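
A reader of the snapshot metadata therefore has to handle both keys. The following Python sketch shows one way to branch on them. Only the "manifest-list" and "manifests" keys come from this description; the function name and the example paths are hypothetical, and this is not the code in this PR.

```python
def manifest_source(snapshot: dict):
    """Return which key a snapshot uses to locate its manifests, and its value."""
    if "manifest-list" in snapshot:
        # New layout: a single location pointing at the manifest list file.
        return ("manifest-list", snapshot["manifest-list"])
    # Older layout: manifest locations embedded directly in the snapshot.
    return ("manifests", snapshot.get("manifests", []))


# Hypothetical snapshot objects for illustration.
new_style = {"manifest-list": "/hypothetical/metadata/snap-1.avro"}
old_style = {"manifests": ["/hypothetical/metadata/m0.avro", "/hypothetical/metadata/m1.avro"]}

kind_new, location = manifest_source(new_style)
kind_old, locations = manifest_source(old_style)
```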

rdblue commented Dec 1, 2018

@danielcweeks, FYI. This branch adds manifest file lists for snapshots and adds a filter used to skip reading manifest files while planning scans. Please review if you have time.

@danielcweeks
Contributor

+1

It would be good to clarify the plan with respect to the manifest list vs. the manifest list location. If we plan to use the list location as primary going forward, we should probably mark the former as deprecated (even if still supported).

One comment nit, other than that it looks good.

This adds a new table property, write.manifest-lists.enabled, that
defaults to false. When enabled, new snapshot manifest lists will be
written into separate files. The file location will be stored in the
snapshot metadata as "manifest-list".

This expression evaluator determines whether a manifest needs to be
scanned or whether it cannot contain data files matching a partition
predicate.

This modifies SnapshotUpdate when writing a snapshot with a manifest
list file. If files for the manifest list do not have full metadata,
then this will scan the manifests to add metadata, including snapshot
ID, added/existing/deleted count, and partition field summaries.

This optimizes ScanSummary and FileHistory to ignore manifests that
cannot have changes in the configured time range.
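
The manifest-skipping idea in the commit messages above can be sketched in Python as follows. This is a simplified illustration, not the evaluator in this PR: it handles only an equality predicate and an is-null predicate on a single partition field, assumes integer bounds, and assumes that a missing bound means the manifest holds no non-null values for that field. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldSummary:
    """Min, max, and null presence for one partition field in a manifest."""
    contains_null: bool
    lower: Optional[int]  # Assumption: None means no non-null values.
    upper: Optional[int]


def might_match_eq(summary: FieldSummary, value: int) -> bool:
    """False only when no file in the manifest can have field == value."""
    if summary.lower is None or summary.upper is None:
        return False  # Only nulls present, so equality cannot match.
    return summary.lower <= value <= summary.upper


def might_match_is_null(summary: FieldSummary) -> bool:
    """False only when the manifest has no null values for the field."""
    return summary.contains_null


def manifests_to_scan(summaries: List[FieldSummary], value: int) -> List[int]:
    """Indexes of manifests that must be read for the predicate field == value."""
    return [i for i, s in enumerate(summaries) if might_match_eq(s, value)]


summaries = [
    FieldSummary(contains_null=False, lower=1, upper=10),
    FieldSummary(contains_null=True, lower=20, upper=30),
    FieldSummary(contains_null=True, lower=None, upper=None),
]
selected = manifests_to_scan(summaries, 25)
```

The check is conservative: a manifest is skipped only when its summary proves that no file can match, so a result of "might match" still requires reading the manifest.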
@rdblue rdblue force-pushed the add-manifest-list branch from 6c95bc7 to 11c6a83 on December 5, 2018 19:54
rdblue commented Dec 5, 2018

Since the review included #3, I merged that first and rebased this. I'll merge this when tests are passing.

@rdblue rdblue merged commit 54f9a0f into apache:master Dec 6, 2018