Inherit snapshot ids for manifest entries #675

aokolnychyi · 2019-11-29T18:01:36Z

This PR addresses #504.

api/src/main/java/org/apache/iceberg/ManifestFile.java

aokolnychyi · 2019-11-29T18:14:49Z

core/src/main/java/org/apache/iceberg/FastAppend.java

+    Preconditions.checkArgument(manifest.snapshotId() == null, "Snapshot id must be assigned during commit");
+
+    // TODO: avoid reading manifests to simply get stats
+    try (ManifestReader reader = ManifestReader.read(manifest, ops.io(), ops.current().specsById())) {


I think we need to collect more metadata while writing manifests so that we don't have to read manifests simply to get stats. Clearly, reading them kills all the benefits of inheriting the snapshot id.

Do you mean collecting more metadata so that we can build the summary without reading the passed manifest from the file system?

Yes, we can keep the summary in ManifestFile instead. For now, this is fine.

Actually, we can use the manifest's summary stats for top-level properties. Partition-level properties are optional so we should just not include them.

@chenjunjiedada, yes, I would like to avoid reading passed manifests for better performance.

@rdblue, could you elaborate on what you mean by top-level and partition-level properties? Do you mean changed-partition-count is optional while added-records, added-data-files and others aren't?

Ideally, we would produce all of the summary stats, but the most important ones are total-data-files, total-records, and the added- or deleted- properties that are used to produce totals. I think it's okay to not write the changed-partition-count metrics if they require scanning the appended manifest.
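As a rough illustration of the idea above, the top-level summary could be derived from per-manifest counts alone, without opening the appended manifest. This is only a sketch: `SummarySketch`, `ManifestCounts`, and `appendSummary` are hypothetical names, not Iceberg classes; only the property keys come from the discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: only the summary property keys are taken from the thread.
final class SummarySketch {

  // Stand-in for counts that could be recorded when a manifest is written.
  static final class ManifestCounts {
    final long addedFiles;
    final long addedRecords;

    ManifestCounts(long addedFiles, long addedRecords) {
      this.addedFiles = addedFiles;
      this.addedRecords = addedRecords;
    }
  }

  // Builds the top-level summary from manifest-level counts only.
  // changed-partition-count is omitted because it would require scanning entries.
  static Map<String, String> appendSummary(
      long previousTotalFiles, long previousTotalRecords, Iterable<ManifestCounts> appended) {
    long addedFiles = 0L;
    long addedRecords = 0L;
    for (ManifestCounts counts : appended) {
      addedFiles += counts.addedFiles;
      addedRecords += counts.addedRecords;
    }

    Map<String, String> summary = new LinkedHashMap<>();
    summary.put("added-data-files", String.valueOf(addedFiles));
    summary.put("added-records", String.valueOf(addedRecords));
    summary.put("total-data-files", String.valueOf(previousTotalFiles + addedFiles));
    summary.put("total-records", String.valueOf(previousTotalRecords + addedRecords));
    return summary;
  }
}
```

The totals depend only on the previous totals and the added counts, which is why they can be kept even when partition-level metrics are skipped.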

I think my response was confusing because we keep additional summary information about each partition in our version. I can move that upstream if everyone wants it, but it can make the metadata files quite large. Without a use case for doing this upstream, I didn't think it was a good idea to make everyone's metadata significantly larger.

Do you have a few examples of what stats you collect per partition? One of our customers was asking about some stats about what partitions were modified, for example.

core/src/main/java/org/apache/iceberg/FastAppend.java

core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java

core/src/main/java/org/apache/iceberg/ManifestReader.java

core/src/test/java/org/apache/iceberg/TestFastAppend.java

core/src/test/java/org/apache/iceberg/TestMergeAppend.java

core/src/test/java/org/apache/iceberg/TestTransaction.java

aokolnychyi · 2019-11-29T18:20:28Z

spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala

      conf: SerializableConfiguration,
      spec: PartitionSpec,
-      basePath: String): Iterator[SparkDataFile] => Iterator[Manifest] = { files =>
+      basePath: String): Iterator[SparkDataFile] => Iterator[ManifestFile] = { files =>


We need to use the real ManifestFile with all stats if we want to skip reading/writing manifests on the driver.

As long as this works, I'm all for it. Can we do the same with DataFile?

I think that should work. I've created #763.

What about making DataFile and ManifestFile extend Serializable?

I'm not sure whether it's a good idea to make an interface Serializable. I'll have to think about that one.

aokolnychyi · 2019-11-29T18:21:42Z

core/src/main/java/org/apache/iceberg/PartitionSummary.java


  List<PartitionFieldSummary> summaries() {
-    return Lists.transform(Arrays.asList(fields), PartitionFieldStats::toSummary);
+    return Arrays.stream(fields).map(PartitionFieldStats::toSummary).collect(Collectors.toList());


This transformation has to be eager as PartitionFieldStats is not serializable.
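To illustrate why eagerness matters here: a lazily transformed list keeps a reference to its source elements and converts them on access, so serializing the view drags the non-serializable source along. The sketch below uses only the JDK; `LazyView` is a hypothetical stand-in for a lazy view such as the one Guava's `Lists.transform` returns, and `Stats` is a made-up placeholder for `PartitionFieldStats`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class EagerVsLazySketch {

  // Not serializable, like PartitionFieldStats in the discussion (fields are made up).
  static final class Stats {
    final int count;

    Stats(int count) {
      this.count = count;
    }

    String toSummary() {
      return "count=" + count;
    }
  }

  // Hypothetical lazy view: keeps the source list and converts each element on access.
  static final class LazyView extends AbstractList<String> implements Serializable {
    private final List<Stats> source;

    LazyView(List<Stats> source) {
      this.source = source;
    }

    @Override
    public String get(int index) {
      return source.get(index).toSummary();
    }

    @Override
    public int size() {
      return source.size();
    }
  }

  static List<String> lazy(Stats[] fields) {
    return new LazyView(Arrays.asList(fields));
  }

  // Eager conversion: the result holds only the converted values.
  static List<String> eager(Stats[] fields) {
    return Arrays.stream(fields).map(Stats::toSummary).collect(Collectors.toList());
  }

  // Attempts Java serialization and reports whether it succeeded.
  static boolean canSerialize(Object obj) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(obj);
      return true;
    } catch (IOException e) {
      return false; // NotSerializableException is an IOException
    }
  }
}
```

Both lists expose the same contents, but only the eager one is detached from the `Stats` objects and can be shipped between processes.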

aokolnychyi · 2019-11-30T10:33:37Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

   * @param file an InputFile
   * @return a manifest reader
   */
  public static ManifestReader read(InputFile file) {


@rdblue, am I correct that using read(InputFile file) is unsafe only when we update the table schema and use partition filters (because ManifestReader will build a partition spec from the old schema stored in the manifest's metadata)?

And why is it actually safe to use it without filters? Won't we have the wrong partition in DataFile?

Yes, that's correct. The issue is when the manifest is read using the schema at the time the manifest was written. If a column used by a partition transform is renamed, then expression binding can fail.

It's okay to use it without filters because partition tuples are accessed by position, not by name. So an expression will be bound to the current partition schema.
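A deliberately simplified model of the point above: a partition tuple stores values only, so positional reads are unaffected by a rename, while anything that resolves a name against stale field names can fail. The class and method names below are hypothetical and do not mirror Iceberg's binding code.

```java
import java.util.List;

final class PositionalAccessSketch {

  // A partition tuple holds values only; field names live in the spec, not in the tuple.
  static Object valueAt(Object[] partitionTuple, int position) {
    return partitionTuple[position];
  }

  // Name-based lookup against a list of field names; returns -1 when the name is unknown.
  static int positionOf(List<String> fieldNames, String name) {
    return fieldNames.indexOf(name);
  }
}
```

With field names captured before a rename, `positionOf` would return -1 for the new name, while `valueAt(tuple, 0)` still returns the stored value.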

core/src/main/java/org/apache/iceberg/DataTableScan.java

core/src/main/java/org/apache/iceberg/MergeAppend.java

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

rdblue · 2019-12-18T23:50:41Z

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

+
+    // keep reference of the first appended manifest, so that we can avoid merging first bin(s)
+    // which has the first appended manifest and have not crossed the limit of minManifestsCountToMerge
+    if (firstAppendedManifest == null) {


+1 for moving this out of the try/catch.

core/src/main/java/org/apache/iceberg/InheritableMetadata.java

core/src/main/java/org/apache/iceberg/RemoveSnapshots.java

chenjunjiedada · 2019-12-15T02:50:58Z

core/src/main/java/org/apache/iceberg/FastAppend.java

      deleteFile(newManifest.path());
    }
-
-    for (ManifestFile manifest : appendManifests) {


Do we need to clean up appendManifestsWithMetadata when the commit does not go through, or leave that to the caller as you comment below?

The appended manifests become part of the table, so there is no need to delete them.

Do you mean even manifests that are not committed are part of the table? Or all appended manifests must be committed successfully so we don't have to care for this?

@chenjunjiedada, yes, all appended manifests must be successfully committed. They are not generated by Iceberg anymore. Instead, they are produced by users.

One case we need to consider is when we add a manifest using merge append and it gets combined with other manifests/files in apply. In that case, the added manifest will never be part of the table and can become orphaned.

It seems reasonable to clean up the external manifest if the commit is successful but that manifest is not part of the table metadata in MergingSnapshotProducer. Otherwise, the caller will have to detect which of the appended manifests were merged and delete only those to ensure we don't have orphan manifests. It is different from what we have right now because all manifests are copied before they are appended to the metadata. What do you think, guys?

Sounds good to me.

chenjunjiedada · 2019-12-15T14:49:18Z

core/src/main/java/org/apache/iceberg/FastAppend.java

+    Preconditions.checkArgument(manifest.snapshotId() == null, "Snapshot id must be assigned during commit");
+
+    // TODO: avoid reading manifests to simply get stats
+    try (ManifestReader reader = ManifestReader.read(manifest, ops.io(), ops.current().specsById())) {


Do you mean collecting more metadata so that we can build the summary without reading the passed manifest from the file system?

core/src/main/java/org/apache/iceberg/ManifestEntry.java

core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java

core/src/test/java/org/apache/iceberg/TestMergeAppend.java

aokolnychyi · 2020-01-07T10:01:13Z

I've updated this PR with some recent progress. I am working on tests. In addition, there are a couple of open points to discuss.

rdblue · 2020-01-07T18:56:42Z

Looks like this conflicts with the recent update to rewrite manifests. Can you update, @aokolnychyi?

core/src/test/java/org/apache/iceberg/TestMergeAppend.java

aokolnychyi · 2020-01-08T16:23:27Z

This PR is ready for a closer look.

chenjunjiedada

LGTM

aokolnychyi · 2020-01-09T15:23:44Z

I ran a few basic benchmarks locally (e.g. add/read 100 manifests that are 275KB in size and contain 10000 entries).

With this PR:

Benchmark                                      Mode  Cnt  Score   Error  Units
AppendManifestsBenchmark.fastAppendManifests     ss    5  2.468 ± 0.087   s/op
AppendManifestsBenchmark.mergeAppendManifests    ss    5  4.489 ± 0.103   s/op
ReadManifestsBenchmark.basicPlanFiles            ss    5  0.489 ± 0.103   s/op

Before this PR:

Benchmark                                      Mode  Cnt  Score   Error  Units
AppendManifestsBenchmark.fastAppendManifests     ss    5  6.387 ± 1.068   s/op
AppendManifestsBenchmark.mergeAppendManifests    ss    5  8.007 ± 0.217   s/op
ReadManifestsBenchmark.basicPlanFiles            ss    5  0.496 ± 0.051   s/op

Right now, we still read manifests to get stats. I think we can address that in a follow-up PR. The important part is there is no performance degradation on read.
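For readers who want the ratios behind the two tables, the arithmetic can be sketched as below. The scores are copied from the benchmark output above; the class and method names are illustrative. The append benchmarks come out at roughly 2.6x and 1.8x faster, while the read benchmark differs by less than its reported error.

```java
final class BenchmarkSpeedup {

  // Ratio of the "before" score to the "after" score; scores are seconds per operation.
  static double speedup(double beforeSecondsPerOp, double afterSecondsPerOp) {
    return beforeSecondsPerOp / afterSecondsPerOp;
  }

  public static void main(String[] args) {
    System.out.printf("fastAppendManifests:  %.2fx%n", speedup(6.387, 2.468));
    System.out.printf("mergeAppendManifests: %.2fx%n", speedup(8.007, 4.489));
    System.out.printf("basicPlanFiles:       %.2fx%n", speedup(0.496, 0.489));
  }
}
```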

rdblue · 2020-01-13T19:05:01Z

@aokolnychyi, the implementation looks correct to me. My only remaining concern is our forward-compatibility guarantee that older readers will continue to be able to read tables written by future versions within the format version in metadata.

Technically, this breaks that guarantee for tables that are appended to using manifests. I think that means we should test that older readers can read tables written with this that don't have appended manifests (I think Avro will throw a runtime error if it encounters a null snapshot ID, but will allow attempting to read). We should also probably add a feature flag to turn on this breaking behavior -- that way you can opt into using manifest files without a snapshot ID, knowing that it will break older readers. What do you think?

rdblue · 2020-01-13T19:27:46Z

I started a thread on the dev list to discuss breaking changes to the format and how to handle them. I proposed the feature flag approach.

aokolnychyi · 2020-01-14T10:26:55Z

The feature flag approach makes sense to me in this case. Then what about exposing a table property to enable metadata inheritance (false by default) and rewriting appended manifests if that property is false? That will complicate the logic for cleanup but it will keep the format forward-compatible by default.

rdblue · 2020-01-14T16:56:56Z

Then what about exposing a table property ... and rewriting appended manifests if that property is false?

That's what I was thinking. I don't think we should add a flag that turns on or off inheritance, though. When reading, we must always add the inherited data. Otherwise we can corrupt other places in metadata. I'd prefer a specific flag to allow writing manifests without snapshot IDs. If it is allowed, then we can append manifests and otherwise we have to rewrite them.
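The read-side rule described above can be pictured as a small fallback: an entry that carries its own snapshot ID keeps it, and an entry without one takes the ID of the manifest it was read from. This is a minimal sketch with hypothetical names, not the actual InheritableMetadata implementation.

```java
final class InheritSnapshotIdSketch {

  // Returns the snapshot ID to expose for a manifest entry:
  // an explicit ID stored in the entry wins, otherwise the manifest's ID is inherited.
  static Long effectiveSnapshotId(Long entrySnapshotId, Long manifestSnapshotId) {
    return entrySnapshotId != null ? entrySnapshotId : manifestSnapshotId;
  }
}
```

Because the fallback is applied unconditionally on read, it does not depend on any write-side flag, which matches the point that only writing manifests without snapshot IDs should be gated.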

aokolnychyi · 2020-01-14T18:03:20Z

Sorry, that’s what I meant by enabling inheritance as well.

chenjunjiedada · 2020-01-20T12:47:16Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

+   * @param specsById a Map from spec ID to partition spec
+   * @return a manifest reader
+   */
+  public static ManifestReader read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) {


Can we extract this API into a separate PR and merge it first?

I have implemented the feature flag locally. As soon as #738 is in, I’ll update this PR and I think we will merge it quickly as well. If this is already blocking your work, I can extract the API changes in ManifestReader. Let me know.

It's in! Sorry for not getting it in sooner since it was blocking this one.

@aokolnychyi, thanks for the update. I have merged some of this patch locally and it works well so far. Please take your time and go ahead with this PR. I may need some time to finish mine since we are in the holiday season :)

aokolnychyi · 2020-01-24T10:01:35Z

@chenjunjiedada @rdblue, I've updated the PR and got rid of reading manifests if we can inherit the snapshot id. Also, there is a feature flag now. Let me know what you think.

rdblue · 2020-01-24T17:29:01Z

core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java

  protected void cleanUncommitted(Set<ManifestFile> committed) {
    cleanUncommitted(newManifests, committed);
-    cleanUncommitted(addedManifests, committed);
+    if (!snapshotIdInheritanceEnabled) {


Is this correct? I thought that added manifests are owned by the table only if inheritance is allowed. That would mean that added manifests are only removed if inheritance is allowed, right?

If inheritance is enabled, addedManifests contains original manifests that should not be removed no matter what the operation outcome is. If the commit fails, the caller can retry. If the commit succeeds, the manifests are part of the metadata now.

If inheritance is not enabled, addedManifests will contain manifest copies. Those must be always cleaned up as the caller doesn't have access to them.

I tried to summarize that in the description to AppendFiles.

By default, the manifest will be rewritten to assign all entries this update's snapshot ID.
In that case, it is always the responsibility of the caller to manage the lifecycle of
the original manifest.

If manifest entries are allowed to inherit the snapshot ID assigned on commit, the manifest
should never be deleted manually if the commit succeeds as it will become part of the table
metadata and will be cleaned up on expiry. If the manifest gets merged with others while
preparing a new snapshot, it will be deleted automatically if this operation is successful.
If the commit fails, the manifest will never be deleted and it is up to the caller whether
to delete or reuse it.

Like in the other cleanup method, should this check whether committed is empty?

Okay, so I think the difference between this and the logic in MergingSnapshotProducer is that this won't compact and rewrite the committed manifests. That means that if the commit succeeded, then all the added manifests are part of the table (so we know they will all be in the committed list).

Here's a quick summary:

Inheritance is enabled:
  - Commit succeeded: do not delete, the files are owned and part of the table
  - Commit failed: do not delete, the files are not owned

Inheritance is not enabled (added manifests are rewritten and are owned):
  - Commit succeeded: run normal manifest cleanup (rely on the committed set)
  - Commit failed: run normal manifest cleanup (committed set will be empty)

Yes, the summary is correct. In MergingSnapshotProducer, though, we can merge the appended manifest while preparing a new snapshot and it will never be part of the table metadata. That's why we have an extra loop through added manifests that were not rewritten.
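The cleanup rules summarized in this thread for the rewrite-manifests case can be sketched as a single decision function. Manifests are represented by their paths, and `CleanupSketch` and `manifestsToDelete` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CleanupSketch {

  // Returns the added manifests that the operation itself should delete.
  static Set<String> manifestsToDelete(
      boolean inheritanceEnabled, List<String> addedManifests, Set<String> committed) {
    Set<String> toDelete = new LinkedHashSet<>();
    if (inheritanceEnabled) {
      // Originals supplied by the caller: on success they belong to the table,
      // on failure the caller decides whether to delete or reuse them.
      return toDelete;
    }
    // Copies written by the operation: delete every copy that was not committed.
    // After a failed commit the committed set is empty, so all copies are deleted.
    for (String path : addedManifests) {
      if (!committed.contains(path)) {
        toDelete.add(path);
      }
    }
    return toDelete;
  }
}
```

This sketch does not cover the MergingSnapshotProducer case mentioned above, where an appended manifest can be merged away during a successful commit and therefore needs an extra pass over the added manifests that were not rewritten.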

core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java

core/src/main/java/org/apache/iceberg/DataFilesTable.java

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

aokolnychyi · 2020-01-28T15:47:59Z

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

+      appendedManifest = copiedManifest;
+    }
+
+    // keep reference of the first appended manifest, so that we can avoid merging first bin(s)


I'll need to double-check this logic after I introduced separate lists for rewritten manifests.

It should work fine.

Actually, I just realized that the appended manifests are added to metadata before the rewritten append manifests. That means that this should actually be the first appended manifest or the first rewritten if all were rewritten. The only case we have to worry about is when the first manifest is rewritten, but a manifest with a null snapshot ID is added later.

It's probably okay to move on since it would be extremely rare and the only problem would be that a bin might get merged when it otherwise wouldn't have been. Not a big problem.

Yes, I had that in mind but decided to go for simplicity as the use case would be extremely rare and there will be no correctness/performance issue.

core/src/test/java/org/apache/iceberg/TestRewriteManifests.java

aokolnychyi · 2020-01-29T14:30:18Z

core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java

    validateFilesCounts();

+    // TODO: add sequence numbers here
+    Iterable<ManifestFile> newManifestsWithMetadata = Iterables.transform(


This place requires attention. When we introduce sequence numbers, we will have to iterate through all new manifests. For now, iterating through newManifests and rewrittenAddedManifests is redundant. However, I expect we will add sequence numbers soon, and then we will simply need to change the closure.

core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java

core/src/main/java/org/apache/iceberg/FastAppend.java

rdblue · 2020-01-31T01:29:21Z

core/src/main/java/org/apache/iceberg/ManifestEntry.java

  private final org.apache.avro.Schema schema;
  private Status status = Status.EXISTING;
-  private long snapshotId = 0L;
+  private Long snapshotId = null;


Just to make sure: have you checked that older readers can read files produced with optional instead of required snapshotIds?

Yes, I tested this locally.

core/src/main/java/org/apache/iceberg/ManifestReader.java

rdblue · 2020-01-31T01:34:49Z

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

-        firstAppendedManifest = manifestFile;
-      }
+    ManifestFile appendedManifest;
+    if (snapshotIdInheritanceEnabled && manifest.snapshotId() == null) {


Looks like snapshots that were written with a version that used -1 will still rewrite those snapshots so we preserve the existing behavior. +1.
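The condition in the diff above boils down to a two-way choice on the write path. The sketch below restates it with hypothetical names (`AppendDecisionSketch`, `Action`, `decide`); only the condition itself comes from the diff.

```java
final class AppendDecisionSketch {

  enum Action { APPEND_AS_IS, REWRITE_WITH_SNAPSHOT_ID }

  // A manifest is appended unchanged only when inheritance is allowed and the
  // manifest has no snapshot ID yet; every other case falls back to rewriting.
  static Action decide(boolean snapshotIdInheritanceEnabled, Long manifestSnapshotId) {
    if (snapshotIdInheritanceEnabled && manifestSnapshotId == null) {
      return Action.APPEND_AS_IS;
    }
    return Action.REWRITE_WITH_SNAPSHOT_ID;
  }
}
```

A manifest that carries a placeholder ID such as -1 has a non-null snapshot ID, so it still takes the rewrite path, which is the preserved behavior noted in the comment.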

core/src/main/java/org/apache/iceberg/TableProperties.java

core/src/test/java/org/apache/iceberg/TableTestBase.java

site/docs/configuration.md

rdblue · 2020-01-31T01:49:20Z

+1

I had a few minor comments and I'd like to change the property name, but I think that this is ready to go overall. @aokolnychyi, can you make the property name change and merge?

aokolnychyi · 2020-01-31T14:03:58Z

I resolved the remaining comments. I'll do some additional testing. If everything is fine, I'll merge it.

Let me know if there are any other comments in the meantime.

aokolnychyi · 2020-02-03T22:51:59Z

I am going to merge this one.

aokolnychyi · 2020-02-03T22:54:11Z

Thanks for the review, @rdblue and @chenjunjiedada!

rdblue · 2020-02-03T22:55:52Z

Thank you for building the framework for inheritance! This is really helpful.

aokolnychyi mentioned this pull request Nov 29, 2019

[WIP] Add sequence number for supporting row level delete #588

Closed

aokolnychyi mentioned this pull request Jan 8, 2020

Update GenericDataFile to return ByteBuffer for keyMetadata in get #728

Merged

aokolnychyi force-pushed the inherit-snapshot-id branch from 53abdbd to e960b64 Compare January 8, 2020 15:51

aokolnychyi changed the title from "[WIP] Inherit snapshot ids" to "Inherit snapshot ids for manifest entries" Jan 8, 2020

aokolnychyi mentioned this pull request Jan 10, 2020

Collect additional metadata while writing manifets #733

Closed

aokolnychyi mentioned this pull request Jan 16, 2020

Collect row stats while writing manifests #738

Merged

aokolnychyi force-pushed the inherit-snapshot-id branch from 99cb763 to 18e0d98 Compare January 24, 2020 08:41
