Revert #2960 and commit no-op partition replacement operations #3043

hankfanchiu · 2021-08-27T22:51:40Z

Summary

Partially revert e4df91e (from #2960) and allow a no-op partition replacement operation to be committed.

Motivation

#2895 encountered an exception when attempting to insert overwrite with an empty dataset from Spark.

#2960 addressed the issue above by skipping the commit operation entirely (in both Spark 2 and Spark 3).

However, we need to be able to differentiate between a no-op commit vs. a lack of attempt to commit.

Concretely, we have scheduled Spark pipelines that use Iceberg metadata to track commits and read targeted Iceberg snapshots. We additionally set some snapshot-property.<custom key> to externally "name" each snapshot.

With #2960, an upstream Spark application skipping a commit would cause the downstream Spark application to fail to find and read the expected Iceberg snapshot by the custom snapshot property.
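
To make the failure mode concrete, here is a minimal, self-contained sketch of the downstream lookup. `SnapshotInfo`, `findByProperty`, and the key `pipeline.run-id` are hypothetical stand-ins (not Iceberg's API and not the real property name). It only illustrates that a skipped commit leaves nothing for the lookup to find, while a no-op commit does.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class SnapshotLookup {
  private SnapshotLookup() {}

  // Stand-in for a table snapshot; this is NOT Iceberg's Snapshot interface.
  static final class SnapshotInfo {
    final long snapshotId;
    final Map<String, String> summary;

    SnapshotInfo(long snapshotId, Map<String, String> summary) {
      this.snapshotId = snapshotId;
      this.summary = summary;
    }
  }

  // Returns the last snapshot in the list whose summary maps `key` to `value`, if any.
  static Optional<SnapshotInfo> findByProperty(
      List<SnapshotInfo> snapshots, String key, String value) {
    Optional<SnapshotInfo> match = Optional.empty();
    for (SnapshotInfo snapshot : snapshots) {
      if (value.equals(snapshot.summary.get(key))) {
        match = Optional.of(snapshot);
      }
    }
    return match;
  }

  public static void main(String[] args) {
    String key = "pipeline.run-id"; // hypothetical custom snapshot property

    // Upstream committed a no-op snapshot for run-2.
    List<SnapshotInfo> committed = List.of(
        new SnapshotInfo(1L, Map.of(key, "run-1")),
        new SnapshotInfo(2L, Map.of(key, "run-2", "changed-partition-count", "0")));

    // Upstream skipped the commit for run-2 entirely.
    List<SnapshotInfo> skipped = List.of(new SnapshotInfo(1L, Map.of(key, "run-1")));

    System.out.println(findByProperty(committed, key, "run-2").isPresent()); // true
    System.out.println(findByProperty(skipped, key, "run-2").isPresent());   // false
  }
}
```

In the second case the downstream job cannot tell whether the upstream run produced no data or never ran at all.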

Testing

The test case introduced by #2960 still passes:

iceberg/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java

Lines 192 to 233 in 7d6f692

    @Test
    public void testEmptyOverwrite() throws IOException {
      File parent = temp.newFolder(format.toString());
      File location = new File(parent, "test");

      HadoopTables tables = new HadoopTables(CONF);
      PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).identity("id").build();
      Table table = tables.create(SCHEMA, spec, location.toString());

      List<SimpleRecord> records = Lists.newArrayList(
          new SimpleRecord(1, "a"),
          new SimpleRecord(2, "b"),
          new SimpleRecord(3, "c")
      );

      List<SimpleRecord> expected = records;
      Dataset<Row> df = spark.createDataFrame(records, SimpleRecord.class);

      df.select("id", "data").write()
          .format("iceberg")
          .option(SparkWriteOptions.WRITE_FORMAT, format.toString())
          .mode(SaveMode.Append)
          .save(location.toString());

      Dataset<Row> empty = spark.createDataFrame(ImmutableList.of(), SimpleRecord.class);
      empty.select("id", "data").write()
          .format("iceberg")
          .option(SparkWriteOptions.WRITE_FORMAT, format.toString())
          .mode(SaveMode.Overwrite)
          .option("overwrite-mode", "dynamic")
          .save(location.toString());

      table.refresh();

      Dataset<Row> result = spark.read()
          .format("iceberg")
          .load(location.toString());

      List<SimpleRecord> actual = result.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
      Assert.assertEquals("Number of rows should match", expected.size(), actual.size());
      Assert.assertEquals("Result rows should match", expected, actual);
    }

On Spark 2, I've also run an application that saves an empty Dataset in overwrite mode, resulting in a new but no-op snapshot:

  "snapshots" : [ {
    "snapshot-id" : 1680973636538102330,
    "timestamp-ms" : 1630102232337,
    "summary" : {
      "operation" : "overwrite",
      "spark.app.id" : "<omitted>",
      "replace-partitions" : "true",
      "<custom key>" : "<omitted>",
      "changed-partition-count" : "0",
      "total-records" : "0",
      "total-files-size" : "0",
      "total-data-files" : "0",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "<omitted>.avro",
    "schema-id" : 0
  } ],

RussellSpitzer · 2021-08-30

I think the consensus in Slack was that we don't want this to be the default behavior. We only want to allow empty commits if a flag is set; otherwise we want it to be a no-op.

So we need to add a new table option, and check it on whether or not to make the commit in the MergingSnapshotProducer apply method.

hankfanchiu · 2021-09-07T21:19:52Z

> we need to add a new table option, and check it on whether or not to make the commit in the MergingSnapshotProducer apply method.

What would you suggest as the name of this configuration option? How about one of the following?

commit.allow-empty.enabled, default: false
commit.skip-empty.enabled, default: true
commit.omit-empty.enabled, default: true

Putting the verb first differs from an existing option, which names the affected entity before the action (noun, then verb):

iceberg/core/src/main/java/org/apache/iceberg/TableProperties.java

Lines 83 to 84 in 5f90476

    public static final String MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled";
    public static final boolean MANIFEST_MERGE_ENABLED_DEFAULT = true;

Some other options:

empty-allow or empty-skip or empty-omit seems unintuitive?
empty-allowed.enabled or empty-skipped.enabled or empty-omitted.enabled stutters a bit?
empty.enabled might be ambiguous?

rdblue · 2021-09-12T16:44:31Z

Allow isn't a good term because it implies failure if something is not allowed, not skipping. Skip and omit are okay, but I think that we want the default to be false so that adding the setting is positive: keep empty commits vs skip empty commits. That helps the default seem less surprising.

So what I'm leaning toward is commit.keep-empty.enabled. Does that sound alright to everyone?
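
As a rough sketch of how such a flag could gate the commit decision: the property name `commit.keep-empty.enabled` and its `false` default come from this thread and are only a proposal, not an existing Iceberg table property. `EmptyCommitPolicy` and `shouldCommit` are hypothetical names, and the sketch does not touch `MergingSnapshotProducer`.

```java
import java.util.Map;

final class EmptyCommitPolicy {
  private EmptyCommitPolicy() {}

  // Proposed in this thread; hypothetical until an option is actually added.
  static final String KEEP_EMPTY_ENABLED = "commit.keep-empty.enabled";
  static final boolean KEEP_EMPTY_ENABLED_DEFAULT = false;

  // Commits with changes always go through; empty commits are kept only
  // when the table opts in through the property.
  static boolean shouldCommit(Map<String, String> tableProperties, int changedPartitionCount) {
    if (changedPartitionCount > 0) {
      return true;
    }
    String value = tableProperties.get(KEEP_EMPTY_ENABLED);
    return value == null ? KEEP_EMPTY_ENABLED_DEFAULT : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    System.out.println(shouldCommit(Map.of(), 0));                           // false
    System.out.println(shouldCommit(Map.of(KEEP_EMPTY_ENABLED, "true"), 0)); // true
    System.out.println(shouldCommit(Map.of(), 3));                           // true
  }
}
```

With the default left at `false`, existing tables keep the skip-on-empty behavior from #2960, and pipelines that rely on a snapshot per run set the property explicitly.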

RussellSpitzer · 2021-09-12T17:19:42Z

I don't think we need "enabled" but I'm fine with anything really.


github-actions · 2024-07-19T01:09:20Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the project mailing list. Thank you for your contributions.

github-actions · 2024-07-28T00:14:36Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Revert apache#2960 and commit no-op partition replacement operations

77f33db

github-actions bot added core spark labels Aug 27, 2021

RussellSpitzer reviewed Aug 30, 2021

github-actions bot added the stale label Jul 19, 2024

github-actions bot closed this Jul 28, 2024

3 participants
