@szehon-ho
Member

Adding a test to demonstrate how to age off delete files and eventually remove them via expire snapshots.

@github-actions github-actions bot added the spark label Feb 16, 2022
sql("DELETE FROM %s WHERE id=1", tableName);

Table table = validationCatalog.loadTable(tableIdent);
table.refresh();
Member

shouldn't need to refresh here

Member Author

Removed

sql("INSERT INTO TABLE %s VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')", tableName);
sql("DELETE FROM %s WHERE id=1", tableName);

Table table = validationCatalog.loadTable(tableIdent);
Member

You know, it may be easier to always just use the SparkUtil loadSparkTable class now. You wouldn't ever have to refresh.

Member Author

Done

Member

@RussellSpitzer RussellSpitzer left a comment

Few small things, thanks for adding the test!

@aokolnychyi
Contributor

Let me take a look tomorrow too. Sorry for the delay!

@szehon-ho szehon-ho force-pushed the expire_delete_master branch from af3c6b5 to aa9398f Compare February 18, 2022 19:43
@szehon-ho
Member Author

@RussellSpitzer I replied to the comments, whenever you have time for another look. @aokolnychyi no problem, thanks for reviewing!

Contributor

@kbendick kbendick left a comment

+1. LGTM. Thank you @szehon-ho!

Contributor

@aokolnychyi aokolnychyi left a comment

LGTM, just a few nits

Path deleteManifestPath = new Path(deleteManifests(table).iterator().next().path());
Path deleteFilePath = new Path(String.valueOf(deleteFiles(table).iterator().next().path()));

sql("CALL %s.system.rewrite_data_files(table => '%s', options => map" +
Contributor

nit: I think it would be easier to read if the args were on separate lines like in a few other places.

Member Author

Done

sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg TBLPROPERTIES" +
"('format-version'='2', 'write.delete.mode'='merge-on-read')", tableName);

sql("INSERT INTO TABLE %s VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')", tableName);
Contributor

@aokolnychyi aokolnychyi Feb 19, 2022

I am afraid we don't know the number of files that this insert will produce. That's why ID = 1 may end up in a separate file (unlikely but possible). If we write just a single file with ID = 1, the DELETE operation below will be a metadata operation and the test will fail.

I think it would be safer to use a typed Dataset and SimpleRecord. That way, we can call coalesce(1) before writing to make sure we produce only 1 file and the subsequent DELETE operation will produce a delete file.
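
To illustrate the concern about file layout, here is a toy model in plain Java. It is not Iceberg code: the class and method names are hypothetical, each data file is represented only as a list of ids, and it assumes a DELETE needs a delete file only when a matching row shares a data file with rows that must be kept. Under that assumption, a single coalesced file forces a delete file, while a layout where ID = 1 sits alone in its own file does not.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Iceberg code: each inner list stands for the ids stored in one data file.
final class DeletePlanningSketch {

  // Returns true if deleting `id` would have to write a delete file in this model,
  // i.e. some data file holds the id together with rows that must survive.
  static boolean needsDeleteFile(List<List<Long>> dataFiles, long id) {
    for (List<Long> file : dataFiles) {
      boolean containsId = file.contains(id);
      boolean hasOtherRows = file.stream().anyMatch(v -> v != id);
      if (containsId && hasOtherRows) {
        return true;
      }
    }
    // Every file with the id holds nothing else, so whole files can be dropped in metadata.
    return false;
  }

  public static void main(String[] args) {
    // Layout after coalesce(1): all four rows in one file.
    List<List<Long>> singleFile = Arrays.asList(Arrays.asList(1L, 2L, 3L, 4L));
    // Unlucky layout: ID = 1 ended up alone in its own file.
    List<List<Long>> splitFiles = Arrays.asList(Arrays.asList(1L), Arrays.asList(2L, 3L, 4L));

    System.out.println("single file needs delete file: " + needsDeleteFile(singleFile, 1L));
    System.out.println("split files need delete file: " + needsDeleteFile(splitFiles, 1L));
  }
}
```

In this model the first layout returns true and the second returns false, which is why pinning the insert to one file keeps the test deterministic.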

Member Author

Right, good catch, done.

Assert.assertFalse("Delete file should be removed", localFs.exists(deleteFilePath));
}

private Set<ManifestFile> deleteManifests(Table table) {
Contributor

nit: this may be simplified a bit?

  private List<ManifestFile> deleteManifests(Table table) {
    return table.currentSnapshot().deleteManifests();
  }

  private Set<DeleteFile> deleteFiles(Table table) {
    Set<DeleteFile> deleteFiles = Sets.newHashSet();

    for (FileScanTask task : table.newScan().planFiles()) {
      deleteFiles.addAll(task.deletes());
    }

    return deleteFiles;
  }

Member Author

Done.

@szehon-ho szehon-ho force-pushed the expire_delete_master branch from f723e51 to 04365e3 Compare February 19, 2022 02:41
@rdblue rdblue merged commit 701199f into apache:master Feb 20, 2022
@rdblue
Contributor

rdblue commented Feb 20, 2022

Looks like everything has been addressed so I'll merge this. Thanks, @szehon-ho!

@szehon-ho
Member Author

Thanks @RussellSpitzer, @kbendick, @aokolnychyi for the reviews, and @rdblue for the weekend merge!

@chenwyi2

Which version is supposed to support ExpireSnapshot deleting DeleteFile? I use Iceberg 0.14.0, and ExpireSnapshot cannot delete DeleteFile.

@szehon-ho
Member Author

Hi, when I tested this, I think the issue is not that expireSnapshot doesn't remove delete files. It's that the delete files don't get removed from the current snapshot. The issue is described here: #4127

Can you check that? You can query the 'files' table: does it have the delete files? If so, then they are still on the current snapshot.

I'm working on a design for a fix for this, but haven't had time yet. Hopefully I will put a design doc up next week.
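
A rough way to picture the distinction is the toy model below, written in plain Java. It is not the Iceberg implementation: the class, the helper methods, and the file names are hypothetical, and a snapshot is reduced to the set of file names it references. It assumes that expiring snapshots can only physically remove files that no retained snapshot still references.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model, not the Iceberg implementation: a snapshot is just the set of file names it references.
final class ExpireSnapshotsSketch {

  // Files that become safe to delete physically: referenced by an expired snapshot
  // and by none of the retained snapshots.
  static Set<String> removableFiles(List<Set<String>> expired, List<Set<String>> retained) {
    Set<String> removable = new HashSet<>();
    for (Set<String> snapshot : expired) {
      removable.addAll(snapshot);
    }
    for (Set<String> snapshot : retained) {
      removable.removeAll(snapshot);
    }
    return removable;
  }

  // Convenience helper for building a snapshot from file names.
  static Set<String> snapshot(String... files) {
    return new HashSet<>(Arrays.asList(files));
  }

  public static void main(String[] args) {
    // Snapshot after the insert, then snapshot after the merge-on-read DELETE.
    List<Set<String>> expired = Arrays.asList(snapshot("data-1"), snapshot("data-1", "delete-1"));

    // Current snapshot still references the delete file: only the old data file is removable.
    System.out.println(removableFiles(expired, Collections.singletonList(snapshot("data-2", "delete-1"))));

    // Current snapshot no longer references the delete file: it becomes removable too.
    System.out.println(removableFiles(expired, Collections.singletonList(snapshot("data-2"))));
  }
}
```

In the first case the delete file survives expiry because the retained snapshot still points at it, which matches the symptom of delete files showing up in the 'files' table.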

@chenwyi2

Yes, when I query files, there are delete files, but the data files that they reference have been removed.
