Fix remaining issues with snapshot expiration for branching and tagging #1

rdblue · 2022-06-14T03:16:39Z

This updates apache#4578 to fix a few remaining issues.

@amogh-jahagirdar, I was going to comment on the last issue about how to compute the unreferenced snapshots to retain, but I thought it was easier to demonstrate with a PR instead.

I also went ahead and expanded a few tests and fixed problems in the tests, like creating branches from hard-coded snapshot IDs.

amogh-jahagirdar · 2022-06-14T04:16:07Z

core/src/main/java/org/apache/iceberg/TableMetadataParser.java

      refs = refsFromJson(node.get(REFS));
+    } else if (currentSnapshotId != -1) {
+      // initialize the main branch if there are no refs
+      refs = ImmutableMap.of(SnapshotRef.MAIN_BRANCH, SnapshotRef.branchBuilder(currentSnapshotId).build());


I guess it's not directly related to the expiration logic, but Is initializing the main branch when parsing sufficient for setting the main ref on the next commit? I was thinking it would have to be set in the builder.

We should follow up with a more thorough fix, but this is a good start.

amogh-jahagirdar · 2022-06-14T04:31:27Z

core/src/main/java/org/apache/iceberg/RemoveSnapshots.java

+  private Set<Long> unreferencedSnapshotsToRetain(Collection<SnapshotRef> refs) {
+    Set<Long> referencedSnapshots = Sets.newHashSet();
+    for (SnapshotRef ref : refs) {
+      for (Snapshot snapshot : SnapshotUtil.ancestorsOf(ref.snapshotId(), base::snapshot)) {


Yeah this works too, I built the complex map before hand so that the ancestors didn't need to be traversed again and could be maintained in memory for the later computation, but maybe that's a overkill (and also consumes even more memory)

This isn't that expensive. We could also pass a set into the branch logic to update, but that would require changes there and would use an input variable as output, so I'd prefer not to.

amogh-jahagirdar · 2022-06-14T04:33:35Z

Thanks @rdblue for cleaning it up, yeah I think the complex map is not really needed. I had it because I thought it would be more efficient to avoid having traverse different branches multiple times. once a branch's lineage was traversed it could reside in memory. But this may be overkill and not the dimension to optimize for. Thanks for adding more tests as well!

Core: Pass in parent snapshot to apply and validate.

@sfc-gh-mparmar

…flake-managed Iceberg tables (apache#6428) * Initial read-only Snowflake Catalog implementation by @sfc-gh-mparmar (#1) Initial read-only Snowflake Catalog implementation built on top of the Snowflake JDBC driver, providing support for basic listing of namespaces, listing of tables, and loading/reads of tables. Auth options are passthrough to the JDBC driver. Co-authored-by: Maninder Parmar <[email protected]> Co-authored-by: Maninder Parmar <[email protected]> Co-authored-by: Dennis Huo <[email protected]> * Add JdbcSnowflakeClientTest using mocks (#2) Add JdbcSnowflakeClientTest using mocks; provides full coverage of JdbcSnowflakeClient and entities' ResultSetHandler logic. Also update target Spark runtime versions to be included. * Add test { useJUnitPlatform() } tuple to iceberg-snowflake for consistency and future interoperability with inheriting from abstact unittest base classes. * Extract versions into versions.props per PR review * Misc test-related refactors per review suggestions -Convert unittests to all use assertj/Assertions for "fluent assertions" -Refactor test injection into overloaded initialize() method -Add test cases for close() propagation -Use CloseableGroup. * Fix unsupported behaviors of loadNamedpaceMetadata and defaultWarehouseLocation * Move TableIdentifier checks out of newTableOps into the SnowflakTableOperations class itself, add test case. * Refactor out any Namespace-related business logic from the lower SnowflakeClient/JdbcSnowflakeClient layers and merge SnowflakeTable and SnowflakeSchema into a single SnowflakeIdentifier that also encompasses ROOT and DATABASE level identifiers. A SnowflakeIdentifier thus functions like a type-checked/constrained Iceberg TableIdentifier, and eliminates any tight coupling between a SnowflakeClient and Catalog business logic. Parsing of Namespace numerical levels into a SnowflakeIdentifier is now fully encapsulated in NamespaceHelpers so that callsites don't duplicate namespace-handling/validation logic. * Finish migrating JdbcSnowflakeClientTest off any usage of org.junit.Assert in favor of assertj's Assertions. * Style refactorings from review comments, expanded and moved InMemoryFileIO into core with its own unittest. * Fix behavior of getNamespaceMetadata to throw when the namespace doesn't exist. Refactor for naming conventions and consolidating identifier handling into NamespaceHandlers. Make FileIO instantiated fresh for each newTableOps call. * Move private constructor to top, add assertion to test case. * Define minimal ResultSetParser/QueryHarness classes to fully replace any use of commons-dbutils; refactor ResultSet handling fully into JdbcSnowflakeClient.java. * Update snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java Co-authored-by: Eduard Tudenhöfner <[email protected]> * Refactor style suggestions; remove debug-level logging, arguments in exceptions, private members if not accessed outside, move precondition checks, add test for NamespaceHelpers. * Fix precondition messages, remove getConf() * Clean up varargs. * Make data members final, include rawJsonVal in toString for debuggability. * Combine some small test cases into roundtrip test cases, misc cleanup * Add comment for why a factory class is exposed for testing purposes. Co-authored-by: Dennis Huo <[email protected]> Co-authored-by: Maninder Parmar <[email protected]> Co-authored-by: Maninder Parmar <[email protected]> Co-authored-by: Eduard Tudenhöfner <[email protected]>

rdblue added 3 commits June 13, 2022 19:39

Retain unreferenced snapshots using table settings.

0c08d16

Improve tests and default the main branch.

c3ae8b0

Fix accidental change.

24548e3

github-actions bot added the CORE label Jun 14, 2022

rdblue mentioned this pull request Jun 14, 2022

Core: Update Remove Snapshots procedure for branching and tagging apache/iceberg#4578

Merged

amogh-jahagirdar reviewed Jun 14, 2022

View reviewed changes

amogh-jahagirdar merged commit ee1479d into amogh-jahagirdar:snapshot-ref-retention Jun 14, 2022

namrathamyske added a commit that referenced this pull request Aug 7, 2022

Merge pull request #1 from amogh-jahagirdar/branch-snapshot-producing

9c1a2d2

Core: Pass in parent snapshot to apply and validate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix remaining issues with snapshot expiration for branching and tagging #1

Fix remaining issues with snapshot expiration for branching and tagging #1

Uh oh!

rdblue commented Jun 14, 2022

Uh oh!

amogh-jahagirdar Jun 14, 2022

Uh oh!

rdblue Jun 14, 2022

Uh oh!

amogh-jahagirdar Jun 14, 2022

Uh oh!

rdblue Jun 14, 2022

Uh oh!

amogh-jahagirdar commented Jun 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix remaining issues with snapshot expiration for branching and tagging #1

Fix remaining issues with snapshot expiration for branching and tagging #1

Uh oh!

Conversation

rdblue commented Jun 14, 2022

Uh oh!

amogh-jahagirdar Jun 14, 2022

Choose a reason for hiding this comment

Uh oh!

rdblue Jun 14, 2022

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar Jun 14, 2022

Choose a reason for hiding this comment

Uh oh!

rdblue Jun 14, 2022

Choose a reason for hiding this comment

Uh oh!

amogh-jahagirdar commented Jun 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants