Deprecate Redundant Identifier Support in TableIdentifier, and row_filter #994

sungwy · 2024-08-03T23:36:10Z

Today in PyIceberg, we have support for identifier parsing in public APIs
belonging to two different classes:

Catalog class: load_table, purge_table, drop_table
Table class: scan

These APIs currently have optional support for the identifier that the
instance itself belongs to.

For example, the catalog class APIs support:

catalog = load_catalog("animals", **properties)
catalog.load_table("cats.whiskers")

But it also supports:

catalog.load_table("animals.cats.whiskers")

This is redundant, because catalog.name is already "animals".

Similarly, row_filter in the Table scan API supports:

table = catalog.load_table("cats.whiskers")
table.scan(row_filter="n_legs == 4")

But we also support:

table.scan(row_filter="whiskers.n_legs == 4")

This is also redundant, because the table name is already "whiskers" (or
"cats.whiskers").

The benefits of this change are as follows:

As observed above, specifying the instance-level identifier in these APIs is redundant.
This optional support adds a lot of complexity to the code base and leads to issues like "Rest Catalog: catalog.name should not be part of namespace" (#742). It would be great to clean this up as we prepare for a 1.0.0 release later this year.
The optional support opens up the possibility of correctness issues if a name at the level below matches the instance-level identifier.
For example, if the above catalog has a namespace named "animals.lower", then catalog.load_table("animals.lower.cats") can be construed as table "cats" in the namespace "animals.lower", but it will be interpreted as table "cats" in the namespace "lower", which is erroneous.
We would see a similar issue with tables and field names as well. Field name parsing is already complicated because we have to represent nested fields as flat representations, so it would be great to remove one unnecessary level of complication here.
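The ambiguity described above can be reproduced with a small standalone sketch. It does not call PyIceberg; `strip_catalog_prefix` is a hypothetical helper, and the "three or more parts" rule is an assumption about how the optional prefix is detected.

```python
from typing import Tuple

Identifier = Tuple[str, ...]


def strip_catalog_prefix(catalog_name: str, identifier: str) -> Identifier:
    """Hypothetical helper: drop a leading component that equals the catalog name.

    Assumption: the prefix is only stripped when the identifier has at least
    three parts, so that "namespace.table" is never shortened.
    """
    parts = tuple(identifier.split("."))
    if len(parts) >= 3 and parts[0] == catalog_name:
        return parts[1:]
    return parts


# The catalog is called "animals" and also contains a namespace "animals.lower".
resolved = strip_catalog_prefix("animals", "animals.lower.cats")
print(resolved)  # ('lower', 'cats'): namespace "lower" instead of "animals.lower"
```

Because the first component happens to match the catalog name, there is no way for such a helper to tell the two readings apart, which is the correctness issue the deprecation is meant to remove.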

We opened a mailing list thread to discuss the potential impact of this change, but it didn't get much traction. Based on the few responses we got in the Slack channel, there is support for making this change.

Note: This PR takes a much more conservative approach than the previous PR (#963) by marking the usage of these redundant identifiers with a deprecation message. In 0.8.0 this usage pattern will log DeprecationWarnings, and in 0.9.0 we plan to remove it and clean up the impacted functions.

Mail list Discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2

ndrluis · 2024-08-04T01:18:36Z

@sungwy Awesome work!

Do you know if these changes could impact the multi-level namespace case? I believe it would be nice for us to have some integration tests with Polaris. I think it’s the only catalog that implements the nested namespace.

https://polaris.io/#tag/Polaris-Catalog-Entities/Namespace

sungwy · 2024-08-05T14:32:45Z

> @sungwy Awesome work!
>
> Do you know if these changes could impact the multi-level namespace case? I believe it would be nice for us to have some integration tests with Polaris. I think it’s the only catalog that implements the nested namespace.
>
> https://polaris.io/#tag/Polaris-Catalog-Entities/Namespace

Hi @ndrluis - thank you for bringing up this point. I'm of the opinion that having a 'representative' implementation of the REST catalog would be good enough to serve this purpose. Currently we are using a Tabular REST catalog image for our tests, but I think it would be a good idea to explore other alternatives.

I've created a new test to validate the behavior against the REST catalog in our integration test suite, and I believe we also have tests using multilevel namespaces in test_base.py.

sungwy · 2024-08-05T14:50:47Z

pyiceberg/catalog/__init__.py

@@ -613,6 +613,11 @@ def update_namespace_properties(
            ValueError: If removals and updates have overlapping keys.
        """

+    @deprecated(


This deprecates the public function.

sungwy · 2024-08-05T14:51:27Z

pyiceberg/catalog/__init__.py

@@ -627,6 +632,25 @@ def identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier]
            identifier_tuple = identifier_tuple[1:]
        return identifier_tuple

+    def _identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier]) -> Identifier:


Whereas this one is now called by the PyIceberg functions, and only prints the deprecation warning message if the unsupported naming convention is used.
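A minimal sketch of that pattern, using only the standard library: the function warns only when the redundant catalog prefix is actually present and stays silent otherwise. The function name, the message text, and the prefix-detection rule are assumptions for illustration, not the code from this PR.

```python
import warnings
from typing import Tuple, Union

Identifier = Tuple[str, ...]


def to_tuple_without_catalog(catalog_name: str, identifier: Union[str, Identifier]) -> Identifier:
    """Hypothetical stand-in for the private helper discussed in this review."""
    parts = tuple(identifier.split(".")) if isinstance(identifier, str) else tuple(identifier)
    # Assumption: a leading component equal to the catalog name is the deprecated form.
    if len(parts) >= 3 and parts[0] == catalog_name:
        warnings.warn(
            "Passing the catalog name as part of the table identifier is deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        return parts[1:]
    return parts


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    to_tuple_without_catalog("animals", "cats.whiskers")          # supported form, no warning
    to_tuple_without_catalog("animals", "animals.cats.whiskers")  # deprecated form, warns
    print(len(caught))  # 1
```

Keeping the warning inside the branch means callers that already use the supported form see no change in behavior or output.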

kevinjqliu · 2024-08-07T16:58:17Z

> Do you know if these changes could impact the multi-level namespace case?

I think Polaris is using the same concept of namespace as we do in PyIceberg.

Here's my mental model:
Catalog == container of namespaces.
Namespaces can be hierarchical.
For example:
catalog1.foo.bar
catalog1.a.b.c.d.e

name[0] == catalog name (catalog1)
name[-1] == table name (e)
name[1:-1] == namespace (a.b.c.d)
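The mental model above maps directly onto list slicing. The following is a small sketch with hypothetical names (`QualifiedName`, `split_qualified_name`); it assumes a fully qualified, dot-separated name that always starts with the catalog name.

```python
from typing import NamedTuple, Tuple


class QualifiedName(NamedTuple):
    catalog: str
    namespace: Tuple[str, ...]
    table: str


def split_qualified_name(name: str) -> QualifiedName:
    """Split "catalog.ns1.ns2.table" into its three logical parts."""
    parts = name.split(".")
    if len(parts) < 2:
        raise ValueError(f"Expected at least a catalog and a table name, got: {name!r}")
    # parts[0] is the catalog, parts[-1] the table, everything in between the namespace.
    return QualifiedName(catalog=parts[0], namespace=tuple(parts[1:-1]), table=parts[-1])


print(split_qualified_name("catalog1.foo.bar"))
print(split_qualified_name("catalog1.a.b.c.d.e"))
```

Note that this split is only unambiguous when the catalog name is always present; once it is optional, the first component can be either a catalog or the top level of a namespace.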

kevinjqliu

Thanks for making this change! I think it'll be beneficial to the overall project going forward.
Having a deprecation message for an entire minor version (0.8.0 -> 0.9.0) would give users enough time to migrate.

kevinjqliu · 2024-08-07T16:58:35Z

tests/integration/test_writes/test_writes.py

+
+
+@pytest.mark.integration
+def test_table_v1_with_null_nested_namespace(session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:


What is this test for?

sungwy · 2024-08-07

It's for testing against nested / multilevel namespaces; it was a helpful review suggestion from @ndrluis.

Fokko · 2024-08-08T11:00:09Z

pyiceberg/expressions/parser.py

+            help_message="Parsing expressions with table name is deprecated. Only provide field names in the row_filter.",
+        )
+    # TODO: Once this is removed, we will no longer take just the last index of parsed column result
+    # And introduce support for parsing filter expressions with nested fields.


I came here to point this out indeed. Seems to fail on several levels; I don't think this was ever properly tested.
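To illustrate the concern in the TODO, here is a hedged sketch of what "taking just the last index" of a dotted column reference implies. It is not the PyIceberg parser; `resolve_column_last_component` and the field names are hypothetical.

```python
def resolve_column_last_component(reference: str) -> str:
    """Keep only the last dotted component of a column reference (assumed behavior)."""
    return reference.split(".")[-1]


# A table-name prefix is silently dropped, which is the deprecated convenience.
print(resolve_column_last_component("whiskers.n_legs"))  # n_legs

# A nested field loses its parent, so "location.lat" and "origin.lat" collide.
print(resolve_column_last_component("location.lat"))  # lat
print(resolve_column_last_component("origin.lat"))    # lat
```

Under this assumption, a filter on a nested field cannot be distinguished from a filter on a top-level field with the same leaf name, which is why removing the table-name prefix is a prerequisite for proper nested-field support.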

Fokko

I'm in favor of this change. We should have done it a long time ago 👍 Thanks for doing this @sungwy

Fokko · 2024-08-08T11:01:45Z

pyiceberg/table/__init__.py

        self.metadata = fresh.metadata
        self.io = fresh.io
        self.metadata_location = fresh.metadata_location
        return self

+    @property
+    def identifier(self) -> Identifier:


Fokko · 2024-08-08T11:02:36Z

tests/integration/test_writes/test_writes.py

+    identifier = "default.lower.table_v1_with_null_nested_namespace"
+    tbl = _create_table(session_catalog, identifier, {"format-version": "1"}, [arrow_table_with_null])
+    assert tbl.format_version == 1, f"Expected v1, got: v{tbl.format_version}"
+    print(session_catalog.list_tables("default"))


Let's remove this print before merging

sungwy · 2024-08-08T14:56:58Z

Merged! 🙌 Thank you for the reviews @ndrluis @kevinjqliu and @Fokko !

sungwy added 2 commits August 3, 2024 19:32

deprecate usage of catalog in table identifier

9ccc116

deprecate usage of table name in row_filter expression

f7c8427

sungwy requested review from Fokko, kevinjqliu and HonahX August 3, 2024 23:40

sungwy mentioned this pull request Aug 3, 2024

Remove support for catalog_name in table identifier string #963

Draft

sungwy added 2 commits August 5, 2024 10:13

add a test with nested namespace

3a94fcc

better deprecation strategy

9e3e2e8

lint

5929c2c

sungwy commented Aug 5, 2024


kevinjqliu approved these changes Aug 7, 2024


kevinjqliu added this to the PyIceberg 0.8.0 release milestone Aug 7, 2024

Fokko reviewed Aug 8, 2024


Fokko approved these changes Aug 8, 2024


sungwy added 2 commits August 8, 2024 14:33

Merge branch 'main' into depr-higher-identifier-support

3ce3b65

stray print

3907222

sungwy merged commit eca9870 into apache:main Aug 8, 2024
7 checks passed

sungwy deleted the depr-higher-identifier-support branch August 8, 2024 15:44

kevinjqliu mentioned this pull request Sep 13, 2024

[bug] [REST] Dont remove identifier root #1172

Merged
