Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate Redundant Identifier Support in TableIdentifier, and row_filter #994

Merged
merged 7 commits into from
Aug 8, 2024

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Aug 3, 2024

Today in PyIceberg, we have support for identifier parsing in public APIs
belonging to two different classes:

  • Catalog class: load_table, purge_table, drop_table
  • Table class: scan

These APIs currently have optional support for the identifier that the
instance itself belongs to.

For example, the catalog class APIs support:

catalog = load_catalog(“animals”, **properties)
catalog.load_table(“cats.whiskers”)

But it also supports:

catalog.load_table(“animals.cats.whiskers”)

Which is redundant, because the catalog.name is already “animals”.

Similarly, row_filter in the Table scan API supports:

table = catalog.load_table(“cats.whiskers”)
table.scan(row_filter=”n_legs== 4”)

But we also support

table.scan(row_filter=”whiskers.n_legs == 4”)

Which is also redundant, because the table name is already “whiskers” (or
cats.whiskers)

The benefits of this change are as follows:

  • As observed above, specifying instance-level identifier in these APIs
    is redundant
  • This optional support adds a lot of complexity to the code base and
    leads to issues like: Rest Catalog: catalog.name should not be part of namespace #742
    Rest Catalog: catalog.name should not be part of namespace #742 It would be really
    great to clean this up before as we prepare for a 1.0.0 later this year
  • The optional support opens up the possibility of resulting in
    correctness issues if there exists a name in the level below as the
    instance-level identifier.
  • For example, if in the above catalog, we have a table namespace
    named “animals.lower” catalog.load(“animals.lower.cats”) can be construed
    as table name “cats” in the namespace “animals.lower” but it will be
    interpreted as table name “cats” in the namespace “lower” which is
    erroneous.
  • We would see a similar issue with tables and field names as well.
    Field name parsing is already complicated because we have to represent
    nested fields as flat representations. So it would be great to remove one
    unnecessary level of complication here

We had opened the mail list thread in order to discuss the potential impact of this change, but it didn't get much traction. Based on the very few responses we got from the Slack channel, there is support in making this change.

Note: This PR takes a much more conservative approach than the previous PR (#963) by marking the usage of these redundant identifiers with a deprecation message. In 0.8.0 this usage pattern will log DeprecationWarnings, and in 0.9.0 we will plan to remove these and clean up the impacted functions.

Mail list Discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2

@ndrluis
Copy link
Collaborator

ndrluis commented Aug 4, 2024

@sungwy Awesome work!

Do you know if these changes could impact the multi-level namespace case? I believe it would be nice for us to have some integration tests with Polaris. I think it’s the only catalog that implements the nested namespace.

https://polaris.io/#tag/Polaris-Catalog-Entities/Namespace

@sungwy
Copy link
Collaborator Author

sungwy commented Aug 5, 2024

@sungwy Awesome work!

Do you know if these changes could impact the multi-level namespace case? I believe it would be nice for us to have some integration tests with Polaris. I think it’s the only catalog that implements the nested namespace.

https://polaris.io/#tag/Polaris-Catalog-Entities/Namespace

Hi @ndrluis - thank you for bringing up this point. I'm of the opinion that having a 'representative' implementation of the REST catalog would be good enough to serve this purpose. Currently we are using a tabular rest catalog image for our tests, but I think it would be a good idea to explore other alternatives.

I've created a new test to validate the behavior against the REST catalog in our integration test suite, but I believe we also have tests using multilevel namespaces in test_base.py as well.

@@ -613,6 +613,11 @@ def update_namespace_properties(
ValueError: If removals and updates have overlapping keys.
"""

@deprecated(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deprecates the public function

@@ -627,6 +632,25 @@ def identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier]
identifier_tuple = identifier_tuple[1:]
return identifier_tuple

def _identifier_to_tuple_without_catalog(self, identifier: Union[str, Identifier]) -> Identifier:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where as this is now called by the PyIceberg functions and only prints the deprecation warning message if we are using the unsupported naming convention

@kevinjqliu
Copy link
Contributor

Do you know if these changes could impact the multi-level namespace case?

I think Polaris is using the same concept of namespace as we do in PyIceberg.

Here's my mental model
Catalog == container of namespaces.
Namespaces can be hierarchical
For example:
catalog1.foo.bar
catalog1.a.b.c.d.e

name[0] == catalog name (catalog1)
name[-1] == table name (e)
name[1:-1] == namespace (a.b.c.d)

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this change! I think it'll be beneficial to the overall project going forward.
Having a deprecation message for an entire minor version (0.8.0 -> 0.9.0) would give enough time to users to migrate.



@pytest.mark.integration
def test_table_v1_with_null_nested_namespace(session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this test for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for testing against nested / multilevel namespaces, it was a helpful review suggestion from @ndrluis

@kevinjqliu kevinjqliu added this to the PyIceberg 0.8.0 release milestone Aug 7, 2024
help_message="Parsing expressions with table name is deprecated. Only provide field names in the row_filter.",
)
# TODO: Once this is removed, we will no longer take just the last index of parsed column result
# And introduce support for parsing filter expressions with nested fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came here to point this out indeed. Seems to fail on several levels; I don't think this was ever properly tested.

image

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this change. We should have done it a long time ago 👍 Thanks for doing this @sungwy

self.metadata = fresh.metadata
self.io = fresh.io
self.metadata_location = fresh.metadata_location
return self

@property
def identifier(self) -> Identifier:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

identifier = "default.lower.table_v1_with_null_nested_namespace"
tbl = _create_table(session_catalog, identifier, {"format-version": "1"}, [arrow_table_with_null])
assert tbl.format_version == 1, f"Expected v1, got: v{tbl.format_version}"
print(session_catalog.list_tables("default"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this print before merging

@sungwy sungwy merged commit eca9870 into apache:main Aug 8, 2024
7 checks passed
@sungwy
Copy link
Collaborator Author

sungwy commented Aug 8, 2024

Merged! 🙌 Thank you for the reviews @ndrluis @kevinjqliu and @Fokko !

@sungwy sungwy deleted the depr-higher-identifier-support branch August 8, 2024 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants