Python: Add updates, moves and deletes #8374

Fokko · 2023-08-22T18:54:55Z

No description provided.

python/mkdocs/docs/api.md

rdblue · 2023-08-24T19:53:36Z

python/pyiceberg/table/__init__.py

        # Strip the catalog name
        if len(self._updates) > 0:
-            response = self._table.catalog._commit_table(  # pylint: disable=W0212
+            self._table._do_commit(  # pylint: disable=W0212


I'm fine with this being private for now, but we may want to actually expose it later.

rdblue · 2023-08-25T00:30:48Z

python/pyiceberg/table/__init__.py

+        exists = False
+        try:
+            exists = self._schema.find_field(full_name, self._case_sensitive) is not None
+        except ValueError:


Why would find_field throw a ValueError? Shouldn't it return None instead?

    def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:
        """Find a field using a field name or field ID.

        Args:
            name_or_id (Union[str, int]): Either a field name or a field ID.
            case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True.

        Raises:
            ValueError: When the value cannot be found.

        Returns:
            NestedField: The matched NestedField.
        """

It throws an error indeed.

This was in there before any PyIceberg release. It follows the "Easier to Ask Forgiveness than Permission" (EAFP) style, which is quite popular in Python. I personally prefer to avoid this pattern and the exceptions that come with it, all the more now that we have the walrus operator (:=).

Agreed. I guess it's established already though.
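For readers following the thread, here is a minimal sketch of the alternative being discussed: wrapping the raising lookup once so that call sites can test for None with the walrus operator. `Field`, `MiniSchema`, and `find_field_or_none` are hypothetical stand-ins written for illustration, not PyIceberg classes; only the behavior of raising ValueError on a missing field is taken from the docstring above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for NestedField."""

    field_id: int
    name: str


class MiniSchema:
    """Hypothetical stand-in for Schema; find_field raises ValueError on a miss."""

    def __init__(self, *fields: Field) -> None:
        self._by_name: Dict[str, Field] = {f.name: f for f in fields}

    def find_field(self, name: str) -> Field:
        try:
            return self._by_name[name]
        except KeyError:
            raise ValueError(f"Could not find field with name {name}") from None


def find_field_or_none(schema: MiniSchema, name: str) -> Optional[Field]:
    """Catch the ValueError in one place and return None instead."""
    try:
        return schema.find_field(name)
    except ValueError:
        return None


schema = MiniSchema(Field(1, "bid"), Field(2, "ask"))

# Call sites can then use the walrus operator instead of try/except.
found_id = None
if (field := find_field_or_none(schema, "bid")) is not None:
    found_id = field.field_id

exists = find_field_or_none(schema, "missing") is not None
```

The try/except does not disappear; it moves into a single helper, so the existing `find_field` contract stays untouched.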

rdblue · 2023-08-25T00:33:29Z

python/pyiceberg/table/__init__.py

+        new_type = assign_fresh_schema_ids(field_type, self.assign_new_column_id)
+        field = NestedField(field_id=new_id, name=name, field_type=new_type, required=required, doc=doc)
+
+        self._adds[parent_id] = self._adds.get(parent_id, []) + [field]


Maybe it's me, but using + makes me wonder about the argument types. I find this more clear:

self._adds[parent_id] = tuple(*self._adds.get(parent_id, []), field)

I'm open to making it more readable, but that doesn't work on my machine:

    ➜ python git:(fd-followup) python3
    Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tuple(*[1, 2], 3)

What does work is:

    self._adds[parent_id] = self._adds.get(parent_id, tuple()) + (field,)

But then we end up with the same thing as before, only with a tuple instead of a list. I would suggest refactoring it as:

    if parent_id in self._adds:
        self._adds[parent_id].append(field)
    else:
        self._adds[parent_id] = [field]

Then we don't recreate the collections all the time (but this isn't the hottest path, probably)
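As a side note to this thread, the same append-or-create pattern can be sketched with standard-library idioms. This is an illustration only: the dictionaries below are plain stand-ins for `self._adds`, strings stand in for fields, and the parent ID `-1` is an arbitrary assumption.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# Option 1: dict.setdefault creates the list on first use, then appends in place.
adds: Dict[int, List[str]] = {}
adds.setdefault(-1, []).append("a")
adds.setdefault(-1, []).append("b")

# Option 2: defaultdict(list) creates the list implicitly on first access.
adds_dd: DefaultDict[int, List[str]] = defaultdict(list)
adds_dd[-1].append("a")
adds_dd[5].append("c")

# Tuple variant: unpacking works inside a tuple display, whereas
# tuple(*existing, field) passes several positional arguments to tuple().
adds_t: Dict[int, Tuple[str, ...]] = {}
adds_t[-1] = (*adds_t.get(-1, ()), "a")
adds_t[-1] = (*adds_t.get(-1, ()), "b")
```

The first two options mutate the existing list, so nothing is rebuilt on each add; the tuple variant still creates a new tuple every time.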

rdblue · 2023-09-01T23:24:42Z

python/pyiceberg/table/__init__.py

+
+            if self._transaction is not None:
+                self._transaction._append_updates(*updates)  # pylint: disable=W0212
+                self._transaction._append_requirements(*requirements)  # pylint: disable=W0212


Not a problem, but note that if you run multiple schema updates in the same transaction, the transaction needs to ensure it sends only one AssertCurrentSchemaId. Right now that will fail but in the future we'll want to suppress the additional ones so that only the first schema ID is validated.

Great observation. I checked and we only allow a single AddSchemaUpdate in a transaction. So I think we're okay for now. I've added a test to capture the behavior.
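The future behavior described above (suppressing the additional schema ID assertions so that only the first is validated) could look roughly like the sketch below. It is not the PR's implementation: the requirement classes are simplified, hypothetical stand-ins, and `keep_first_schema_assertion` is an invented helper name.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class AssertCurrentSchemaId:
    """Hypothetical, simplified stand-in for the table requirement."""

    current_schema_id: int


@dataclass(frozen=True)
class AssertTableUUID:
    """Hypothetical second requirement type, included only for contrast."""

    uuid: str


Requirement = Union[AssertCurrentSchemaId, AssertTableUUID]


def keep_first_schema_assertion(requirements: Tuple[Requirement, ...]) -> Tuple[Requirement, ...]:
    """Keep every requirement, but only the first AssertCurrentSchemaId."""
    result: List[Requirement] = []
    seen_schema_assertion = False
    for requirement in requirements:
        if isinstance(requirement, AssertCurrentSchemaId):
            if seen_schema_assertion:
                continue  # later schema updates build on the first one
            seen_schema_assertion = True
        result.append(requirement)
    return tuple(result)


deduplicated = keep_first_schema_assertion(
    (AssertCurrentSchemaId(0), AssertTableUUID("u"), AssertCurrentSchemaId(1))
)
```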

rdblue · 2023-09-01T23:31:55Z

python/pyiceberg/table/__init__.py

+            raise ValueError(f"Cannot update a column that will be deleted: {full_name}")
+
+        if field_type is not None:
+            if not self._allow_incompatible_changes and field.field_type != field_type:


The Java equivalent ensures that only primitive fields are updated because the only type that can be passed in is a PrimitiveType. I think we need to have an additional check here that this is not replacing a nested type.

It may be that promote already does this for us, but I want to highlight that we cannot allow calling this to update a struct, map, or list type.

Schema evolution catches this:

Cannot change column type: foo: struct<2: bar: optional string> -> string

I've added a check to make the error more specific:

Cannot change column type: struct<2: bar: optional string> is not a primitive
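A minimal sketch of such a guard, assuming a simplified type hierarchy: `IcebergType`, `PrimitiveType`, `StructType`, and `check_primitive_update` below are hypothetical stand-ins written for this example, and only the error message format is taken from the comment above.

```python
from dataclasses import dataclass
from typing import Tuple


class IcebergType:
    """Hypothetical base class for the types in this sketch."""


@dataclass(frozen=True)
class PrimitiveType(IcebergType):
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class StructType(IcebergType):
    fields: Tuple[str, ...]

    def __str__(self) -> str:
        return "struct<" + ", ".join(self.fields) + ">"


def check_primitive_update(current_type: IcebergType) -> None:
    """Reject a type update when the existing column is a nested type."""
    if not isinstance(current_type, PrimitiveType):
        raise ValueError(f"Cannot change column type: {current_type} is not a primitive")


check_primitive_update(PrimitiveType("int"))  # a primitive column passes the guard

error_message = None
try:
    check_primitive_update(StructType(("2: bar: optional string",)))
except ValueError as exc:
    error_message = str(exc)
```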

rdblue · 2023-09-01T23:38:54Z

python/tests/test_integration_schema.py

+
+
+@pytest.fixture()
+def catalog() -> Catalog:


Why are all of these tests done in integration tests? In Java, we test the schema update API by calling apply to produce the schema that will be committed. Only the commit needs to be done in an integration test that way.

For me it adds another layer of testing, since the REST catalog will also reject certain things (for example, the requirements on the identifier fields). This has already caught a few bugs. I also like to follow the same path that the end user would take.

Okay, I get it now. Seems alright, although the REST catalog really doesn't check much. It would probably be good to move most of the schema evolution behavior tests to unit tests, but the important thing is that they're happening somewhere.

rdblue

Looks really good to me! I think the only serious issues are:

Schema ID is not being assigned in commit, which corrupts metadata
Tests are run as integration tests, but most could be converted to unit tests that call _apply
update_column always creates an entry in changes even if the change should be a noop

rdblue · 2023-09-01T23:57:42Z

Awesome work, @Fokko! This is really close.

Fokko · 2023-09-03T21:12:38Z

Schema ID is not being assigned in commit, which corrupts metadata

This was a great catch, updated this! The REST catalog actually handled this correctly, but now it is also fixed and tested outside of the REST catalog.

Tests are run as integration tests, but most could be converted to unit tests that call _apply

I ported some tests outside of the integration tests as well. As mentioned above, I really like testing against the REST catalog, since it uses the same public API that the customer uses (which is a good thing from an OOP perspective), but we're testing much more code than we actually need. I also ported some tests back to avoid having only integration tests, but the majority are still integration tests because I prefer them. Let me know if you want me to port the rest back as well.

update_column always creates an entry in changes even if the change should be a noop

Added and tested as well, also removed the make_required and set_doc operations.
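To illustrate the no-op point, here is a rough sketch of an `update_column` that records a change only when something actually differs. `Column` and `UpdateSketch` are hypothetical, simplified stand-ins and do not reflect the real `UpdateSchema` internals.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical, flattened stand-in for a schema field."""

    field_id: int
    name: str
    field_type: str
    required: bool = False
    doc: Optional[str] = None


class UpdateSketch:
    def __init__(self, columns: Iterable[Column]) -> None:
        self._columns: Dict[str, Column] = {c.name: c for c in columns}
        self.changes: Dict[int, Column] = {}

    def update_column(
        self,
        name: str,
        field_type: Optional[str] = None,
        required: Optional[bool] = None,
        doc: Optional[str] = None,
    ) -> "UpdateSketch":
        current = self._columns[name]
        # Start from a pending change for this column, if there is one.
        base = self.changes.get(current.field_id, current)
        updated = replace(
            base,
            field_type=base.field_type if field_type is None else field_type,
            required=base.required if required is None else required,
            doc=base.doc if doc is None else doc,
        )
        if updated == base:
            return self  # no-op: do not create an entry in changes
        self.changes[base.field_id] = updated
        return self


update = UpdateSketch([Column(1, "bid", "int")])
update.update_column("bid", field_type="int")  # same type, nothing recorded
changes_after_noop = dict(update.changes)
update.update_column("bid", field_type="long")  # real change, recorded
```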

Fixes apache#8374

Fokko · 2023-09-03T21:41:18Z

I accidentally tagged this PR in another PR that was closed.

rdblue · 2023-09-04T15:44:04Z

python/mkdocs/docs/api.md

    update.move_after("bid", "ask")
-    # In a struct, only the new name field
-    update.move_before("details.exchange", "properties.created_by")
+    # This will move `confirmed_by` before `exchange`


Nit: created_by is moved, not confirmed_by.

rdblue

Looks good to me. I don't understand why so many schema tests are run as integration tests, but at least they run somewhere. We can get this in to unblock the release at least.

Fokko added 2 commits August 22, 2023 14:52

Test windows release

f47a987

Python: Follow up on update schema

b0c0e1d

github-actions bot added the python label Aug 22, 2023

rdblue reviewed Aug 22, 2023

python/mkdocs/docs/api.md

Cleanup

ec5ed0b

Fokko force-pushed the fd-followup branch from b74a489 to ec5ed0b Compare August 23, 2023 05:42

Fokko added 5 commits August 23, 2023 07:43

Update python/mkdocs/docs/api.md

61019cb

WIP

48a2057

Fix nested fields

31b06e4

Merge branch 'fd-followup' of github.com:Fokko/incubator-iceberg into…

7b81998

… fd-followup

Merge branch 'master' of github.com:Fokko/incubator-iceberg into fd-f…

794d04a

…ollowup

Fokko mentioned this pull request Aug 24, 2023

Python: Support for adding columns #8174

Merged

Fokko changed the title ~~Python: Follow up on update schema~~ Python: Add updates, moves and deletes Aug 24, 2023

Fokko added 2 commits August 24, 2023 14:13

First the code, next the tests

1ff0bec

Step into the right direction

f9fc305

rdblue reviewed Aug 24, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 24, 2023

python/pyiceberg/table/__init__.py

Fokko added 2 commits August 24, 2023 22:08

MOAR Tests

0c4405d

MOAR tests for renaming of columns

641d61e

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 25, 2023

python/pyiceberg/table/__init__.py