Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Aug 22, 2023

No description provided.

@Fokko Fokko changed the title Python: Follow up on update schema Python: Add updates, moves and deletes Aug 24, 2023
# Strip the catalog name
if len(self._updates) > 0:
response = self._table.catalog._commit_table( # pylint: disable=W0212
self._table._do_commit( # pylint: disable=W0212
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this being private for now, but we may want to actually expose it later.

exists = False
try:
exists = self._schema.find_field(full_name, self._case_sensitive) is not None
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would find_field throw a ValueError? Shouldn't it return None instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:
    """Find a field using a field name or field ID.

    Args:
        name_or_id (Union[str, int]): Either a field name or a field ID.
        case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True.

    Raises:
        ValueError: When the value cannot be found.

    Returns:
        NestedField: The matched NestedField.
    """

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws an error indeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in there before any PyIceberg release. This follows the Easier to Ask Forgiveness Than Permission style, which is quite popular in Python. I personally like to avoid this pattern and avoid the exceptions. The more now we have the walrus := operator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I guess it's established already though.

new_type = assign_fresh_schema_ids(field_type, self.assign_new_column_id)
field = NestedField(field_id=new_id, name=name, field_type=new_type, required=required, doc=doc)

self._adds[parent_id] = self._adds.get(parent_id, []) + [field]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's me, but using + makes me wonder about the argument types. I find this more clear:

self._adds[parent_id] = tuple(*self._adds.get(parent_id, []), field)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to making it more readable, but that doesn't work on my machine:

python git:(fd-followup) python3
Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> tuple(*[1, 2], 3)

What does work is:

self._adds[parent_id] = self._adds.get(parent_id, tuple()), + (field,)

But then we end up with the same thing as before, but then a tuple. I would suggest refactoring it as:

if parent_id in self._adds:
    self._adds[parent_id].append(field)
else:
    self._adds[parent_id] = [field]

Then we don't recreate the collections all the time (but this isn't the hottest path, probably)


if self._transaction is not None:
self._transaction._append_updates(*updates) # pylint: disable=W0212
self._transaction._append_requirements(*requirements) # pylint: disable=W0212
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem, but note that if you run multiple schema updates in the same transaction, the transaction needs to ensure it sends only one AssertCurrentSchemaId. Right now that will fail but in the future we'll want to suppress the additional ones so that only the first schema ID is validated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great observation. I checked and we only allow a single AddSchemaUpdate in a transaction. So I think we're okay for now. I've added a test to capture the behavior.

raise ValueError(f"Cannot update a column that will be deleted: {full_name}")

if field_type is not None:
if not self._allow_incompatible_changes and field.field_type != field_type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java equivalent ensures that only primitive fields are updated because the only type that can be passed in is a PrimitiveType. I think we need to have an additional check here that this is not replacing a nested type.

It may be that promote already does this for us, but I want to highlight that we cannot allow calling this to update a struct, map, or list type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema evolution catches this:

Cannot change column type: foo: struct<2: bar: optional string> -> string

I've added a check to make the error more specific error:

Cannot change column type: struct<2: bar: optional string> is not a primitive



@pytest.fixture()
def catalog() -> Catalog:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all of these tests done in integration tests? In Java, we test the schema update API by calling apply to produce the schema that will be committed. Only the commit needs to be done in an integration test that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it adds another layer of testing since the REST catalog will also reject certain things (for example, the requirements on the identifier fields), this caught already a few bugs. I also like to follow the same path that the end user would take

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I get it now. Seems alright, although the REST catalog really doesn't check much. It would probably be good to move most of the schema evolution behavior tests to unit tests, but the important thing is that they're happening somewhere.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me! I think the only serious issues are:

  • Schema ID is not being assigned in commit, which corrupts metadata
  • Tests are run as integration tests, but most could be converted to unit tests that call _apply
  • update_column always creates an entry in changes even if the change should be a noop

@rdblue
Copy link
Contributor

rdblue commented Sep 1, 2023

Awesome work, @Fokko! This is really close.

@Fokko
Copy link
Contributor Author

Fokko commented Sep 3, 2023

Schema ID is not being assigned in commit, which corrupts metadata

This was a great catch, updated this! The REST Catalog actually handled this correctly, but now it is also fixed and tested outside of the rest catalog.

Tests are run as integration tests, but most could be converted to unit tests that call _apply

I ported some tests outside of the integration tests as well. As mentioned above I really like testing against the REST catalog, since uses the same public API that the customer uses (which is a good thing from an OOP perspective), but we're testing much more code than we actually need. I also ported some test back to avoid having only integration tests, but the majority is still IT tests because I prefer them. Let me know if you want me to port the rest back as well.

update_column always creates an entry in changes even if the change should be a noop

Added and tested as well, also removed the make_required and set_doc operations.

Fokko added a commit to Fokko/iceberg that referenced this pull request Sep 3, 2023
Fokko added a commit to Fokko/iceberg that referenced this pull request Sep 3, 2023
@Fokko Fokko mentioned this pull request Sep 3, 2023
@Fokko Fokko reopened this Sep 3, 2023
@Fokko
Copy link
Contributor Author

Fokko commented Sep 3, 2023

I accidentally tagged this PR in another PR that was closed.

@Fokko Fokko mentioned this pull request Sep 3, 2023
update.move_after("bid", "ask")
# In a struct, only the new name field
update.move_before("details.exchange", "properties.created_by")
# This will move `confirmed_by` before `exchange`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: created_by is moved, not confirmed_by.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I don't understand why so many schema tests are run as integration tests, but at least they run somewhere. We can get this in to unblock the release at least.

@rdblue rdblue merged commit 751d153 into apache:master Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants