Python: Support for adding columns #8174

hililiwei · 2023-07-28T08:39:50Z

Under the umbrella of #7875

What is the purpose of the change

Added the ability to add columns to PyIceberg.

python/pyiceberg/schema.py

python/pyiceberg/table/__init__.py

python/tests/conftest.py

Fokko · 2023-08-01T15:39:41Z

I forgot to mention this earlier: could you also add docs under python/mkdocs/? It would be a pity if people aren't aware of this functionality.

python/tests/test_schema.py

python/pyiceberg/table/__init__.py

python/mkdocs/docs/api.md

python/pyiceberg/table/__init__.py

python/pyiceberg/schema.py

python/pyiceberg/table/__init__.py

python/mkdocs/docs/api.md

python/pyiceberg/schema.py

python/pyiceberg/table/__init__.py

Fokko · 2023-08-22T12:05:57Z

Thanks @hililiwei for working on this 👍🏻

rdblue · 2023-08-22T16:07:44Z

python/mkdocs/docs/api.md

+
+Add new columns through the `Transaction` or `UpdateSchema` API:
+
+Use the Transaction API:


@Fokko, I don't think that we should encourage people to use the transaction API for single operations like this. It's much simpler to avoid using the transaction, right?

Alright, we can swap them around.

rdblue · 2023-08-22T16:12:27Z

python/pyiceberg/schema.py

    @property
    def highest_field_id(self) -> int:
-        return visit(self.as_struct(), _FindLastFieldId())
+        return max(self._lazy_id_to_name.keys(), default=0)


This seems unrelated, though fine.

Should have been a separate PR indeed, but it is a very welcome change

The old way didn't get the highest field id correctly, so I changed it.
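For illustration, a minimal sketch of the replacement logic, assuming the schema keeps a flat mapping from field ID to name that also covers nested fields. The `id_to_name` dict below is a hypothetical stand-in for `_lazy_id_to_name`, not the real attribute.

```python
from typing import Dict


def highest_field_id(id_to_name: Dict[int, str]) -> int:
    """Return the largest field ID, or 0 for an empty schema."""
    # default=0 avoids a ValueError when the mapping is empty.
    return max(id_to_name.keys(), default=0)


# Hypothetical mapping: nested fields can carry IDs larger than
# the ID of the last top-level field.
id_to_name = {1: "id", 2: "location", 4: "location.lat", 5: "location.long", 3: "ts"}
```

Because the maximum is taken over every ID in the mapping, the order in which fields were declared does not matter.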

rdblue · 2023-08-22T16:12:59Z

python/pyiceberg/schema.py

 # specific language governing permissions and limitations
 # under the License.
 # pylint: disable=W0511
+from __future__ import annotations


@Fokko, what is our policy on using this vs using strings? How do we choose?

I can add that to the Type annotations section. I prefer the from __future__ import annotations since it introduces less visual noise.
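As a sketch of the two styles being compared (the class names are hypothetical): both store the forward reference as a string, but the future import removes the need to quote it by hand. The future import must be the first statement of a module, so this example compiles a tiny module from a string to stay self-contained.

```python
import textwrap


# Style 1: quote the forward reference by hand.
class Quoted:
    def copy(self) -> "Quoted":
        return Quoted()


# Style 2: postpone evaluation of all annotations (PEP 563).
MODULE_SOURCE = textwrap.dedent(
    """
    from __future__ import annotations

    class Unquoted:
        def copy(self) -> Unquoted:  # no quotes needed
            return Unquoted()
    """
)
namespace: dict = {}
exec(MODULE_SOURCE, namespace)
Unquoted = namespace["Unquoted"]
```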

rdblue · 2023-08-22T16:13:22Z

python/pyiceberg/schema.py

        self._short_field_names: List[str] = []

+    def before_map_key(self, key: NestedField) -> None:
+        self.before_field(key)


Isn't this the default implementation?

Yes, looks like it. Let me clean that up
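A minimal sketch of why the override is redundant, assuming (as confirmed above) that the base visitor already delegates `before_map_key` to `before_field`. The classes here are hypothetical simplifications, not the PyIceberg visitor classes.

```python
from typing import List


class Visitor:
    """Hypothetical base visitor with 'before' hooks."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def before_field(self, field: str) -> None:
        self.seen.append(field)

    def before_map_key(self, key: str) -> None:
        # Default: treat a map key like any other field.
        self.before_field(key)


class RedundantOverride(Visitor):
    def before_map_key(self, key: str) -> None:
        self.before_field(key)  # identical to the inherited default


class NoOverride(Visitor):
    pass
```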

rdblue · 2023-08-22T16:15:21Z

python/pyiceberg/schema.py

-        self.counter = itertools.count(start)
+    def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None:
        self.reserved_ids = {}
+        counter = itertools.count(1)


Why does this remove start? Isn't that an unnecessary incompatible change?

The start argument wasn't actually exposed:

    def assign_fresh_schema_ids(schema: Schema) -> Schema:
        """Traverses the schema, and sets new IDs."""
        return pre_order_visit(schema, _SetFreshIDs())

The _SetFreshIDs() is private to the module, so we're safe here.

I see, sounds good!
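To make the change concrete, here is a sketch of an ID assigner that takes an optional `next_id_func` and falls back to a counter starting at 1, mirroring the diff above. `FreshIds` and `reserve` are hypothetical names; the real `_SetFreshIDs` is a schema visitor.

```python
import itertools
from typing import Callable, Dict, Optional


class FreshIds:
    """Hypothetical stand-in for the private _SetFreshIDs visitor."""

    def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None:
        self.reserved_ids: Dict[int, int] = {}
        counter = itertools.count(1)
        # Fall back to a simple counter when no ID source is supplied.
        self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter)

    def reserve(self, old_id: int) -> int:
        """Map an existing field ID to a freshly assigned one."""
        new_id = self.next_id_func()
        self.reserved_ids[old_id] = new_id
        return new_id
```

A caller that needs to continue from a table's last column ID can pass its own function, which covers what a `start` argument would have done.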

rdblue · 2023-08-22T16:17:11Z

python/pyiceberg/table/__init__.py

+        """
+        for requirement in new_requirements:
+            type_new_requirement = type(requirement)
+            if any(type(update) == type_new_requirement for update in self._updates):


Should this be _updates or _requirements?

Ouch, copy-paste. Thanks!
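A sketch of the corrected check, comparing each new requirement against the staged requirements rather than the staged updates. The requirement classes, the `PendingChanges` holder and the error message are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AssertTableUUID:  # hypothetical requirement type
    uuid: str


@dataclass(frozen=True)
class AssertCurrentSchemaId:  # hypothetical requirement type
    schema_id: int


class PendingChanges:
    """Hypothetical holder for the requirements staged for one commit."""

    def __init__(self) -> None:
        self._requirements: Tuple[object, ...] = ()

    def append_requirements(self, *new_requirements: object) -> "PendingChanges":
        for requirement in new_requirements:
            # Compare against the staged requirements, not the staged updates.
            if any(type(existing) is type(requirement) for existing in self._requirements):
                raise ValueError(f"Duplicate requirement type: {type(requirement).__name__}")
        self._requirements = self._requirements + new_requirements
        return self
```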

rdblue · 2023-08-22T16:25:44Z

python/pyiceberg/table/__init__.py

+        return self
+
+    def add_column(
+        self, name: str, type_var: IcebergType, doc: Optional[str] = None, parent: Optional[str] = None, required: bool = False


Should we support Tuple[str] for name? That way we don't need parent.

A Tuple[str, ...] would work. I like that

I would suggest Union[str, Tuple[str, ...]] so you can still pass in a string for adding a field to the root
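A sketch of how the `Union[str, Tuple[str, ...]]` suggestion could be normalized internally. `to_path` and `split_parent` are hypothetical helpers; this sketch assumes a plain string always names a root-level column and is not split on dots.

```python
from typing import Tuple, Union


def to_path(name: Union[str, Tuple[str, ...]]) -> Tuple[str, ...]:
    """Normalize a column name into a tuple of path segments."""
    return (name,) if isinstance(name, str) else tuple(name)


def split_parent(path: Tuple[str, ...]) -> Tuple[Tuple[str, ...], str]:
    """Split a path into (parent path, new column name)."""
    return path[:-1], path[-1]
```

With this shape, an empty parent path means the column is added to the root, so no separate `parent` argument is needed.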

rdblue · 2023-08-22T16:33:36Z

python/pyiceberg/table/__init__.py

+        self._internal_add_column(parent, name, not required, type_var, doc)
+        return self
+
+    def allow_incompatible_changes(self) -> UpdateSchema:


What about adding these boolean options to the update_schema(...) method?

table.update_schema(allow_incompatible_changes=True, case_sensitive=False).add_column(...).commit()

That's a nice touch, added

rdblue · 2023-08-22T16:34:35Z

python/pyiceberg/table/__init__.py

+                CommitTableRequest(identifier=self._table.identifier[1:], updates=updates, requirements=requirements)
+            )
+            self._table.metadata = table_update_response.metadata
+            self._table.metadata_location = table_update_response.metadata_location


I don't think these should be managed externally. We don't want code blocks that keep metadata and metadata_location in sync on the table all over the place. Can _table keep track of these and run the commit instead?
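One possible shape for this, as a sketch only: the table exposes a single commit method that runs the catalog call and refreshes both attributes, so callers never touch them. `Table`, `_do_commit`, `CommitResponse` and `fake_commit` below are hypothetical simplifications, and the metadata location is made-up test data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class CommitResponse:  # hypothetical stand-in for the catalog's response
    metadata: Dict[str, Any]
    metadata_location: str


class Table:
    """Hypothetical table that owns its refresh-after-commit logic."""

    def __init__(
        self,
        metadata: Dict[str, Any],
        metadata_location: str,
        commit_func: Callable[[Tuple[Any, ...], Tuple[Any, ...]], CommitResponse],
    ) -> None:
        self.metadata = metadata
        self.metadata_location = metadata_location
        self._commit_func = commit_func

    def _do_commit(self, updates: Tuple[Any, ...], requirements: Tuple[Any, ...]) -> None:
        response = self._commit_func(updates, requirements)
        # The only place where both attributes are updated, always together.
        self.metadata = response.metadata
        self.metadata_location = response.metadata_location


def fake_commit(updates: Tuple[Any, ...], requirements: Tuple[Any, ...]) -> CommitResponse:
    # Hypothetical catalog call used only to exercise the sketch.
    return CommitResponse({"updates-applied": len(updates)}, "s3://bucket/metadata/00001.json")
```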

rdblue · 2023-08-23T17:30:42Z

python/pyiceberg/table/__init__.py

+                parent_field = parent_type.element_field
+
+            if not parent_field.field_type.is_struct:
+                raise ValueError(f"Cannot add column to non-struct type: {parent}")


I think this should also include the parent field name.

rdblue · 2023-08-23T17:32:35Z

python/pyiceberg/table/__init__.py

+            if exist_field:
+                raise ValueError(f"Cannot add column, name already exists: {parent}.{name}")
+
+            full_name = parent_field.name + "." + name


This isn't the full name. The parent field's name is the local name, not its full name.

I think this has been fixed in the refactored version where you need to supply the full path. Let me add some additional tests to make sure that it works as expected.
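To illustrate the difference between a local name and a full name, here is a sketch that walks parent links to build the dotted path. The `FIELDS` table and `full_name` helper are hypothetical and not part of PyIceberg.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical index: field ID -> (local name, parent field ID or None).
FIELDS: Dict[int, Tuple[str, Optional[int]]] = {
    1: ("location", None),
    2: ("point", 1),
    3: ("lat", 2),
}


def full_name(field_id: int, fields: Dict[int, Tuple[str, Optional[int]]]) -> str:
    """Join the local names from the root down to the given field."""
    parts: List[str] = []
    current: Optional[int] = field_id
    while current is not None:
        name, parent = fields[current]
        parts.append(name)
        current = parent
    return ".".join(reversed(parts))
```

Concatenating only the parent's local name would yield `point.alt` for a new `alt` column under `point`, whereas the full path is `location.point.alt`.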

rdblue · 2023-08-23T17:36:28Z

python/pyiceberg/table/__init__.py

+        new_type = assign_fresh_schema_ids(type_var, self.assign_new_column_id)
+        field = NestedField(new_id, name, new_type, not is_optional, doc)
+
+        self._adds.setdefault(parent_id, []).append(field)


@Fokko, looks like this doesn't update _id_to_parent but the Java implementation does. Is that because this is only adding columns right now? I think we should probably keep everything in sync with the reference implementation so we don't have bugs later.

Yes, it is currently only appending columns. To keep things smaller, @hililiwei decided to break it up into smaller chunks, which I think is a good idea.

rdblue · 2023-08-23T17:59:40Z

python/pyiceberg/table/__init__.py

+
+    def schema(self, schema: Schema, struct_result: IcebergType) -> IcebergType:
+        fields = _ApplyChanges.add_fields(schema.as_struct().fields, self.adds.get(TABLE_ROOT_ID))
+        if len(fields) > 0:


This is not the same as the check in Java. In Java, this checks whether there were any changes to apply. It isn't clear what an empty list is here, but I definitely prefer using None to signal that no changes were necessary rather than an empty list.

I agree, this has already been updated in: #8374

rdblue · 2023-08-23T18:13:58Z

python/pyiceberg/table/__init__.py

+
+        is_value_optional = not map_type.value_required
+
+        if is_value_optional != value_field.required and map_type.value_type == value_type:


I find this a bit confusing since it negates value_required above and then checks != here.

Yes, this is only adding columns currently, so this is a bit redundant. It will not be useful until there is an update field. Sorry, it's not clean enough here.

In #8374 I've removed the temporary variable. I think that might introduce confusion.

rdblue · 2023-08-23T18:15:15Z

python/pyiceberg/table/__init__.py

+        if value_type is None:
+            raise ValueError(f"Cannot delete value type from map: {value_field}")
+
+        is_value_optional = not map_type.value_required


This doesn't come from the right place. In Java, this comes from updates where the value may have been made optional.

Yes, similar to the previous comment.

rdblue · 2023-08-23T18:19:12Z

python/pyiceberg/table/__init__.py

+        if is_element_optional == element_field.required and list_type.element_type == element_type:
+            return list_type
+
+        return ListType(list_type.element_id, element_type, is_element_optional)


@Fokko: This is is_element_required so this accidentally flips the boolean. I think we should work with is_required throughout this code to avoid needing to negate and accidentally changing whether types are optional.

Yes, this is correct, this was already fixed in #8374 and I'll make sure to add some more checks
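A sketch of the "work with is_required throughout" suggestion, using a simplified, hypothetical `ListType` with string type names. No boolean is negated, so an unchanged list is returned as-is and requiredness cannot be flipped by accident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:  # hypothetical, simplified list type
    element_id: int
    element_type: str
    element_required: bool


def apply_list(list_type: ListType, element_type: str, element_required: bool) -> ListType:
    """Return the original list type when nothing changed, else a new one."""
    # Compare required with required: no negation anywhere.
    if list_type.element_required == element_required and list_type.element_type == element_type:
        return list_type
    return ListType(list_type.element_id, element_type, element_required)
```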

rdblue · 2023-08-23T18:20:02Z

python/pyiceberg/table/__init__.py

+        if element_type is None:
+            raise ValueError(f"Cannot delete element type from list: {element_field}")
+
+        is_element_optional = not list_type.element_required


I think this is unnecessary without updates. It can remain as a noop check and placeholder, I guess? In any case, the logic should be fixed below.

rdblue · 2023-08-23T18:20:19Z

python/pyiceberg/table/__init__.py

+
+        is_element_optional = not list_type.element_required
+
+        if is_element_optional == element_field.required and list_type.element_type == element_type:


This looks wrong: is_optional == is_required.

python/pyiceberg/table/__init__.py

rdblue · 2023-08-23T18:25:52Z

python/pyiceberg/table/__init__.py

+        new_fields.extend(fields)
+        if adds:
+            new_fields.extend(adds)
+        return new_fields


What about return tuple(*fields, *adds)?

Like it! Updated it to:

    @staticmethod
    def add_fields(fields: Tuple[NestedField, ...], adds: List[NestedField]) -> Optional[List[NestedField]]:
        return None if len(adds) == 0 else tuple(*fields, *adds)
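One caveat worth noting: `tuple()` accepts at most one iterable argument, so `tuple(*fields, *adds)` raises a `TypeError` as soon as more than one element is unpacked. A tuple display, `(*fields, *adds)`, performs the intended concatenation. A minimal sketch, with plain strings standing in for `NestedField`:

```python
from typing import List, Optional, Tuple


def add_fields(fields: Tuple[str, ...], adds: Optional[List[str]]) -> Optional[Tuple[str, ...]]:
    """Return the combined fields, or None when there is nothing to add."""
    if not adds:
        # None signals "no changes", covering both None and an empty list.
        return None
    return (*fields, *adds)
```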

rdblue · 2023-08-23T18:26:40Z

python/pyiceberg/table/__init__.py

+        return primitive
+
+    @staticmethod
+    def add_fields(fields: Tuple[NestedField, ...], adds: Optional[List[NestedField]]) -> List[NestedField]:


If adds is None, then I think this should return None.

See comment above :)

rdblue · 2023-08-23T18:29:12Z

python/pyiceberg/table/__init__.py

+            new_fields = self.adds[field_id]
+            if len(new_fields) > 0:
+                fields = _ApplyChanges.add_fields(field_result.fields, new_fields)
+                if len(fields) > 0:


I think this should be simpler:

    new_fields = _ApplyChanges.add_fields(field_result.fields, self.adds.get(field_id))
    if new_fields is None:
        return field_result
    return StructType(*new_fields)

Why no walrus?

    if new_fields := _ApplyChanges.add_fields(field_result.fields, self.adds.get(field_id, [])):
        return field_result
    return StructType(*new_fields)
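As quoted, the walrus variant returns `field_result` when `new_fields` is truthy, which looks inverted relative to the suggestion it replies to. A sketch of the likely intent, with plain strings standing in for `NestedField` and hypothetical `add_fields` and `apply_adds` helpers:

```python
from typing import Dict, List, Optional, Tuple


def add_fields(fields: Tuple[str, ...], adds: Optional[List[str]]) -> Optional[Tuple[str, ...]]:
    """Return the combined fields, or None when there is nothing to add."""
    return (*fields, *adds) if adds else None


def apply_adds(fields: Tuple[str, ...], adds_by_parent: Dict[int, List[str]], field_id: int) -> Tuple[str, ...]:
    # Walrus form: bind the result and test it in a single expression.
    if (new_fields := add_fields(fields, adds_by_parent.get(field_id))) is None:
        return fields  # nothing to add, keep the original fields
    return new_fields
```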

rdblue · 2023-08-23T18:30:11Z

python/pyiceberg/table/__init__.py

+        new_fields: List[NestedField] = []
+        for i in range(len(field_results)):
+            type_: Optional[IcebergType] = field_results[i]
+            if type_ is None:


This is implementing drop column. I'm not sure it makes sense to have part of the code, but not implement it.

Yes, this also confused me a bit

rdblue · 2023-08-23T18:31:41Z

python/pyiceberg/table/__init__.py

+                continue
+
+            field: NestedField = struct.fields[i]
+            new_fields.append(field)


Shouldn't this use the type from field_results?

rdblue · 2023-08-23T18:33:58Z

python/pyiceberg/table/__init__.py

+                has_change = True
+                continue
+
+            field: NestedField = struct.fields[i]


@Fokko, we also need to be careful in places like this. This lookup is incorrect because it is matching the field results with the original fields by position. That must be done by field ID.

Hi @rdblue, there is something I don't understand. The field_results are obtained from the struct, and when we get None the loop continues. It seems that looking the field up by position does not cause confusion?
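One way to read the concern is that a lookup keyed by field ID stays correct even if the results and the original fields ever stop lining up by position. The sketch below assumes results are available as a mapping from field ID to the new type; `Field` and `rebuild` are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:  # hypothetical, simplified NestedField
    field_id: int
    name: str
    field_type: str


def rebuild(fields: Tuple[Field, ...], results_by_id: Dict[int, Optional[str]]) -> Tuple[Field, ...]:
    """Rebuild the struct's fields using the result type matched by field ID."""
    new_fields: List[Field] = []
    for field in fields:
        result = results_by_id.get(field.field_id)
        if result is None:
            continue  # a missing or None result is treated as a dropped field
        # Use the type from the results, not the original field's type.
        new_fields.append(Field(field.field_id, field.name, result))
    return tuple(new_fields)
```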

rdblue · 2023-08-23T18:34:41Z

python/pyiceberg/table/__init__.py

+
+    def field(self, field: NestedField, field_result: IcebergType) -> IcebergType:
+        field_id: int = field.field_id
+        if field_id in self.adds:


Should this also check that the field is a struct?

That's a good check, but I think it went okay since _internal_add_column only creates mappings where the parent is a struct.

github-actions bot added the python label Jul 28, 2023

hililiwei mentioned this pull request Jul 28, 2023

Python: Add update schema #7875

Closed

Fokko reviewed Aug 1, 2023

hililiwei force-pushed the p_add_column branch 6 times, most recently from 78842c7 to 1d50871 Compare August 3, 2023 08:58

hililiwei commented Aug 3, 2023

python/tests/test_schema.py

hililiwei requested a review from Fokko August 5, 2023 02:02

Fokko mentioned this pull request Aug 8, 2023

Support commit operations in pyiceberg #7259

Closed

hililiwei force-pushed the p_add_column branch 2 times, most recently from 9ad74c1 to b383de6 Compare August 9, 2023 03:09

Fokko reviewed Aug 9, 2023

hililiwei force-pushed the p_add_column branch 2 times, most recently from 04e4dcf to 139039b Compare August 11, 2023 09:17

Fokko reviewed Aug 14, 2023

hililiwei force-pushed the p_add_column branch 5 times, most recently from 6d376c4 to 6d90d6f Compare August 21, 2023 12:59

hililiwei and others added 2 commits August 22, 2023 14:30

Python: Support add column

90bc51b

Add integration tests (#264)

f6b0ad4

hililiwei force-pushed the p_add_column branch from 6d90d6f to 449f23c Compare August 22, 2023 08:06

Python: Support add column

32db8b1

hililiwei force-pushed the p_add_column branch from 449f23c to 32db8b1 Compare August 22, 2023 08:17

Add the requirement (#265)

0258046

Python: Support add column

d42b1c7

Fokko approved these changes Aug 22, 2023

Fokko merged commit b7fb007 into apache:master Aug 22, 2023

rdblue reviewed Aug 22, 2023

rdblue reviewed Aug 23, 2023

python/pyiceberg/table/__init__.py

rdblue reviewed Aug 23, 2023
