Python: Refactor unary and set expressions #5362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

danielcweeks merged 9 commits into apache:master from rdblue:python-refactor-unary-types

Jul 29, 2022

Contributor

rdblue commented Jul 26, 2022 •

edited

Loading

This refactors unary and set expressions to use base classes. In addition, this updates binding to produce AlwaysTrue() or AlwaysFalse() values based on the bound type. For example, IsNull(a) when bound to a required field a will produce AlwaysFalse() because the value is required.

This also adds missing tests.


          Python: Refactor unary expressions.

d6f31a6

rdblue requested a review from samredai

July 26, 2022 22:38

github-actions bot added the python label

rdblue commented

View reviewed changes

python/pyiceberg/expressions/base.py Outdated

		)


		# TODO: this should still exist as a parent, but term should be abstract

Contributor Author

rdblue Jul 26, 2022

The UnaryPredicate and BoundUnaryPredicate classes introduced in this PR should implement the UnboundPredicate and BoundPredicate interfaces, but I didn't want to refactor the entire set of expressions in this PR so I added this TODO as a reminder.

rdblue commented

View reviewed changes

python/pyiceberg/expressions/base.py

               @dataclass(frozen=True)
-              class AlwaysTrue(BooleanExpression, ABC, Singleton):
+              class AlwaysTrue(BooleanExpression, Singleton):

Contributor Author

rdblue Jul 26, 2022

These aren't actually abstract.

rdblue commented

View reviewed changes

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved


          Python: Add SetPredicate and BoundSetPredicate to simply in cases.

03b622a

Fokko reviewed

View reviewed changes

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

rdblue added 2 commits

July 27, 2022 11:27


          Update for review comments.

c0416c2


          Fix type checks.

ed6f679

rdblue changed the title ~~Python: Refactor unary expressions~~ Python: Refactor unary and set expressions


          Fix linting.

5eaafff

Fokko reviewed

View reviewed changes

Contributor

Fokko left a comment

Hey @rdblue looking good! It is a bit more verbose than the previous approach, but the mypy errors shows that it is able to infer types 👍🏻

python/pyiceberg/expressions/base.py

+              class BoundIsNaN(BoundUnaryPredicate[T]):
+                  def __new__(cls, term: BoundTerm[T]):
+                      bound_type = term.ref().field.field_type

Contributor

Fokko Jul 28, 2022

I like how this is done, by checking this in the __new__ method.

Contributor Author

rdblue Jul 28, 2022

I'm actually not sure about this. There are mypy errors if you declare the return type from __new__ to be correct. This is from In.__new__ for example:

451: error: Incompatible return type for "__new__" (returns "BooleanExpression", but must return a subtype of "BoundIn[Any]")

Contributor

Fokko Jul 28, 2022

Yes, the __new__ expects to return to the class it belongs to. I think we can ignore this one because I like the trick of returning a different type very much.

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated

-                  def bind(self, schema: Schema, case_sensitive: bool) -> BoundNotIn[T]:
-                      bound_ref = self.term.bind(schema, case_sensitive)
-                      return BoundNotIn(bound_ref, *tuple(lit.to(bound_ref.field.field_type) for lit in self.literals))  # type: ignore
+                  def __new__(cls, term: UnboundTerm[T], literals: tuple[Literal[T], ...]) -> BooleanExpression:

Contributor

Fokko Jul 28, 2022

Suggested change

      
                def __new__(cls, term: UnboundTerm[T], literals: tuple[Literal[T], ...]) -> BooleanExpression:
          
                def __new__(cls, term: UnboundTerm[T], literals: *Literal[T]) -> BooleanExpression:

See the comment at class In(SetPredicate[T]):

python/pyiceberg/expressions/base.py Outdated Show resolved Hide resolved

python/pyiceberg/expressions/base.py Show resolved Hide resolved

python/pyiceberg/expressions/base.py Outdated

-                  def __invert__(self) -> BoundNotIn[T]:
-                      return BoundNotIn(self.term, *self.literals)
+              class BoundIn(BoundSetPredicate[T]):
+                  def __new__(cls, term: BoundTerm[T], literals: set[Literal[T]]) -> BooleanExpression:

Contributor

Fokko Jul 28, 2022

We could also consider wrapping it into positional arguments:

class BoundIn(BoundSetPredicate[T]):
    def __new__(cls, term: BoundTerm[T], *literals: Literal[T]) -> BooleanExpression:
        literals_set = set(literals)
        count = len(literals)
        if count == 0:
            return AlwaysFalse()
        if count == 1:
            return BoundEq(term, literals.pop())
        else:
            return super().__new__(cls)

    def __invert__(self) -> BooleanExpression:
        return BoundNotIn(self.term, self.literals)

This way we also don't have to care about mutating the set. See comment below 👍🏻

This way we don't have to pass in a set:

BoundIn[str](
    base.BoundReference(
        field=NestedField(field_id=1, name="foo", field_type=StringType(), required=False),
        accessor=Accessor(position=0, inner=None),
    ),
    StringLiteral("foo"), StringLiteral("bar"), StringLiteral("baz"),
)

Instead of:

BoundIn[str](
    base.BoundReference(
        field=NestedField(field_id=1, name="foo", field_type=StringType(), required=False),
        accessor=Accessor(position=0, inner=None),
    ),
    {StringLiteral("foo"), StringLiteral("bar"), StringLiteral("baz")},
)

rdblue added 3 commits

July 28, 2022 11:09


          Fix mypy issues.

2bf12be


          Add LiteralPredicate and refactor.

aa73775


          Rename inequalities.

d8aa0b7

samredai reviewed

View reviewed changes

Contributor

samredai left a comment

A few small comments but otherwise this LGTM!

python/pyiceberg/expressions/base.py Outdated

+              @dataclass(frozen=True)
+              class BoundNotNull(BoundUnaryPredicate[T]):
+                  def __new__(cls, term: BoundTerm[T]):

Contributor

samredai Jul 29, 2022 •

edited

Loading

pylint for the 3.8 CI seems to not like that the argument signature is different between this and the default super().__new__. You can change this to match the *args, **kwargs from super, but then you'd have to pull out the kwargs and it won't look as clean plus won't allow positional arguments.

@dataclass(frozen=True)
class BoundNotNull(BoundUnaryPredicate[T]):
    def __new__(cls, *args, **kwargs):
        term: BoundTerm[T] = kwargs["term"]
        if term.ref().field.required:
            return AlwaysTrue()
        return super().__new__(cls)

This is just a linting error though and the assumptions made when using the super type hold well so I think it'd be better to just ignore it for these lines.

@dataclass(frozen=True)
class BoundNotNull(BoundUnaryPredicate[T]):
    def __new__(cls, term: BoundTerm[T]):  # pylint: disable=W0221

Contributor Author

rdblue Jul 29, 2022

Thanks! I suppressed the false positives.

python/pyiceberg/expressions/base.py

-                      return IsNull(self.term)
+              @dataclass(frozen=True)
+              class UnaryPredicate(UnboundPredicate[T]):
+                  def bind(self, schema: Schema, case_sensitive: bool = True) -> BooleanExpression:

Contributor

samredai Jul 29, 2022

Should this return type be BoundUnaryPredicate? Same question for SetPredicate.bind(...) and LiteralPredicate.bind(...) which can have return types of BoundSetPredicate and BoundLiteralPredicate.

Contributor Author

rdblue Jul 29, 2022

Binding may produce a different type of expression. For example, if IsNull is bound to a required type, binding will return AlwaysFalse

python/tests/expressions/test_expressions_base.py



		def test_bind_dedup(request):
		schema = request.getfixturevalue("table_schema_simple")

Contributor

samredai Jul 29, 2022

It's definitely not required but instead of using request here you can just use the fixture name and pytest will automatically grab it for you.

def test_bind_dedup(table_schema_simple):
    bound = base.BoundIn(
        base.BoundReference(table_schema_simple.find_field(1), table_schema_simple.accessor_for_field(1)), {literal("hello"), literal("world")}
    )
    assert base.In(base.Reference("foo"), (literal("hello"), literal("world"), literal("world"))).bind(table_schema_simple) == bound


          Python: Ignore false positive pylint W0221 for 3.8.

947056d

danielcweeks approved these changes

View reviewed changes

danielcweeks merged commit 0b84579 into apache:master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels