-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: Refactor unary and set expressions #5362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Refactor unary and set expressions #5362
Conversation
python/pyiceberg/expressions/base.py
Outdated
| ) | ||
|
|
||
|
|
||
| # TODO: this should still exist as a parent, but term should be abstract |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UnaryPredicate and BoundUnaryPredicate classes introduced in this PR should implement the UnboundPredicate and BoundPredicate interfaces, but I didn't want to refactor the entire set of expressions in this PR so I added this TODO as a reminder.
|
|
||
| @dataclass(frozen=True) | ||
| class AlwaysTrue(BooleanExpression, ABC, Singleton): | ||
| class AlwaysTrue(BooleanExpression, Singleton): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't actually abstract.
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rdblue looking good! It is a bit more verbose than the previous approach, but the mypy errors shows that it is able to infer types 👍🏻
|
|
||
| class BoundIsNaN(BoundUnaryPredicate[T]): | ||
| def __new__(cls, term: BoundTerm[T]): | ||
| bound_type = term.ref().field.field_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this is done, by checking this in the __new__ method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure about this. There are mypy errors if you declare the return type from __new__ to be correct. This is from In.__new__ for example:
451: error: Incompatible return type for "__new__" (returns "BooleanExpression", but must return a subtype of "BoundIn[Any]")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the __new__ expects to return to the class it belongs to. I think we can ignore this one because I like the trick of returning a different type very much.
python/pyiceberg/expressions/base.py
Outdated
| def bind(self, schema: Schema, case_sensitive: bool) -> BoundNotIn[T]: | ||
| bound_ref = self.term.bind(schema, case_sensitive) | ||
| return BoundNotIn(bound_ref, *tuple(lit.to(bound_ref.field.field_type) for lit in self.literals)) # type: ignore | ||
| def __new__(cls, term: UnboundTerm[T], literals: tuple[Literal[T], ...]) -> BooleanExpression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def __new__(cls, term: UnboundTerm[T], literals: tuple[Literal[T], ...]) -> BooleanExpression: | |
| def __new__(cls, term: UnboundTerm[T], literals: *Literal[T]) -> BooleanExpression: |
See the comment at class In(SetPredicate[T]):
python/pyiceberg/expressions/base.py
Outdated
| def __invert__(self) -> BoundNotIn[T]: | ||
| return BoundNotIn(self.term, *self.literals) | ||
| class BoundIn(BoundSetPredicate[T]): | ||
| def __new__(cls, term: BoundTerm[T], literals: set[Literal[T]]) -> BooleanExpression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also consider wrapping it into positional arguments:
class BoundIn(BoundSetPredicate[T]):
def __new__(cls, term: BoundTerm[T], *literals: Literal[T]) -> BooleanExpression:
literals_set = set(literals)
count = len(literals)
if count == 0:
return AlwaysFalse()
if count == 1:
return BoundEq(term, literals.pop())
else:
return super().__new__(cls)
def __invert__(self) -> BooleanExpression:
return BoundNotIn(self.term, self.literals)This way we also don't have to care about mutating the set. See comment below 👍🏻
This way we don't have to pass in a set:
BoundIn[str](
base.BoundReference(
field=NestedField(field_id=1, name="foo", field_type=StringType(), required=False),
accessor=Accessor(position=0, inner=None),
),
StringLiteral("foo"), StringLiteral("bar"), StringLiteral("baz"),
)Instead of:
BoundIn[str](
base.BoundReference(
field=NestedField(field_id=1, name="foo", field_type=StringType(), required=False),
accessor=Accessor(position=0, inner=None),
),
{StringLiteral("foo"), StringLiteral("bar"), StringLiteral("baz")},
)
samredai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments but otherwise this LGTM!
python/pyiceberg/expressions/base.py
Outdated
|
|
||
| @dataclass(frozen=True) | ||
| class BoundNotNull(BoundUnaryPredicate[T]): | ||
| def __new__(cls, term: BoundTerm[T]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pylint for the 3.8 CI seems to not like that the argument signature is different between this and the default super().__new__. You can change this to match the *args, **kwargs from super, but then you'd have to pull out the kwargs and it won't look as clean plus won't allow positional arguments.
@dataclass(frozen=True)
class BoundNotNull(BoundUnaryPredicate[T]):
def __new__(cls, *args, **kwargs):
term: BoundTerm[T] = kwargs["term"]
if term.ref().field.required:
return AlwaysTrue()
return super().__new__(cls)This is just a linting error though and the assumptions made when using the super type hold well so I think it'd be better to just ignore it for these lines.
@dataclass(frozen=True)
class BoundNotNull(BoundUnaryPredicate[T]):
def __new__(cls, term: BoundTerm[T]): # pylint: disable=W0221There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I suppressed the false positives.
| return IsNull(self.term) | ||
| @dataclass(frozen=True) | ||
| class UnaryPredicate(UnboundPredicate[T]): | ||
| def bind(self, schema: Schema, case_sensitive: bool = True) -> BooleanExpression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this return type be BoundUnaryPredicate? Same question for SetPredicate.bind(...) and LiteralPredicate.bind(...) which can have return types of BoundSetPredicate and BoundLiteralPredicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Binding may produce a different type of expression. For example, if IsNull is bound to a required type, binding will return AlwaysFalse
|
|
||
|
|
||
| def test_bind_dedup(request): | ||
| schema = request.getfixturevalue("table_schema_simple") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely not required but instead of using request here you can just use the fixture name and pytest will automatically grab it for you.
def test_bind_dedup(table_schema_simple):
bound = base.BoundIn(
base.BoundReference(table_schema_simple.find_field(1), table_schema_simple.accessor_for_field(1)), {literal("hello"), literal("world")}
)
assert base.In(base.Reference("foo"), (literal("hello"), literal("world"), literal("world"))).bind(table_schema_simple) == bound
This refactors unary and set expressions to use base classes. In addition, this updates binding to produce
AlwaysTrue()orAlwaysFalse()values based on the bound type. For example,IsNull(a)when bound to a required field a will produceAlwaysFalse()because the value is required.This also adds missing tests.