-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add compute_column_expression to pylibcudf for transform.compute_column #17279
Add compute_column_expression to pylibcudf for transform.compute_column #17279
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mroeschke, I think because pylibcudf expressions now keep their children alive, we can simplify this quite a bit. I've made some suggestions.
This visitor is designed to handle AST nodes that have libcudf equivalents. | ||
It constructs column references from names and literals from constants, | ||
then builds up operations. The final result can be accessed using the | ||
`expression` property. The visitor must be kept in scope for as long as the | ||
expression is needed because all of the underlying libcudf expressions will | ||
be destroyed when the libcudfASTVisitor is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pylibcudf, Expression
objects now hold onto their children, so all of this stack handling can be removed.
""" | ||
visitor = libcudfASTVisitor(col_names) | ||
visitor.visit(ast.parse(expr)) | ||
return visitor.expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note, if it had been the case that we didn't hold on to Expression
children, this would have been incorrect (previously the visitor
needed to remain alive as long as the expression did)
self.stack = [] | ||
self.nodes = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.stack = [] | |
self.nodes = [] |
|
||
@property | ||
def expression(self): | ||
"""Expression: The result of parsing an AST.""" | ||
assert len(self.stack) == 1 | ||
return self.stack[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@property | |
def expression(self): | |
"""Expression: The result of parsing an AST.""" | |
assert len(self.stack) == 1 | |
return self.stack[-1] |
col_id = self.col_names.index(node.id) | ||
except ValueError: | ||
raise ValueError(f"Unknown column name {node.id}") | ||
self.stack.append(ColumnReference(col_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.stack.append(ColumnReference(col_id)) | |
return ColumnReference(col_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we passed a mapping from column names to expressions, this could be:
def visit_Name(self, node):
try:
return self.column_mapping[node.id]
except KeyError:
raise ValueError(f"Unknown column name {node.id}")
visitor = libcudfASTVisitor(col_names) | ||
visitor.visit(ast.parse(expr)) | ||
return visitor.expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
visitor = libcudfASTVisitor(col_names) | |
visitor.visit(ast.parse(expr)) | |
return visitor.expression | |
visitor = libcudfASTVisitor(col_names) | |
return visitor.visit(ast.parse(expr)) |
Parameters | ||
---------- | ||
expr : str | ||
The expression to evaluate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression to evaluate. | |
The expression to evaluate. In (restricted) Python syntax. |
def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops): | ||
# Helper function handling the common components of parsing BoolOp and | ||
# Compare AST nodes. These two types of nodes both support chaining | ||
# (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this | ||
# function helps standardize that. | ||
|
||
# TODO: Whether And/Or and BitAnd/BitOr actually correspond to | ||
# logical or bitwise operators depends on the data types that they | ||
# are applied to. We'll need to add logic to map to that. | ||
inner_ops = [] | ||
for op, (left, right) in zip(operators, operands): | ||
# Note that this will lead to duplicate nodes, e.g. if | ||
# the comparison is `a < b < c` that will be encoded as | ||
# `a < b and b < c`. We could potentially optimize by caching | ||
# expressions by name so that we only construct them once. | ||
self.visit(left) | ||
self.visit(right) | ||
|
||
self.nodes.append(self.stack.pop()) | ||
self.nodes.append(self.stack.pop()) | ||
|
||
op = _python_cudf_operator_map[type(op)] | ||
inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2])) | ||
|
||
self.nodes.extend(inner_ops) | ||
|
||
# If we have more than one comparator, we need to link them | ||
# together with LOGICAL_AND operators. | ||
if has_multiple_ops: | ||
op = ASTOperator.LOGICAL_AND | ||
|
||
def _combine_compare_ops(left, right): | ||
self.nodes.append(Operation(op, left, right)) | ||
return self.nodes[-1] | ||
|
||
functools.reduce(_combine_compare_ops, inner_ops) | ||
|
||
self.stack.append(self.nodes[-1]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops): | |
# Helper function handling the common components of parsing BoolOp and | |
# Compare AST nodes. These two types of nodes both support chaining | |
# (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this | |
# function helps standardize that. | |
# TODO: Whether And/Or and BitAnd/BitOr actually correspond to | |
# logical or bitwise operators depends on the data types that they | |
# are applied to. We'll need to add logic to map to that. | |
inner_ops = [] | |
for op, (left, right) in zip(operators, operands): | |
# Note that this will lead to duplicate nodes, e.g. if | |
# the comparison is `a < b < c` that will be encoded as | |
# `a < b and b < c`. We could potentially optimize by caching | |
# expressions by name so that we only construct them once. | |
self.visit(left) | |
self.visit(right) | |
self.nodes.append(self.stack.pop()) | |
self.nodes.append(self.stack.pop()) | |
op = _python_cudf_operator_map[type(op)] | |
inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2])) | |
self.nodes.extend(inner_ops) | |
# If we have more than one comparator, we need to link them | |
# together with LOGICAL_AND operators. | |
if has_multiple_ops: | |
op = ASTOperator.LOGICAL_AND | |
def _combine_compare_ops(left, right): | |
self.nodes.append(Operation(op, left, right)) | |
return self.nodes[-1] | |
functools.reduce(_combine_compare_ops, inner_ops) | |
self.stack.append(self.nodes[-1]) |
|
||
|
||
@functools.lru_cache(256) | ||
def compute_column_expression(str expr, tuple col_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name suggests what the expression can be used for, but is it a good one? Note that if we used ColumnNameReference
rather than ColumnReference
in the mapping of column names, the result could be used in parquet filtering as well.
So perhaps that suggests something like:
def to_expression(str expr, dict column_mapping):
Where column_mapping
is either a dict[str, ColumnReference]
or a dict[str, ColumnNameReference]
.
Expression | ||
Cached Expression for the given expr and col_names | ||
""" | ||
visitor = libcudfASTVisitor(col_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about removing libcudf
from the name. ExpressionTransformer
?
…rm/compute_column_expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I provided some confusing suggestions around column names.
…rm/compute_column_expression
@@ -77,3 +77,5 @@ class Operation(Expression): | |||
left: Expression, | |||
right: Expression | None = None, | |||
): ... | |||
|
|||
def to_expression(expr: str, column_names: tuple[str]) -> Expression: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def to_expression(expr: str, column_names: tuple[str]) -> Expression: ... | |
def to_expression(expr: str, column_names: tuple[str, ...]) -> Expression: ... |
A tuple[str] is a single-element tuple
I suspect part of your problem (and a cause of at least some of the seg faults) is that you are relying on |
That must mean that previously, the old |
Lawrence raises a good point here. The main thing that the old code handles that the new code does not is cases where simply naively traversing down into unhandled nodes and parsing their children produces answers. That is important because in cases like
If you generically traverse down, everything works because:
It turns out that this logic correctly handles the expressions that we care about. Unfortunately, as Lawrence points out it is also very lenient, so for instance this is also currently valid:
If we want to tighten this up, we should probably try to rewrite the parser in place to use return values first, and then port to pylyibcudf. @mroeschke do you feel comfortable doing that? If not, let me know and I'm happy to help out here. |
Explicitly enumerate the expressions we handle so we don't pass through invalid expressions and just pick out the bits we do understand. Previously df.eval("def f(): a*b") would be treated the same as df.eval("a*b") Now we instead raise a ValueError.
Pushed some more fixes that hopefully DTRT now. |
@vyasr I requested a review because although I have said "OK", I have also now written all the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh geez I'm sorry Lawrence, I thought after the last meeting's discussion that I was on the hook for writing this and I just hadn't gotten to it yet. Thanks for taking this on. Yes, the new implementation looks correct. I like that you've made the generic visitor an error case as well so that we can easily catch unhandled nodes in the future.
/merge |
Description
Follow up to #16760
transform.compute_column
(backing.eval
) requires anExpression
object created by a private routine in cudf Python. Since this routine will be needed for any user of the publictransform.compute_column
, moving it to pylibcudf.Checklist