Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compute_column_expression to pylibcudf for transform.compute_column #17279

Conversation

mroeschke
Copy link
Contributor

Description

Follow up to #16760

transform.compute_column (backing .eval) requires an Expression object created by a private routine in cudf Python. Since this routine will be needed for any user of the public transform.compute_column, moving it to pylibcudf.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package labels Nov 8, 2024
@mroeschke mroeschke requested a review from a team as a code owner November 8, 2024 01:13
@mroeschke mroeschke requested review from wence- and Matt711 November 8, 2024 01:13
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 8, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke, I think because pylibcudf expressions now keep their children alive, we can simplify this quite a bit. I've made some suggestions.

Comment on lines 336 to 341
This visitor is designed to handle AST nodes that have libcudf equivalents.
It constructs column references from names and literals from constants,
then builds up operations. The final result can be accessed using the
`expression` property. The visitor must be kept in scope for as long as the
expression is needed because all of the underlying libcudf expressions will
be destroyed when the libcudfASTVisitor is.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pylibcudf, Expression objects now hold onto their children, so all of this stack handling can be removed.

"""
visitor = libcudfASTVisitor(col_names)
visitor.visit(ast.parse(expr))
return visitor.expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note, if it had been the case that we didn't hold on to Expression children, this would have been incorrect (previously the visitor needed to remain alive as long as the expression did)

Comment on lines 350 to 351
self.stack = []
self.nodes = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.stack = []
self.nodes = []

Comment on lines 353 to 358

@property
def expression(self):
"""Expression: The result of parsing an AST."""
assert len(self.stack) == 1
return self.stack[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@property
def expression(self):
"""Expression: The result of parsing an AST."""
assert len(self.stack) == 1
return self.stack[-1]

col_id = self.col_names.index(node.id)
except ValueError:
raise ValueError(f"Unknown column name {node.id}")
self.stack.append(ColumnReference(col_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.stack.append(ColumnReference(col_id))
return ColumnReference(col_id)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we passed a mapping from column names to expressions, this could be:

def visit_Name(self, node):
    try:
        return self.column_mapping[node.id]
    except KeyError:
        raise ValueError(f"Unknown column name {node.id}")

Comment on lines 488 to 490
visitor = libcudfASTVisitor(col_names)
visitor.visit(ast.parse(expr))
return visitor.expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
visitor = libcudfASTVisitor(col_names)
visitor.visit(ast.parse(expr))
return visitor.expression
visitor = libcudfASTVisitor(col_names)
return visitor.visit(ast.parse(expr))

Parameters
----------
expr : str
The expression to evaluate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The expression to evaluate.
The expression to evaluate. In (restricted) Python syntax.

Comment on lines 403 to 441
def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops):
# Helper function handling the common components of parsing BoolOp and
# Compare AST nodes. These two types of nodes both support chaining
# (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this
# function helps standardize that.

# TODO: Whether And/Or and BitAnd/BitOr actually correspond to
# logical or bitwise operators depends on the data types that they
# are applied to. We'll need to add logic to map to that.
inner_ops = []
for op, (left, right) in zip(operators, operands):
# Note that this will lead to duplicate nodes, e.g. if
# the comparison is `a < b < c` that will be encoded as
# `a < b and b < c`. We could potentially optimize by caching
# expressions by name so that we only construct them once.
self.visit(left)
self.visit(right)

self.nodes.append(self.stack.pop())
self.nodes.append(self.stack.pop())

op = _python_cudf_operator_map[type(op)]
inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2]))

self.nodes.extend(inner_ops)

# If we have more than one comparator, we need to link them
# together with LOGICAL_AND operators.
if has_multiple_ops:
op = ASTOperator.LOGICAL_AND

def _combine_compare_ops(left, right):
self.nodes.append(Operation(op, left, right))
return self.nodes[-1]

functools.reduce(_combine_compare_ops, inner_ops)

self.stack.append(self.nodes[-1])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops):
# Helper function handling the common components of parsing BoolOp and
# Compare AST nodes. These two types of nodes both support chaining
# (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this
# function helps standardize that.
# TODO: Whether And/Or and BitAnd/BitOr actually correspond to
# logical or bitwise operators depends on the data types that they
# are applied to. We'll need to add logic to map to that.
inner_ops = []
for op, (left, right) in zip(operators, operands):
# Note that this will lead to duplicate nodes, e.g. if
# the comparison is `a < b < c` that will be encoded as
# `a < b and b < c`. We could potentially optimize by caching
# expressions by name so that we only construct them once.
self.visit(left)
self.visit(right)
self.nodes.append(self.stack.pop())
self.nodes.append(self.stack.pop())
op = _python_cudf_operator_map[type(op)]
inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2]))
self.nodes.extend(inner_ops)
# If we have more than one comparator, we need to link them
# together with LOGICAL_AND operators.
if has_multiple_ops:
op = ASTOperator.LOGICAL_AND
def _combine_compare_ops(left, right):
self.nodes.append(Operation(op, left, right))
return self.nodes[-1]
functools.reduce(_combine_compare_ops, inner_ops)
self.stack.append(self.nodes[-1])



@functools.lru_cache(256)
def compute_column_expression(str expr, tuple col_names):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name suggests what the expression can be used for, but is it a good one? Note that if we used ColumnNameReference rather than ColumnReference in the mapping of column names, the result could be used in parquet filtering as well.

So perhaps that suggests something like:

def to_expression(str expr, dict column_mapping):

Where column_mapping is either a dict[str, ColumnReference] or a dict[str, ColumnNameReference].

Expression
Cached Expression for the given expr and col_names
"""
visitor = libcudfASTVisitor(col_names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing libcudf from the name. ExpressionTransformer?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I provided some confusing suggestions around column names.

@@ -77,3 +77,5 @@ class Operation(Expression):
left: Expression,
right: Expression | None = None,
): ...

def to_expression(expr: str, column_names: tuple[str]) -> Expression: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def to_expression(expr: str, column_names: tuple[str]) -> Expression: ...
def to_expression(expr: str, column_names: tuple[str, ...]) -> Expression: ...

A tuple[str] is a single-element tuple

@vyasr
Copy link
Contributor

vyasr commented Nov 13, 2024

I suspect part of your problem (and a cause of at least some of the seg faults) is that you are relying on visit_* returning a value, which is fine as long as you cover every node type. Unfortunately, we do not do that, so you fall back to the NodeVisitor's generic_visit implementation, which does not return (it can't because it doesn't know how to combine the return values from visiting each of its children). IOW ultimately I think your visitor.visit call is returning None, not the value that you want.

@wence-
Copy link
Contributor

wence- commented Nov 14, 2024

I suspect part of your problem (and a cause of at least some of the seg faults) is that you are relying on visit_* returning a value, which is fine as long as you cover every node type. Unfortunately, we do not do that, so you fall back to the NodeVisitor's generic_visit implementation, which does not return (it can't because it doesn't know how to combine the return values from visiting each of its children). IOW ultimately I think your visitor.visit call is returning None, not the value that you want.

That must mean that previously, the old NodeVisitor silently just didn't treat nodes that didn't have a specific visit registered (other than to traverse into children). How can that even work?

@vyasr
Copy link
Contributor

vyasr commented Nov 15, 2024

Lawrence raises a good point here. The main thing that the old code handles that the new code does not is cases where simply naively traversing down into unhandled nodes and parsing their children produces answers. That is important because in cases like df.eval("a") you get something like this

>>> ast.dump(ast.parse("a"))
"Module(body=[Expr(value=Name(id='a', ctx=Load()))], type_ignores=[])"

If you generically traverse down, everything works because:

  • Module always has a single field (body)
  • For expressions of the sort we care about, body always has length one.
  • The only member is always an Expr node
  • Expr nodes have a single Value
  • You wind up getting a Name

It turns out that this logic correctly handles the expressions that we care about. Unfortunately, as Lawrence points out it is also very lenient, so for instance this is also currently valid:

>>> df = cudf.DataFrame({'a': [1], 'b': [3]})
>>> df.eval('def f(): a')
0    1
dtype: int64

If we want to tighten this up, we should probably try to rewrite the parser in place to use return values first, and then port to pylyibcudf.

@mroeschke do you feel comfortable doing that? If not, let me know and I'm happy to help out here.

Explicitly enumerate the expressions we handle so we don't pass
through invalid expressions and just pick out the bits we do
understand.

Previously

    df.eval("def f(): a*b")

would be treated the same as

    df.eval("a*b")

Now we instead raise a ValueError.
@wence-
Copy link
Contributor

wence- commented Nov 15, 2024

Lawrence raises a good point here. The main thing that the old code handles that the new code does not is cases where simply naively traversing down into unhandled nodes and parsing their children produces answers. That is important because in cases like df.eval("a") you get something like this

>>> ast.dump(ast.parse("a"))
"Module(body=[Expr(value=Name(id='a', ctx=Load()))], type_ignores=[])"

If you generically traverse down, everything works because:

* Module always has a single field (`body`)

* For expressions of the sort we care about, body always has length one.

* The only member is always an Expr node

* Expr nodes have a single Value

* You wind up getting a Name

It turns out that this logic correctly handles the expressions that we care about. Unfortunately, as Lawrence points out it is also very lenient, so for instance this is also currently valid:

>>> df = cudf.DataFrame({'a': [1], 'b': [3]})
>>> df.eval('def f(): a')
0    1
dtype: int64

If we want to tighten this up, we should probably try to rewrite the parser in place to use return values first, and then port to pylyibcudf.

@mroeschke do you feel comfortable doing that? If not, let me know and I'm happy to help out here.

Pushed some more fixes that hopefully DTRT now.

@wence- wence- requested a review from vyasr November 15, 2024 16:09
@wence-
Copy link
Contributor

wence- commented Nov 15, 2024

@vyasr I requested a review because although I have said "OK", I have also now written all the code.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh geez I'm sorry Lawrence, I thought after the last meeting's discussion that I was on the hook for writing this and I just hadn't gotten to it yet. Thanks for taking this on. Yes, the new implementation looks correct. I like that you've made the generic visitor an error case as well so that we can easily catch unhandled nodes in the future.

@vyasr
Copy link
Contributor

vyasr commented Nov 20, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7158ee0 into rapidsai:branch-24.12 Nov 20, 2024
102 checks passed
@mroeschke mroeschke deleted the plc/transform/compute_column_expression branch November 20, 2024 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants