Add compute_column_expression to pylibcudf for transform.compute_column #17279

mroeschke · 2024-11-08T01:13:54Z

Description

Follow up to #16760

transform.compute_column (backing .eval) requires an Expression object created by a private routine in cudf Python. Since this routine will be needed for any user of the public transform.compute_column, moving it to pylibcudf.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence-

Thanks @mroeschke, I think because pylibcudf expressions now keep their children alive, we can simplify this quite a bit. I've made some suggestions.

wence- · 2024-11-08T09:02:13Z

python/pylibcudf/pylibcudf/expressions.pyx

+    This visitor is designed to handle AST nodes that have libcudf equivalents.
+    It constructs column references from names and literals from constants,
+    then builds up operations. The final result can be accessed using the
+    `expression` property. The visitor must be kept in scope for as long as the
+    expression is needed because all of the underlying libcudf expressions will
+    be destroyed when the libcudfASTVisitor is.


In pylibcudf, Expression objects now hold onto their children, so all of this stack handling can be removed.

wence- · 2024-11-08T09:03:05Z

python/pylibcudf/pylibcudf/expressions.pyx

+    """
+    visitor = libcudfASTVisitor(col_names)
+    visitor.visit(ast.parse(expr))
+    return visitor.expression


Just to note, if it had been the case that we didn't hold on to Expression children, this would have been incorrect (previously the visitor needed to remain alive as long as the expression did)

wence- · 2024-11-08T09:03:17Z

python/pylibcudf/pylibcudf/expressions.pyx

+        self.stack = []
+        self.nodes = []


Suggested change

self.stack = []

self.nodes = []

wence- · 2024-11-08T09:03:28Z

python/pylibcudf/pylibcudf/expressions.pyx

+
+    @property
+    def expression(self):
+        """Expression: The result of parsing an AST."""
+        assert len(self.stack) == 1
+        return self.stack[-1]


Suggested change

@property

def expression(self):

"""Expression: The result of parsing an AST."""

assert len(self.stack) == 1

return self.stack[-1]

wence- · 2024-11-08T09:03:51Z

python/pylibcudf/pylibcudf/expressions.pyx

+            col_id = self.col_names.index(node.id)
+        except ValueError:
+            raise ValueError(f"Unknown column name {node.id}")
+        self.stack.append(ColumnReference(col_id))


Suggested change

self.stack.append(ColumnReference(col_id))

return ColumnReference(col_id)

If we passed a mapping from column names to expressions, this could be:

def visit_Name(self, node): try: return self.column_mapping[node.id] except KeyError: raise ValueError(f"Unknown column name {node.id}")

wence- · 2024-11-08T09:18:37Z

python/pylibcudf/pylibcudf/expressions.pyx

+    visitor = libcudfASTVisitor(col_names)
+    visitor.visit(ast.parse(expr))
+    return visitor.expression


Suggested change

visitor = libcudfASTVisitor(col_names)

visitor.visit(ast.parse(expr))

return visitor.expression

visitor = libcudfASTVisitor(col_names)

return visitor.visit(ast.parse(expr))

wence- · 2024-11-08T09:19:06Z

python/pylibcudf/pylibcudf/expressions.pyx

+    Parameters
+    ----------
+    expr : str
+        The expression to evaluate.


Suggested change

The expression to evaluate.

The expression to evaluate. In (restricted) Python syntax.

wence- · 2024-11-08T09:20:09Z

python/pylibcudf/pylibcudf/expressions.pyx

+    def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops):
+        # Helper function handling the common components of parsing BoolOp and
+        # Compare AST nodes. These two types of nodes both support chaining
+        # (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this
+        # function helps standardize that.
+
+        # TODO: Whether And/Or and BitAnd/BitOr actually correspond to
+        # logical or bitwise operators depends on the data types that they
+        # are applied to. We'll need to add logic to map to that.
+        inner_ops = []
+        for op, (left, right) in zip(operators, operands):
+            # Note that this will lead to duplicate nodes, e.g. if
+            # the comparison is `a < b < c` that will be encoded as
+            # `a < b and b < c`. We could potentially optimize by caching
+            # expressions by name so that we only construct them once.
+            self.visit(left)
+            self.visit(right)
+
+            self.nodes.append(self.stack.pop())
+            self.nodes.append(self.stack.pop())
+
+            op = _python_cudf_operator_map[type(op)]
+            inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2]))
+
+        self.nodes.extend(inner_ops)
+
+        # If we have more than one comparator, we need to link them
+        # together with LOGICAL_AND operators.
+        if has_multiple_ops:
+            op = ASTOperator.LOGICAL_AND
+
+            def _combine_compare_ops(left, right):
+                self.nodes.append(Operation(op, left, right))
+                return self.nodes[-1]
+
+            functools.reduce(_combine_compare_ops, inner_ops)
+
+        self.stack.append(self.nodes[-1])
+


Suggested change

def _visit_BoolOp_Compare(self, operators, operands, has_multiple_ops):

# Helper function handling the common components of parsing BoolOp and

# Compare AST nodes. These two types of nodes both support chaining

# (e.g. `a > b > c` is equivalent to `a > b and b > c`, so this

# function helps standardize that.

# TODO: Whether And/Or and BitAnd/BitOr actually correspond to

# logical or bitwise operators depends on the data types that they

# are applied to. We'll need to add logic to map to that.

inner_ops = []

for op, (left, right) in zip(operators, operands):

# Note that this will lead to duplicate nodes, e.g. if

# the comparison is `a < b < c` that will be encoded as

# `a < b and b < c`. We could potentially optimize by caching

# expressions by name so that we only construct them once.

self.visit(left)

self.visit(right)

self.nodes.append(self.stack.pop())

self.nodes.append(self.stack.pop())

op = _python_cudf_operator_map[type(op)]

inner_ops.append(Operation(op, self.nodes[-1], self.nodes[-2]))

self.nodes.extend(inner_ops)

# If we have more than one comparator, we need to link them

# together with LOGICAL_AND operators.

if has_multiple_ops:

op = ASTOperator.LOGICAL_AND

def _combine_compare_ops(left, right):

self.nodes.append(Operation(op, left, right))

return self.nodes[-1]

functools.reduce(_combine_compare_ops, inner_ops)

self.stack.append(self.nodes[-1])

wence- · 2024-11-08T09:31:37Z

python/pylibcudf/pylibcudf/expressions.pyx

+
+
+@functools.lru_cache(256)
+def compute_column_expression(str expr, tuple col_names):


This name suggests what the expression can be used for, but is it a good one? Note that if we used ColumnNameReference rather than ColumnReference in the mapping of column names, the result could be used in parquet filtering as well.

So perhaps that suggests something like:

def to_expression(str expr, dict column_mapping):

Where column_mapping is either a dict[str, ColumnReference] or a dict[str, ColumnNameReference].

wence- · 2024-11-08T09:32:11Z

python/pylibcudf/pylibcudf/expressions.pyx

+    Expression
+        Cached Expression for the given expr and col_names
+    """
+    visitor = libcudfASTVisitor(col_names)


How about removing libcudf from the name. ExpressionTransformer?

…rm/compute_column_expression

wence-

Sorry, I provided some confusing suggestions around column names.

python/cudf/cudf/_lib/transform.pyx

python/cudf/cudf/core/dataframe.py

python/pylibcudf/pylibcudf/expressions.pyx

…rm/compute_column_expression

python/pylibcudf/pylibcudf/expressions.pyx

wence- · 2024-11-12T23:04:05Z

python/pylibcudf/pylibcudf/expressions.pyi

@@ -77,3 +77,5 @@ class Operation(Expression):
        left: Expression,
        right: Expression | None = None,
    ): ...
+
+def to_expression(expr: str, column_names: tuple[str]) -> Expression: ...


Suggested change

def to_expression(expr: str, column_names: tuple[str]) -> Expression: ...

def to_expression(expr: str, column_names: tuple[str, ...]) -> Expression: ...

A tuple[str] is a single-element tuple

…rm/compute_column_expression

python/pylibcudf/pylibcudf/expressions.pyx

vyasr · 2024-11-13T18:26:01Z

I suspect part of your problem (and a cause of at least some of the seg faults) is that you are relying on visit_* returning a value, which is fine as long as you cover every node type. Unfortunately, we do not do that, so you fall back to the NodeVisitor's generic_visit implementation, which does not return (it can't because it doesn't know how to combine the return values from visiting each of its children). IOW ultimately I think your visitor.visit call is returning None, not the value that you want.

wence- · 2024-11-14T15:48:29Z

I suspect part of your problem (and a cause of at least some of the seg faults) is that you are relying on visit_* returning a value, which is fine as long as you cover every node type. Unfortunately, we do not do that, so you fall back to the NodeVisitor's generic_visit implementation, which does not return (it can't because it doesn't know how to combine the return values from visiting each of its children). IOW ultimately I think your visitor.visit call is returning None, not the value that you want.

That must mean that previously, the old NodeVisitor silently just didn't treat nodes that didn't have a specific visit registered (other than to traverse into children). How can that even work?

vyasr · 2024-11-15T01:36:38Z

Lawrence raises a good point here. The main thing that the old code handles that the new code does not is cases where simply naively traversing down into unhandled nodes and parsing their children produces answers. That is important because in cases like df.eval("a") you get something like this

>>> ast.dump(ast.parse("a"))
"Module(body=[Expr(value=Name(id='a', ctx=Load()))], type_ignores=[])"

If you generically traverse down, everything works because:

Module always has a single field (body)
For expressions of the sort we care about, body always has length one.
The only member is always an Expr node
Expr nodes have a single Value
You wind up getting a Name

It turns out that this logic correctly handles the expressions that we care about. Unfortunately, as Lawrence points out it is also very lenient, so for instance this is also currently valid:

>>> df = cudf.DataFrame({'a': [1], 'b': [3]})
>>> df.eval('def f(): a')
0    1
dtype: int64

If we want to tighten this up, we should probably try to rewrite the parser in place to use return values first, and then port to pylyibcudf.

@mroeschke do you feel comfortable doing that? If not, let me know and I'm happy to help out here.

Explicitly enumerate the expressions we handle so we don't pass through invalid expressions and just pick out the bits we do understand. Previously df.eval("def f(): a*b") would be treated the same as df.eval("a*b") Now we instead raise a ValueError.

wence- · 2024-11-15T11:57:40Z

Lawrence raises a good point here. The main thing that the old code handles that the new code does not is cases where simply naively traversing down into unhandled nodes and parsing their children produces answers. That is important because in cases like df.eval("a") you get something like this
>>> ast.dump(ast.parse("a"))
"Module(body=[Expr(value=Name(id='a', ctx=Load()))], type_ignores=[])"
If you generically traverse down, everything works because:
* Module always has a single field (`body`)

* For expressions of the sort we care about, body always has length one.

* The only member is always an Expr node

* Expr nodes have a single Value

* You wind up getting a Name
It turns out that this logic correctly handles the expressions that we care about. Unfortunately, as Lawrence points out it is also very lenient, so for instance this is also currently valid:
>>> df = cudf.DataFrame({'a': [1], 'b': [3]})
>>> df.eval('def f(): a')
0    1
dtype: int64
If we want to tighten this up, we should probably try to rewrite the parser in place to use return values first, and then port to pylyibcudf.

@mroeschke do you feel comfortable doing that? If not, let me know and I'm happy to help out here.

Pushed some more fixes that hopefully DTRT now.

wence- · 2024-11-15T16:10:01Z

@vyasr I requested a review because although I have said "OK", I have also now written all the code.

vyasr

Oh geez I'm sorry Lawrence, I thought after the last meeting's discussion that I was on the hook for writing this and I just hadn't gotten to it yet. Thanks for taking this on. Yes, the new implementation looks correct. I like that you've made the generic visitor an error case as well so that we can easily catch unhandled nodes in the future.

vyasr · 2024-11-20T00:17:22Z

/merge

Move cudf._lib.expressions to pylibcudf for transform.compute_column

2d2bff9

mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package labels Nov 8, 2024

mroeschke requested a review from a team as a code owner November 8, 2024 01:13

mroeschke requested review from wence- and Matt711 November 8, 2024 01:13

github-actions bot added the Python Affects Python cuDF API. label Nov 8, 2024

github-actions bot assigned mroeschke Nov 8, 2024

wence- requested changes Nov 8, 2024

View reviewed changes

mroeschke added 2 commits November 8, 2024 15:36

Merge remote-tracking branch 'upstream/branch-24.12' into plc/transfo…

c6631aa

…rm/compute_column_expression

Clean up implementation after review

9e7fddd

wence- reviewed Nov 11, 2024

View reviewed changes

Back out column mapping

a192132

github-actions bot assigned wence- Nov 11, 2024

wence- approved these changes Nov 11, 2024

View reviewed changes

mroeschke added 3 commits November 12, 2024 12:52

Merge remote-tracking branch 'upstream/branch-24.12' into plc/transfo…

63f2e0d

…rm/compute_column_expression

run pre-commit

6c33122

add stub for to_expression

a5d028a

wence- reviewed Nov 12, 2024

View reviewed changes

python/pylibcudf/pylibcudf/expressions.pyx Show resolved Hide resolved

wence- reviewed Nov 12, 2024

View reviewed changes

mroeschke added 3 commits November 12, 2024 15:27

Merge remote-tracking branch 'upstream/branch-24.12' into plc/transfo…

f8667ce

…rm/compute_column_expression

Run pre-commit --all-files

a17ce44

Fix annotation and add to all

7f780e3

wence- reviewed Nov 13, 2024

View reviewed changes

python/pylibcudf/pylibcudf/expressions.pyx Outdated Show resolved Hide resolved

wence- reviewed Nov 13, 2024

View reviewed changes

python/pylibcudf/pylibcudf/expressions.pyx Outdated Show resolved Hide resolved

Translate ops in reduction

7567ec5

wence- added 2 commits November 15, 2024 10:54

Merge branch 'branch-24.12' into HEAD

961dd24

Raise on unhandled expressions

c036273

Explicitly enumerate the expressions we handle so we don't pass through invalid expressions and just pick out the bits we do understand. Previously df.eval("def f(): a*b") would be treated the same as df.eval("a*b") Now we instead raise a ValueError.

wence- requested a review from vyasr November 15, 2024 16:09

vyasr approved these changes Nov 20, 2024

View reviewed changes

rapids-bot bot merged commit 7158ee0 into rapidsai:branch-24.12 Nov 20, 2024
102 checks passed

mroeschke deleted the plc/transform/compute_column_expression branch November 20, 2024 00:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add compute_column_expression to pylibcudf for transform.compute_column #17279

Add compute_column_expression to pylibcudf for transform.compute_column #17279

mroeschke commented Nov 8, 2024

wence- left a comment

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- Nov 8, 2024

wence- left a comment

wence- Nov 12, 2024

vyasr commented Nov 13, 2024

wence- commented Nov 14, 2024

vyasr commented Nov 15, 2024

wence- commented Nov 15, 2024

wence- commented Nov 15, 2024

vyasr left a comment

vyasr commented Nov 20, 2024

	self.stack.append(ColumnReference(col_id))
	return ColumnReference(col_id)

	The expression to evaluate.
	The expression to evaluate. In (restricted) Python syntax.



		@functools.lru_cache(256)
		def compute_column_expression(str expr, tuple col_names):

	def to_expression(expr: str, column_names: tuple[str]) -> Expression: ...
	def to_expression(expr: str, column_names: tuple[str, ...]) -> Expression: ...

Add compute_column_expression to pylibcudf for transform.compute_column #17279

Add compute_column_expression to pylibcudf for transform.compute_column #17279

Conversation

mroeschke commented Nov 8, 2024

Description

Checklist

wence- left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wence- left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr commented Nov 13, 2024

wence- commented Nov 14, 2024

vyasr commented Nov 15, 2024

wence- commented Nov 15, 2024

wence- commented Nov 15, 2024

vyasr left a comment

Choose a reason for hiding this comment

vyasr commented Nov 20, 2024