Skip to content

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Oct 1, 2025

Why are these changes needed?

Makes the following changes:

  1. Move Ray data's expression evaluator to an internal module
  2. Use the visitor pattern for native expressions
  3. Use expressions for rename_columns, with_column, select_columns and remove cols and cols_rename in Project
  4. Modify Projection Pushdown to work with combinations of the above operators correctly

Related issue number

Closes #56878, #57700

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Switch Project/column ops to named expression lists with preserve_existing, move/evolve expression evaluation to internal native Block visitor, and implement robust Project fusion/pushdown; plus Arrow/Pandas block fixes and test updates.

  • Expressions & Evaluation:
    • Move expression evaluator to _internal/planner/plan_expression/expression_evaluator.py; add native Block/BlockColumn visitor (NativeExpressionEvaluator) and centralize op maps.
    • Update all call sites to use new internal eval_expr (Arrow/Pandas blocks, UDF planner, tests).
  • Project Operator & APIs:
    • Redesign Project to accept exprs: List[Expr] with enforced names and preserve_existing flag; remove cols/cols_rename paths.
    • Rework planner (plan_project_op) to compute expressions, fill/upsert columns, and finalize schema based on preserve_existing.
    • Migrate Dataset APIs: with_column, select_columns, rename_columns to build named expressions; count uses empty exprs with preserve_existing=False.
  • Projection Pushdown/Fusion:
    • Replace rule with expression-aware fusion of consecutive Projects (handles renames, substitution, and visibility via preserve_existing).
    • Push down required columns into reads only for selection-style projects.
  • Arrow/Pandas Block & Shuffle:
    • Arrow concat: preserve rows for 0-column tables via stub column; fill_column uses upsert_column; num_rows returns raw num_rows.
    • Remove row-count assert in shuffle map; minor filter import path changes.
  • Expressions minor:
    • Fix UnaryExpr default data_type initialization.
  • Tests/Plans:
    • Update tests to new expression APIs and plan expectations; add comprehensive projection fusion tests; adjust stats/explain outputs.

Written by Cursor Bugbot for commit dbf609a. This will update automatically on new commits. Configure here.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner October 1, 2025 03:52
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Oct 1, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant refactoring to unify projection-like operations (select_columns, with_column, rename_columns) under a single Project logical operator that uses expressions. This is a great improvement for consistency and maintainability. The changes include moving the expression evaluator, implementing a visitor pattern for evaluation, and rewriting the projection pushdown logic to correctly fuse chains of projection operations. The new logic is more robust and handles complex cases like column swaps and chained renames correctly. I've found a few minor issues with type hints and docstrings that could be improved for clarity. Overall, this is a high-quality contribution.

cursor[bot]

This comment was marked as outdated.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed refactoring that unifies the select_columns, rename_columns, and with_column operations into a single expression-based Project operator. This change enhances the planner's optimization capabilities. The implementation is solid, introducing a visitor pattern for expression evaluation and rewriting the projection pushdown logic to correctly handle complex chains of operations. My review focuses on the new, complex logic in the planner and expression evaluator, with suggestions to improve type hint accuracy and simplify some logic for better long-term maintainability.

Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is an excellent and significant refactoring that unifies the Project operator to be fully expression-driven. The changes, including moving the expression evaluator to an internal module, adopting the visitor pattern for evaluation, and overhauling the projection pushdown logic, greatly improve the code's structure, maintainability, and correctness. The new ProjectionPushdown implementation is particularly impressive in how it handles complex chains of select, with_column, and rename operations. The addition of comprehensive tests is also a major plus. I've found one minor issue related to error handling that could be improved.

cursor[bot]

This comment was marked as outdated.

Signed-off-by: Goutam V. <[email protected]>
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a major and well-executed refactoring that unifies select_columns, with_column, and rename_columns under a single expression-based Project logical operator. This is a significant improvement that simplifies the logical plan and enables more powerful and robust optimizations, particularly for projection fusion and pushdown. The changes are extensive, touching the logical operators, the physical planner, the expression evaluation logic, and the public Dataset API. The new implementation of projection pushdown is much cleaner and correctly handles complex chains of operations. The introduction of a visitor-based expression evaluator is also a solid design choice. I've found one minor issue with error handling in a helper function, but overall this is an excellent refactoring.

cursor[bot]

This comment was marked as outdated.

Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed refactoring that unifies various column operations (with_column, select_columns, rename_columns) under a single expression-based Project operator. Moving the expression evaluation logic to an internal module and adopting the visitor pattern is a great architectural improvement. The introduction of StarColumnsExpr and star() provides a clean mechanism for additive column operations. The rewrite of the projection pushdown rule is particularly impressive, enabling more powerful and robust fusion of consecutive Project operators. Overall, these changes greatly improve the clarity, consistency, and maintainability of the Ray Data API and its underlying implementation. I've included a few suggestions for further improvement.

Signed-off-by: Goutam <[email protected]>
elif _is_pa_string_like(x) and isinstance(x, (pa.Array, pa.ChunkedArray)):
x = _pa_decode_dict_string_array(x)
else:
raise
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Unconditional Raising Causes Runtime Error

A bare raise statement without an explicit exception type is present. This attempts to re-raise the last active exception, but if no exception is currently being handled, it will cause a RuntimeError.

Fix in Cursor Fix in Web

Comment on lines 31 to 33
# If any expression is star(), we need all columns
if any(isinstance(expr, StarColumnsExpr) for expr in exprs):
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This encoding (none -> all) is implicit -- let's make it explicit.

To make it explicit we'd need to add an expression resolution (against schema) step. We don't need to do in this PR, but let's cut a ticket and link it here

"""
referenced_columns: Set[str] = set()

def visit_expr(expr: Expr) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Make all traversals a corresponding Visitor (and make visitors w/o state singletons)

@ray-project ray-project deleted a comment from alexeykudinkin Oct 15, 2025

logger = logging.getLogger(__name__)

class _ColumnReferenceCollector(_ExprVisitor):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is impossible to review now

operand: Expr

data_type: DataType = field(init=False)
data_type: DataType = field(default_factory=lambda: DataType(object), init=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this default mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment in new PR. It's the default return dtype. For UnaryExpr it should actually be bool not object.

Without this defined, further expressions can't be chained. If I have expr1.is_not_null().alias('x') then we need to ensure that AliasExpr gets the datatype propagated from the UnaryExpr

Comment on lines +727 to +728
if len(exprs) == 1 and isinstance(exprs[0], StarColumnsExpr):
return block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand stars before continuing processing (will later move this into resolving stage)

Copy link
Contributor Author

@goutamvenkat-anyscale goutamvenkat-anyscale Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean creating a list of col expressions?

  1. That requires the schema
  2. If I have 100k+ columns, isn't that slow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in new PR

Comment on lines +737 to +738
if isinstance(expr, StarColumnsExpr):
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in new PR

@goutamvenkat-anyscale
Copy link
Contributor Author

Closing in favor of #57855

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ray Data] Add a force overwrite option to the rename_columns()

4 participants