@goutamvenkat-anyscale goutamvenkat-anyscale commented Oct 31, 2025

Description

Replace `map_batches` and NumPy invocations with `with_column` and Arrow kernels in the TPC-H Q1 benchmark.

Release test: https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement, replacing map_batches with with_column and Arrow expressions for the TPC-H Q1 benchmark. This change enhances both performance and readability, making the implementation much cleaner. I've identified a couple of opportunities to further improve the code by reusing newly created float columns, which will help avoid redundant computations and increase clarity.

Signed-off-by: Goutam <[email protected]>
@goutamvenkat-anyscale added the "data" (Ray Data-related issues) and "go" (add ONLY when ready to merge, run all tests) labels on Oct 31, 2025
Signed-off-by: Goutam <[email protected]>

# Build float views + derived columns
ds = (
ds.with_column("l_quantity_f", to_f64(col("column04")))

You'll need to rename the columns first, since our dataset has placeholder column names (such as column04).

Signed-off-by: Goutam <[email protected]>
Comment on lines +39 to +57
.rename_columns(
{
"column00": "l_orderkey",
"column02": "l_suppkey",
"column03": "l_linenumber",
"column04": "l_quantity",
"column05": "l_extendedprice",
"column06": "l_discount",
"column07": "l_tax",
"column08": "l_returnflag",
"column09": "l_linestatus",
"column10": "l_shipdate",
"column11": "l_commitdate",
"column12": "l_receiptdate",
"column13": "l_shipinstruct",
"column14": "l_shipmode",
"column15": "l_comment",
}
)

@iamjustinhsu

@goutamvenkat-anyscale how were you able to get around the empty partition?

@alexeykudinkin alexeykudinkin merged commit 66c857c into ray-project:master Oct 31, 2025
6 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025