[Data] - Ray Data Compute Expressions #58674

@goutamvenkat-anyscale

Ray Data’s Expression Namespace System

Summary

Ray Data recently added type-specific expression namespaces that use PyArrow compute functions under the hood:

  • .str — string operations
  • .list — list / variable-length array operations
  • .struct — struct field access

These significantly improve usability, readability, and discoverability of expressions.
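As a rough illustration of the accessor pattern behind `.str`, `.list`, and `.struct`, the sketch below shows a property on an expression object returning a small namespace class whose methods build new expression nodes. All class names here (`Expr`, `Col`, `Call`, `StrNamespace`) and the evaluation against a plain dict of Python lists are hypothetical. This is not Ray Data's implementation, and it uses Python string methods in place of PyArrow compute functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# A batch is modeled as column name -> list of values (an assumption for this sketch).
Batch = Dict[str, List[Any]]


class Expr:
    """Hypothetical base class for a lazily evaluated column expression."""

    def evaluate(self, batch: Batch) -> List[Any]:
        raise NotImplementedError

    @property
    def str(self) -> "StrNamespace":
        # Type-specific methods live on the namespace, not on Expr itself.
        return StrNamespace(self)


@dataclass
class Col(Expr):
    """Reference to a column by name."""

    name: str

    def evaluate(self, batch: Batch) -> List[Any]:
        return batch[self.name]


@dataclass
class Call(Expr):
    """Apply a one-argument function to each non-null value of a child expression."""

    fn: Callable[[Any], Any]
    arg: Expr

    def evaluate(self, batch: Batch) -> List[Any]:
        return [None if v is None else self.fn(v) for v in self.arg.evaluate(batch)]


class StrNamespace:
    """Groups string operations so they are discoverable under `.str`."""

    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def upper(self) -> Expr:
        return Call(str.upper, self._expr)

    def strip(self) -> Expr:
        return Call(str.strip, self._expr)
```

Because each namespace method returns an `Expr`, calls chain naturally, for example `Col("name").str.strip().str.upper()`, and nothing is computed until `evaluate` is called on a batch.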

This proposal outlines the next phase, which completes the namespace system by adding:

  • .dt — datetime operations
  • .arr — fixed-size array operations
  • .map — map/dict-like operations
  • .image — image-specific operations (blur, etc.)
  • .uri — URI operations: download(), upload()
    ...

Proposed Additions

| Namespace / Location | Category | Functions |
|---|---|---|
| Expr (direct) | Arithmetic | negate, sign, power, abs |
| Expr (direct) | Rounding | ceil, floor, round, trunc |
| Expr (direct) | Logarithmic | ln, log10, log2, exp |
| Expr (direct) | Trigonometric | sin, cos, tan, asin, acos, atan |
| Expr (direct) | Null Handling | fill_null, is_finite, is_inf, is_nan |
| Expr (direct) | Type Conversion | cast |
| .str | Predicates | is_alpha, is_digit, is_lower, is_upper |
| .str | Transforms | upper, lower, capitalize, reverse |
| .str | Manipulation | replace, strip, slice, repeat |
| .str | Padding | lpad, rpad, center |
| .str | Splitting | split, split_pattern |
| .str | Extraction | extract, find, count_substring |
| .dt | Extraction | year, month, day, hour, minute, second |
| .dt | Formatting | strftime |
| .dt | Timezone | assume_timezone, tz_convert |
| .list | Operations | len, get, slice, sort, flatten |
| .list | Aggregations | sum, mean, min, max |
| .list | Set Operations | union, intersection, difference |
| .struct | Operations | field, field_by_index |
| .map | Operations | keys, values |
| .arr | Operations | flatten, to_list |
| .uri | Multimodal | download, upload |
| .image | Multimodal | resize, gaussian_blur (and others) |

Caveat

Some of these expressions require multiple stages (e.g., .uri.download()).
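One possible reading of this caveat is that I/O-bound operations such as a download cannot be fused with ordinary compute expressions and therefore need their own stage. The sketch below groups a chain of operations into consecutive stages by kind. The `"io"` / `"compute"` tags, the `split_into_stages` helper, and the idea of planning this way are all assumptions for illustration, not a description of how Ray Data plans these expressions.

```python
from itertools import groupby
from typing import List, Tuple

# (operation name, kind) pairs; the kind tags are hypothetical.
Op = Tuple[str, str]


def split_into_stages(ops: List[Op]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operations of the same kind into one stage."""
    return [
        (kind, [name for name, _ in group])
        for kind, group in groupby(ops, key=lambda op: op[1])
    ]


chain = [
    ("uri.download", "io"),
    ("image.resize", "compute"),
    ("image.gaussian_blur", "compute"),
    ("uri.upload", "io"),
]
stages = split_into_stages(chain)
for kind, names in stages:
    print(kind, names)
```

Under these assumptions the chain splits into three stages: a download stage, one fused compute stage for the two image operations, and an upload stage.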

Example Usage

Datetime

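from ray.data.expressions import col

# `ds` is an existing ray.data.Dataset with a timestamp column "ts"; `.dt` is the proposed namespace.
ds = ds.with_column("year", col("ts").dt.year())
ds = ds.with_column("pretty", col("ts").dt.strftime("%Y-%m-%d"))
ds = ds.with_column("next_hour", col("ts").dt.ceil("hour"))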
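For reference, the per-row values these three expressions would presumably produce can be sketched with the standard library alone. The rounding rule for `ceil("hour")` (timestamps already on an hour boundary are left unchanged) is an assumption, and the helper name `ceil_to_hour` is hypothetical.

```python
from datetime import datetime, timedelta


def ceil_to_hour(ts: datetime) -> datetime:
    """Round up to the next whole hour; exact hours are returned unchanged."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored if floored == ts else floored + timedelta(hours=1)


ts = datetime(2024, 3, 15, 13, 45, 30)

year = ts.year                    # mirrors col("ts").dt.year()
pretty = ts.strftime("%Y-%m-%d")  # mirrors col("ts").dt.strftime("%Y-%m-%d")
next_hour = ceil_to_hour(ts)      # mirrors col("ts").dt.ceil("hour"), under the assumed rule

print(year, pretty, next_hour)
```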
