Open
Labels: data, enhancement, good-first-issue, triage, usability
Description
Ray Data’s Expression Namespace System
Summary
Ray Data recently added type-specific expression namespaces that use PyArrow compute functions under the hood:
- .str: string operations
- .list: list / variable-length array operations
- .struct: struct field access
These significantly improve usability, readability, and discoverability of expressions.
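To make the namespace idea concrete, here is a minimal pure-Python sketch of the accessor pattern. It is not Ray Data's implementation: the `Expr`, `_StrNamespace`, `_ListNamespace` classes and this toy `col` helper are hypothetical stand-ins that evaluate one row dict at a time, whereas the real expressions delegate to PyArrow compute functions over whole columns.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


class Expr:
    """Hypothetical toy expression: wraps a function from a row dict to a value."""

    def __init__(self, fn: Callable[[Row], Any]):
        self._fn = fn

    def eval(self, row: Row) -> Any:
        return self._fn(row)

    @property
    def str(self) -> "_StrNamespace":
        # Groups string-only operations under one attribute for discoverability.
        return _StrNamespace(self)

    @property
    def list(self) -> "_ListNamespace":
        # Groups list-only operations under one attribute.
        return _ListNamespace(self)


class _StrNamespace:
    def __init__(self, expr: Expr):
        self._expr = expr

    def upper(self) -> Expr:
        return Expr(lambda row: self._expr.eval(row).upper())


class _ListNamespace:
    def __init__(self, expr: Expr):
        self._expr = expr

    def len(self) -> Expr:
        # Inside the lambda, `len` resolves to the builtin, not this method.
        return Expr(lambda row: len(self._expr.eval(row)))


def col(name: str) -> Expr:
    """Toy stand-in for a column reference (assumption, not Ray's `col`)."""
    return Expr(lambda row: row[name])


row = {"name": "ray", "tags": ["a", "b", "c"]}
print(col("name").str.upper().eval(row))  # RAY
print(col("tags").list.len().eval(row))   # 3
```

Keeping type-specific methods on a separate accessor object means the base expression class stays small, and tab completion on `.str` or `.list` only shows operations that make sense for that type.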
This proposal outlines the next phase: completing the namespace system by adding:
- .dt: datetime operations
- .arr: fixed-size array operations
- .map: map/dict-like operations
- .image: image-specific operations (blur, etc.)
- .uri: download(), upload()
...
Proposed Additions
| Namespace / Location | Category | Functions |
|---|---|---|
| Expr (direct) | Arithmetic | negate, sign, power, abs |
| Expr (direct) | Rounding | ceil, floor, round, trunc |
| Expr (direct) | Logarithmic | ln, log10, log2, exp |
| Expr (direct) | Trigonometric | sin, cos, tan, asin, acos, atan |
| Expr (direct) | Null Handling | fill_null, is_finite, is_inf, is_nan |
| Expr (direct) | Type Conversion | cast |
| .str | Predicates | is_alpha, is_digit, is_lower, is_upper |
| .str | Transforms | upper, lower, capitalize, reverse |
| .str | Manipulation | replace, strip, slice, repeat |
| .str | Padding | lpad, rpad, center |
| .str | Splitting | split, split_pattern |
| .str | Extraction | extract, find, count_substring |
| .dt | Extraction | year, month, day, hour, minute, second |
| .dt | Formatting | strftime |
| .dt | Timezone | assume_timezone, tz_convert |
| .list | Operations | len, get, slice, sort, flatten |
| .list | Aggregations | sum, mean, min, max |
| .list | Set Operations | union, intersection, difference |
| .struct | Operations | field, field_by_index |
| .map | Operations | keys, values |
| .arr | Operations | flatten, to_list |
| .uri | Multimodal | download, upload |
| .image | Multimodal | resize, gaussian_blur (and others) |
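As a rough illustration of what the proposed `.list` aggregations would compute, the sketch below applies `sum`, `mean`, `min`, and `max` to each row's list and returns one value per row. This is plain Python under stated assumptions: `LIST_AGGREGATIONS` and `apply_list_aggregation` are hypothetical names, and handling of nulls and empty lists (which the proposal does not specify) is left out.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical mapping from the proposed .list aggregation names
# to per-row Python callables.
LIST_AGGREGATIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "mean": mean,
    "min": min,
    "max": max,
}


def apply_list_aggregation(name: str, column: List[Sequence[float]]) -> List[float]:
    """Apply one aggregation to every row of a list-typed column."""
    fn = LIST_AGGREGATIONS[name]
    # Each row is itself a list; the result has one scalar per row.
    return [fn(row) for row in column]


scores = [[1, 2, 3], [10, 20], [7]]
print(apply_list_aggregation("sum", scores))  # [6, 30, 7]
print(apply_list_aggregation("max", scores))  # [3, 20, 7]
```

The point is that these functions reduce within a row rather than across the dataset, so the output column has the same number of rows as the input.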
Caveat
Some of these expressions require multiple stages (e.g., .uri.download()).
Example Usage
Datetime
from ray.data.expressions import col

# `ds` is an existing Ray Dataset with a timestamp column "ts".
ds = ds.with_column("year", col("ts").dt.year())
ds = ds.with_column("pretty", col("ts").dt.strftime("%Y-%m-%d"))
ds = ds.with_column("next_hour", col("ts").dt.ceil("hour"))
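For reference, the per-row results these three expressions are meant to produce can be sketched with the standard library alone. The `ceil_to_hour` helper is hypothetical, and it assumes that `.dt.ceil("hour")` rounds up to the next hour boundary and leaves values already on a boundary unchanged; the proposal does not spell out that behavior.

```python
from datetime import datetime, timedelta


def ceil_to_hour(ts: datetime) -> datetime:
    """Round a timestamp up to the next whole hour (assumed semantics)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    # A timestamp already on the hour is returned unchanged.
    return floored if floored == ts else floored + timedelta(hours=1)


ts = datetime(2024, 3, 15, 13, 45, 30)
print(ts.year)                  # 2024
print(ts.strftime("%Y-%m-%d"))  # 2024-03-15
print(ceil_to_hour(ts))         # 2024-03-15 14:00:00
```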