[FEA] Accelerate conversion from `arrow::StringViewType` to `arrow::StringType` in libcudf interop #15298

GregoryKimball · 2024-03-13T23:40:41Z

Is your feature request related to a problem? Please describe.
Arrow 15 adds `arrow::StringViewType`, an alternate representation of `arrow::StringType`. String views are also referred to as Umbra strings or prefix strings.

A string view consists of two columns:

A column of 16-byte fixed-width elements. The first 4 bytes contain the string size.

If size <= 12, the string is stored inline in the remaining 12 bytes (short string optimization).
If size > 12, the string is stored separately in the second column. The remaining 12 bytes hold the first 4 chars of the string, followed by 8 bytes that locate the string: a pointer in Umbra-style implementations, or a 4-byte buffer index plus a 4-byte offset in the Arrow format.

A column of characters storing the strings that are too long to inline.
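A minimal host-side sketch of this layout, assuming the Arrow form of the 16-byte element (4-byte size, then either 12 inline bytes or a 4-byte prefix, 4-byte buffer index, and 4-byte offset). The names `StringViewElement`, `make_view`, and `resolve` are hypothetical and are not part of Arrow or libcudf:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view element: 4 bytes of size plus 12 payload bytes.
struct StringViewElement {
  std::int32_t size;
  char rest[12];  // inline bytes, or prefix[4] + buffer_index[4] + offset[4]
};
static_assert(sizeof(StringViewElement) == 16, "a view element is 16 bytes");

constexpr std::int32_t kInlineLimit = 12;

// Build a view for `s`. For long strings the caller states where the bytes
// already live (buffer_index, offset); those arguments are ignored otherwise.
StringViewElement make_view(std::string_view s, std::int32_t buffer_index,
                            std::int32_t offset) {
  StringViewElement v{};  // zero-initialized, so inline padding is zero
  v.size = static_cast<std::int32_t>(s.size());
  if (s.empty()) return v;
  if (v.size <= kInlineLimit) {
    std::memcpy(v.rest, s.data(), s.size());     // whole string inline
  } else {
    std::memcpy(v.rest, s.data(), 4);            // 4-byte prefix
    std::memcpy(v.rest + 4, &buffer_index, 4);   // which data buffer
    std::memcpy(v.rest + 8, &offset, 4);         // where in that buffer
  }
  return v;
}

// Recover the string bytes a view refers to. The result points into `v`
// (short strings) or into `buffers` (long strings); both must outlive it.
std::string_view resolve(const StringViewElement& v,
                         const std::vector<std::string>& buffers) {
  assert(v.size >= 0);
  if (v.size <= kInlineLimit) {
    return std::string_view(v.rest, static_cast<std::size_t>(v.size));
  }
  std::int32_t buffer_index = 0;
  std::int32_t offset = 0;
  std::memcpy(&buffer_index, v.rest + 4, 4);
  std::memcpy(&offset, v.rest + 8, 4);
  return std::string_view(buffers[static_cast<std::size_t>(buffer_index)])
      .substr(static_cast<std::size_t>(offset), static_cast<std::size_t>(v.size));
}
```

Reading fields through `memcpy` instead of a union avoids relying on type punning, at the cost of slightly more verbose code.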

String view type enables some performance optimizations:

ability to slice strings (e.g. left(10)) in place without a copy
ability to replace with smaller strings (e.g. replace("aa", "a")) in place without a copy
inlined strings can be written in any order and without knowing the column size
better memory access patterns for the first 4 bytes (e.g. startswith("a"))
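To illustrate the last point, here is a hedged sketch of a prefix test that only reads the fixed-width column. It relies on the fact that the 4 bytes after the size hold the start of the string in both the inline and the out-of-line case. The struct and function names are hypothetical, and the sketch only handles search strings of at most 4 bytes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view element: 4 bytes of size plus 12 payload bytes.
// The first 4 payload bytes are always the first bytes of the string.
struct StringViewElement {
  std::int32_t size;
  char rest[12];
};

// True if the viewed string starts with `needle` (at most 4 bytes long).
// Never dereferences the character data column.
bool starts_with_short(const StringViewElement& v, std::string_view needle) {
  assert(needle.size() <= 4);
  if (needle.empty()) return true;
  if (static_cast<std::size_t>(v.size) < needle.size()) return false;
  return std::memcmp(v.rest, needle.data(), needle.size()) == 0;
}

// Count matches over a whole column by scanning only the 16-byte elements.
std::size_t count_starts_with(const std::vector<StringViewElement>& views,
                              std::string_view needle) {
  std::size_t n = 0;
  for (const StringViewElement& v : views) {
    n += starts_with_short(v, needle) ? 1 : 0;
  }
  return n;
}
```

Longer search strings would still need to follow the reference into the character data, but only for rows whose 4-byte prefix already matches.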

Describe the solution you'd like
Let's add interop support for string views in `from_arrow`, with CUDA C++ code that accepts string views and converts them to libcudf strings columns. We may also want to add string view compatibility to `to_arrow`, so we can hand off libcudf strings columns to host libraries that expect string views. We should be able to write CUDA C++ code that efficiently transforms `arrow::StringViewType` buffers into `arrow::StringType` buffers.
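As a rough illustration of the transform, here is a host-side C++ sketch of one possible two-pass approach: a prefix sum over the sizes produces the `arrow::StringType` offsets, then each string is copied to its slot. This is an assumption about how such a conversion could be structured, not the libcudf implementation; on the GPU the first pass would presumably become a parallel scan and the second a per-row copy kernel. The names are hypothetical, the view layout is assumed to be Arrow's (size, prefix, buffer index, offset), and nulls are ignored:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical 16-byte view element: 4 bytes of size plus 12 payload bytes
// (inline bytes, or prefix[4] + buffer_index[4] + offset[4]).
struct StringViewElement {
  std::int32_t size;
  char rest[12];
};

// Hypothetical stand-in for a StringType column: n + 1 offsets plus chars.
struct StringColumn {
  std::vector<std::int32_t> offsets;
  std::string chars;
};

StringColumn views_to_strings(const std::vector<StringViewElement>& views,
                              const std::vector<std::string>& data_buffers) {
  StringColumn out;

  // Pass 1: running sum of sizes gives the offsets.
  out.offsets.resize(views.size() + 1);
  out.offsets[0] = 0;
  for (std::size_t i = 0; i < views.size(); ++i) {
    assert(views[i].size >= 0);
    out.offsets[i + 1] = out.offsets[i] + views[i].size;
  }

  // Pass 2: copy each string's bytes into its slot in the chars buffer.
  out.chars.resize(static_cast<std::size_t>(out.offsets.back()));
  for (std::size_t i = 0; i < views.size(); ++i) {
    const StringViewElement& v = views[i];
    if (v.size == 0) continue;
    const char* src = v.rest;  // short strings live inside the view
    if (v.size > 12) {         // long strings live in a data buffer
      std::int32_t buffer_index = 0;
      std::int32_t offset = 0;
      std::memcpy(&buffer_index, v.rest + 4, 4);
      std::memcpy(&offset, v.rest + 8, 4);
      src = data_buffers[static_cast<std::size_t>(buffer_index)].data() + offset;
    }
    std::memcpy(&out.chars[static_cast<std::size_t>(out.offsets[i])], src,
                static_cast<std::size_t>(v.size));
  }
  return out;
}
```

The sketch uses 32-bit offsets, so it only covers columns whose total character count fits in an `int32_t`.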

Describe alternatives you've considered
Force libcudf users to convert their string views into strings on the host before passing the data to the device.

Additional context
Velox supports a string view type (ref1, ref2), Polars has switched to a string view representation, and DuckDB supports string view.

We may choose to investigate using string views in libcudf at some point, but for the foreseeable future string view refactoring will be lower priority than supporting large strings and improving performance with long strings.

JayjeetAtGithub · 2024-08-08T05:40:43Z

An interop example for arrow::StringViewArray to cudf::column is in #16498. We can integrate this example into the interop module once nanoarrow supports string view types (discussion).

GregoryKimball · 2024-10-09T16:24:02Z

This may be unblocked by apache/arrow-nanoarrow#596 now

GregoryKimball added the feature request and libcudf labels Mar 13, 2024

GregoryKimball added this to the Helps libcudf C++ integrations milestone Mar 13, 2024

GregoryKimball added this to libcudf Mar 13, 2024

GregoryKimball added the Spark label Mar 22, 2024

JayjeetAtGithub self-assigned this Jul 22, 2024

JayjeetAtGithub removed their assignment Aug 8, 2024

GregoryKimball mentioned this issue Oct 9, 2024

Add interop example for arrow::StringViewArray to cudf::column #16498

Merged


davidwendt mentioned this issue Mar 3, 2025

Add interop support from arrow StringView to libcudf strings column #18107

Open

