[FEA] Accelerate conversion from arrow::StringViewType to arrow::StringType in libcudf interop #15298
Labels: feature request, libcudf, Spark
Is your feature request related to a problem? Please describe.
The Arrow 15 specification includes a definition of "arrow::StringViewType" - an alternate representation of the "arrow::StringType". You may find "String view" also referred to as Umbra string or prefix string.
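A string view consists of two columns: a fixed-width column of 16-byte views (one per row) and the variable-length data buffers that hold the bytes of strings too long to be stored inline in the view.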
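As a rough illustration of that layout, assuming the view format described in the Arrow columnar specification (one 16-byte view per row, with strings of up to 12 bytes stored inline and longer strings stored as a 4-byte prefix plus a buffer index and offset into a separate data buffer), a view could be modeled in C++ as below. The type and function names are hypothetical and are not part of Arrow or libcudf; null handling is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed layout, following the Arrow columnar format: 16 bytes per row.
struct InlineView {      // used when length <= 12
  int32_t length;
  char data[12];         // string bytes, zero padded
};
struct RefView {         // used when length > 12
  int32_t length;
  char prefix[4];        // first four bytes of the string
  int32_t buffer_index;  // which data buffer holds the bytes
  int32_t offset;        // byte offset inside that buffer
};
union StringView16 {     // hypothetical name for this sketch
  InlineView inlined;
  RefView ref;
};
static_assert(sizeof(StringView16) == 16, "one view is 16 bytes");

constexpr int32_t kMaxInline = 12;

// Encode one string; long strings are appended to data_buffers[buffer_index].
StringView16 make_view(const std::string& s,
                       std::vector<std::vector<char>>& data_buffers,
                       int32_t buffer_index) {
  StringView16 v;
  const auto len = static_cast<int32_t>(s.size());
  if (len <= kMaxInline) {
    InlineView iv{};  // zero-initialized, so unused bytes stay zero
    iv.length = len;
    std::memcpy(iv.data, s.data(), s.size());
    v.inlined = iv;
  } else {
    auto& buf = data_buffers[buffer_index];
    RefView rv{};
    rv.length = len;
    std::memcpy(rv.prefix, s.data(), 4);
    rv.buffer_index = buffer_index;
    rv.offset = static_cast<int32_t>(buf.size());
    buf.insert(buf.end(), s.begin(), s.end());
    v.ref = rv;
  }
  return v;
}

// Decode one view back into an owning std::string.
std::string read_view(const StringView16& v,
                      const std::vector<std::vector<char>>& data_buffers) {
  const int32_t len = v.inlined.length;  // length is the shared first field
  if (len <= kMaxInline) {
    return std::string(v.inlined.data, static_cast<size_t>(len));
  }
  const auto& buf = data_buffers[v.ref.buffer_index];
  assert(static_cast<size_t>(v.ref.offset) + static_cast<size_t>(len) <= buf.size());
  return std::string(buf.data() + v.ref.offset, static_cast<size_t>(len));
}
```

Because the length and the first bytes of every string live in the fixed-width view, some checks can be answered from the view alone, without touching the data buffers.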
String view type enables some performance optimizations:
- Operations such as `left(10)` in place without a copy
- Operations such as `replace("aa", "a")` in place without a copy
- Operations such as `startswith("a")`

Describe the solution you'd like
Let's add interop support for string view in `from_arrow`, with CUDA C++ code to accept string views and convert them to libcudf strings columns. We may also want to add string view compatibility to `to_arrow`, so we can hand off libcudf strings columns to host libraries that expect string views. We should be able to write CUDA C++ code to efficiently transform `arrow::StringViewType` buffers into `arrow::StringType` buffers.

Describe alternatives you've considered
Force libcudf users to convert their string views into strings on the host before passing the data to the device.
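For reference, the host-side conversion this alternative implies can be sketched as follows. This is a minimal illustration, not libcudf or Arrow code: the type and function names are hypothetical, the view layout is assumed to follow the Arrow columnar format (16-byte views, strings of up to 12 bytes inlined), and validity bitmaps are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed 16-byte view layout (hypothetical names for this sketch).
struct InlineView { int32_t length; char data[12]; };
struct RefView { int32_t length; char prefix[4]; int32_t buffer_index; int32_t offset; };
union StringView16 { InlineView inlined; RefView ref; };

// arrow::StringType-style output: n + 1 offsets plus one contiguous chars buffer.
struct StringColumn {
  std::vector<int32_t> offsets;
  std::vector<char> chars;
};

StringColumn views_to_strings(const std::vector<StringView16>& views,
                              const std::vector<std::vector<char>>& data_buffers) {
  StringColumn out;
  const size_t n = views.size();

  // Pass 1: an exclusive scan over the lengths yields the output offsets.
  out.offsets.resize(n + 1);
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) {
    out.offsets[i] = static_cast<int32_t>(total);
    total += views[i].inlined.length;  // length is the shared first field
  }
  assert(total <= INT32_MAX);  // 32-bit offsets only in this sketch
  out.offsets[n] = static_cast<int32_t>(total);

  // Pass 2: gather each string's bytes from the view or from its data buffer.
  out.chars.resize(static_cast<size_t>(total));
  for (size_t i = 0; i < n; ++i) {
    const int32_t len = views[i].inlined.length;
    if (len == 0) continue;
    const char* src =
        len <= 12 ? views[i].inlined.data
                  : data_buffers[views[i].ref.buffer_index].data() + views[i].ref.offset;
    std::memcpy(out.chars.data() + out.offsets[i], src, static_cast<size_t>(len));
  }
  return out;
}
```

Both passes are independent per row apart from the scan, so the same structure (compute sizes, scan to offsets, gather bytes) would plausibly map onto a device implementation; the sketch only shows the work that would otherwise fall on host callers.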
Additional context
Velox supports a string view type (ref1, ref2), Polars has switched to a string view representation, and DuckDB supports string view.
We may choose to investigate using string views in libcudf at some point, but for the foreseeable future string view refactoring will be lower priority than supporting large strings and improving performance with long strings.