[FEA] Accelerate conversion from arrow::StringViewType to arrow::StringType in libcudf interop #15298

Open
GregoryKimball opened this issue Mar 13, 2024 · 2 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@GregoryKimball
Contributor

GregoryKimball commented Mar 13, 2024

Is your feature request related to a problem? Please describe.
The Arrow 15 specification defines "arrow::StringViewType", an alternate representation of "arrow::StringType". String views are also referred to as Umbra strings or prefix strings.

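A string view column consists of two kinds of buffers:

  1. A column of 16-byte fixed-width elements (the views). The first 4 bytes contain the string size
  • If size <= 12, then the string is stored inline in the remaining 12 bytes (short string optimization)
  • If size > 12, then the string is stored separately in a data buffer. In the Arrow layout, the remaining 12 bytes are 4 bytes for the first 4 chars of the string (the prefix) + 4 bytes for the data buffer index + 4 bytes for the offset into that buffer. Umbra-style in-memory implementations instead use the 4-byte prefix + an 8-byte pointer to the string
  2. One or more character data buffers storing the full bytes of the long (non-inlined) strings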
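To make the layout concrete, here is a minimal host-side C++ sketch that encodes and decodes a single 16-byte view. It assumes the Arrow columnar layout for long strings (4-byte prefix, 4-byte data buffer index, 4-byte offset) and native little-endian byte order. The names `StringView16`, `make_view` and `decode_view` are hypothetical helpers for illustration, not part of Arrow or libcudf.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// One 16-byte view, kept as raw bytes so the layout is explicit.
struct StringView16 {
  unsigned char bytes[16];
};

constexpr int32_t kInlineSize = 12;  // strings up to 12 bytes are stored inline

// Read a 4-byte integer at byte position `pos` (memcpy avoids aliasing issues).
inline int32_t read_i32(const StringView16& v, int pos) {
  int32_t out;
  std::memcpy(&out, v.bytes + pos, sizeof(out));
  return out;
}

// Build a view for `s`. Long strings are appended to data_buffers[buffer_index].
inline StringView16 make_view(const std::string& s,
                              std::vector<std::string>& data_buffers,
                              int32_t buffer_index) {
  StringView16 v{};  // zero-initialized, so unused inline bytes are zero padding
  const int32_t size = static_cast<int32_t>(s.size());
  std::memcpy(v.bytes, &size, 4);
  if (size <= kInlineSize) {
    std::memcpy(v.bytes + 4, s.data(), s.size());  // whole string inline
  } else {
    const int32_t offset = static_cast<int32_t>(data_buffers[buffer_index].size());
    data_buffers[buffer_index] += s;             // full string, including the prefix
    std::memcpy(v.bytes + 4, s.data(), 4);       // 4-byte prefix
    std::memcpy(v.bytes + 8, &buffer_index, 4);  // which data buffer
    std::memcpy(v.bytes + 12, &offset, 4);       // where in that buffer
  }
  return v;
}

// Recover the string a view refers to.
inline std::string decode_view(const StringView16& v,
                               const std::vector<std::string>& data_buffers) {
  const int32_t size = read_i32(v, 0);
  if (size <= kInlineSize) {
    return std::string(reinterpret_cast<const char*>(v.bytes + 4), size);
  }
  const int32_t buffer_index = read_i32(v, 8);
  const int32_t offset = read_i32(v, 12);
  return data_buffers[buffer_index].substr(offset, size);
}
```

Round-tripping a 5-byte, a 12-byte and a longer string through `make_view` and `decode_view` exercises both branches; only the longer string ends up in the data buffer.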

String view type enables some performance optimizations:

  • ability to slice strings (e.g. left(10)) in place without a copy
  • ability to replace with smaller strings (e.g. replace("aa", "a")) in place without a copy
  • inlined strings can be written in any order and without knowing the column size
  • better memory access patterns for the first 4 bytes (e.g. startswith("a"))
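As an illustration of the last point, the sketch below answers a `startswith` query for needles of up to 4 bytes by reading only the 16-byte view, never the character data. It relies on the fact that bytes 4 to 7 of a view hold the first 4 bytes of the string in both the inline and the long case. `StringView16` and `starts_with_short` are hypothetical names, and the 4-byte needle limit is a simplification of this sketch; longer needles would need a fallback that reads the full string.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// One 16-byte view: 4-byte size, then either 12 inline bytes or
// a 4-byte prefix followed by 8 bytes locating the string.
struct StringView16 {
  unsigned char bytes[16];
};

// True if the viewed string starts with `needle`.
// Precondition of this sketch: needle.size() <= 4, so the prefix bytes suffice.
inline bool starts_with_short(const StringView16& v, const std::string& needle) {
  if (needle.size() > 4) {
    throw std::invalid_argument("needle longer than the 4-byte prefix");
  }
  int32_t size;
  std::memcpy(&size, v.bytes, 4);
  if (static_cast<std::size_t>(size) < needle.size()) {
    return false;  // string is shorter than the needle
  }
  // Bytes 4..7 hold the first 4 chars for both inline and long strings.
  return std::memcmp(v.bytes + 4, needle.data(), needle.size()) == 0;
}
```

Because every view has the same width and the prefix sits at a fixed position, a scan over a column touches one contiguous 16-byte record per row.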

Describe the solution you'd like
Let's add interop support for string view in from_arrow, with CUDA C++ code that accepts string views and converts them to libcudf strings columns. We may also want to add string view compatibility to to_arrow, so we can hand off libcudf strings columns to host libraries that expect string views. We should be able to write CUDA C++ code that efficiently transforms arrow::StringViewType buffers into arrow::StringType buffers.
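One way such a transform could be structured is a two-pass gather: compute the output offsets from the view sizes, then copy each string's bytes to its offset. The sketch below is plain host-side C++ rather than CUDA, and it is not the libcudf implementation. It assumes the Arrow view layout (4-byte size, then inline bytes or prefix + buffer index + offset), ignores the validity bitmap, and assumes the total size fits in 32-bit offsets. `StringView16`, `StringColumn` and `views_to_strings` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct StringView16 {
  unsigned char bytes[16];
};

// arrow::StringType-style output: num_rows + 1 offsets plus one chars buffer.
struct StringColumn {
  std::vector<int32_t> offsets;
  std::vector<char> chars;
};

inline int32_t read_i32(const StringView16& v, int pos) {
  int32_t out;
  std::memcpy(&out, v.bytes + pos, sizeof(out));
  return out;
}

inline StringColumn views_to_strings(
    const std::vector<StringView16>& views,
    const std::vector<std::vector<char>>& data_buffers) {
  StringColumn out;
  out.offsets.resize(views.size() + 1);

  // Pass 1: exclusive scan over the sizes yields the output offsets.
  int32_t total = 0;
  for (std::size_t i = 0; i < views.size(); ++i) {
    out.offsets[i] = total;
    total += read_i32(views[i], 0);
  }
  out.offsets[views.size()] = total;
  out.chars.resize(total);

  // Pass 2: rows are independent, so each copy could run in its own thread.
  for (std::size_t i = 0; i < views.size(); ++i) {
    const int32_t size = read_i32(views[i], 0);
    if (size == 0) continue;
    const char* src;
    if (size <= 12) {
      src = reinterpret_cast<const char*>(views[i].bytes + 4);  // inline bytes
    } else {
      const int32_t buffer_index = read_i32(views[i], 8);
      const int32_t offset = read_i32(views[i], 12);
      src = data_buffers[buffer_index].data() + offset;  // long string
    }
    std::memcpy(out.chars.data() + out.offsets[i], src, size);
  }
  return out;
}
```

On the device, the first pass would presumably map to a parallel prefix sum and the second to a per-row (or per-byte, for long strings) copy kernel; both loops are written so that no iteration depends on another iteration's output other than through the offsets.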

Describe alternatives you've considered
Force libcudf users to convert their string views into strings on the host before passing the data to the device.

Additional context
Velox supports a string view type (ref1, ref2), Polars has switched to a string view representation, and DuckDB supports string view.

We may choose to investigate using string views in libcudf at some point, but for the foreseeable future string view refactoring will be lower priority than supporting large strings and improving performance with long strings.

@GregoryKimball GregoryKimball added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Mar 13, 2024
@GregoryKimball GregoryKimball added the Spark Functionality that helps Spark RAPIDS label Mar 22, 2024
@JayjeetAtGithub JayjeetAtGithub self-assigned this Jul 22, 2024
@JayjeetAtGithub
Contributor

JayjeetAtGithub commented Aug 8, 2024

An interop example converting arrow::StringViewArray to cudf::column is in #16498. We can integrate this example into the interop module once nanoarrow supports string view types (discussion).

@JayjeetAtGithub JayjeetAtGithub removed their assignment Aug 8, 2024
@GregoryKimball
Contributor Author

This may now be unblocked by apache/arrow-nanoarrow#596.
