@kosiew kosiew commented Sep 23, 2025

Which issue does this PR close?

Closes #8403.


Rationale for this change

Casting from BinaryView to Utf8View currently attempts a direct conversion with to_string_view(), which returns an error if any value contains invalid UTF‑8. This is inconsistent with the other binary array types in Arrow, which honor CastOptions.safe = true by replacing values that contain invalid UTF‑8 with NULL rather than failing the entire cast.

This PR makes BinaryView's casting behavior consistent with other binary types and with user expectations: when CastOptions.safe is true, any value that is not valid UTF‑8 becomes NULL in the resulting StringViewArray; when CastOptions.safe is false, the cast fails as before.


What changes are included in this PR?

  • Change cast_with_options to delegate the BinaryView -> Utf8View branch to a new helper function cast_binary_view_to_string_view(array, cast_options) instead of directly calling to_string_view() and erroring.

  • Add extend_valid_utf8 helper to centralize the logic of mapping Option<&[u8]> to Option<&str> (using std::str::from_utf8(...).ok()), and reuse it for both GenericStringBuilder and StringViewBuilder flows.

  • Implement cast_binary_view_to_string_view which:

    • Attempts array.clone().to_string_view() (fast, zero-copy path) and returns it when Ok.

    • On Err, checks cast_options.safe:

      • If true, builds a StringViewArray in which every value with invalid UTF‑8 becomes NULL (via extend_valid_utf8) and returns that array.
      • If false, propagates the original error (existing behavior).
  • Add a unit test test_binary_view_to_string_view_with_invalid_utf8 covering both safe=false (expect error) and safe=true (expect NULL where invalid UTF‑8 occurred).
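The control flow described in the list above can be sketched with the standard library alone. This is a simplified model, not the code from the PR: plain slices and a Vec stand in for BinaryViewArray and StringViewBuilder, the signature given to extend_valid_utf8 is an assumption, cast_bytes_to_strs is a hypothetical name, and the strict pass here allocates rather than being zero-copy like to_string_view().

```rust
use std::str::{from_utf8, Utf8Error};

/// Assumed shape of the helper: map each optional byte slice to an optional
/// &str, turning invalid UTF-8 into None (the NULL of this model).
fn extend_valid_utf8<'a, B, I>(builder: &mut B, iter: I)
where
    B: Extend<Option<&'a str>>,
    I: Iterator<Item = Option<&'a [u8]>>,
{
    builder.extend(iter.map(|value| value.and_then(|bytes| from_utf8(bytes).ok())));
}

/// Hypothetical stand-in for the BinaryView -> Utf8View cast.
fn cast_bytes_to_strs<'a>(
    values: &[Option<&'a [u8]>],
    safe: bool,
) -> Result<Vec<Option<&'a str>>, Utf8Error> {
    // Strict pass: succeeds only if every non-null value is valid UTF-8.
    let strict: Result<Vec<Option<&'a str>>, Utf8Error> = values
        .iter()
        .copied()
        .map(|value| value.map(from_utf8).transpose())
        .collect();

    match strict {
        Ok(result) => Ok(result),
        // Safe mode: rebuild the output, nulling out the invalid values.
        Err(_) if safe => {
            let mut out: Vec<Option<&'a str>> = Vec::with_capacity(values.len());
            extend_valid_utf8(&mut out, values.iter().copied());
            Ok(out)
        }
        // Unsafe mode: propagate the original error.
        Err(error) => Err(error),
    }
}
```

Called with the values b"hello", NULL, and the invalid bytes [0xFF, 0xFE], this model returns an error when safe is false and [Some("hello"), None, None] when safe is true, which mirrors the two cases the new unit test is described as covering.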

Files changed (high level):

  • arrow-cast/src/cast/mod.rs: route BinaryView -> Utf8View case to the new helper.
  • arrow-cast/src/cast/string.rs: add extend_valid_utf8 and cast_binary_view_to_string_view, and use extend_valid_utf8 from an existing cast path.

Are there any user-facing changes?

Yes — this changes the observable behavior of casting BinaryView to Utf8View:

  • With CastOptions.safe = true, BinaryView elements that contain invalid UTF‑8 become NULL in the resulting Utf8View array instead of causing the entire cast to fail.
  • With CastOptions.safe = false, invalid UTF‑8 still causes the cast to fail as before.

This is a bug fix aligning BinaryView with the semantics of other binary types and with documented expectations for CastOptions.safe.

No public API surface is changed beyond the fixed behavior; the new helpers are crate-private.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 23, 2025

@alamb alamb left a comment

Thank you @kosiew -- this is a nice improvement in my mind. I kicked off some benchmarks, and as long as they don't show any performance difference (I don't expect that they will), I think this PR is ready to go.


match array.clone().to_string_view() {
Ok(result) => Ok(Arc::new(result)),
Err(error) => match cast_options.safe {

It would be nice to avoid doing the conversion twice when there is non-UTF‑8 data, but the nice thing about the current implementation is that I don't think it will regress performance: it only takes the slower path once we know for sure there is non-UTF‑8 data.
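For comparison, a single-pass variant that validates each value exactly once might look like the sketch below. It is a hypothetical alternative, not something proposed in this PR: cast_single_pass is an invented name, and slices and a Vec again stand in for the Arrow array and builder types.

```rust
use std::str::{from_utf8, Utf8Error};

/// Hypothetical single-pass cast: every value is validated once, and the
/// `safe` flag decides what happens at the first invalid value.
fn cast_single_pass<'a>(
    values: &[Option<&'a [u8]>],
    safe: bool,
) -> Result<Vec<Option<&'a str>>, Utf8Error> {
    let mut out = Vec::with_capacity(values.len());
    for value in values.iter().copied() {
        match value.map(from_utf8) {
            // NULL input stays NULL.
            None => out.push(None),
            // Valid UTF-8 is kept as-is.
            Some(Ok(text)) => out.push(Some(text)),
            // Invalid UTF-8 becomes NULL in safe mode...
            Some(Err(_)) if safe => out.push(None),
            // ...and aborts the whole cast otherwise.
            Some(Err(error)) => return Err(error),
        }
    }
    Ok(out)
}
```

The trade-off is that a loop like this always builds a fresh output, even when every value is valid, whereas the approach in the PR keeps the to_string_view() path (described above as fast and zero-copy) for the all-valid case and pays for a second pass only when invalid data is actually present.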

alamb commented Sep 24, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing binaryview-8403 (e535998) to 13fb041 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=binaryview-8403
Results will be posted here when complete

alamb commented Sep 24, 2025

🤖: Benchmark completed

Details

group                                                              binaryview-8403                        main
-----                                                              ---------------                        ----
cast binary view to string                                         1.01     75.5±0.20µs        ? ?/sec    1.00     74.5±9.15µs        ? ?/sec
cast binary view to string view                                    1.00    104.8±0.26µs        ? ?/sec    1.00    105.1±0.49µs        ? ?/sec
cast binary view to wide string                                    1.00     69.5±0.30µs        ? ?/sec    1.01     69.9±1.14µs        ? ?/sec
cast date32 to date64 512                                          1.05    303.3±0.44ns        ? ?/sec    1.00    289.7±0.23ns        ? ?/sec
cast date64 to date32 512                                          1.02    507.1±0.70ns        ? ?/sec    1.00    498.4±0.52ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    614.1±0.70ns        ? ?/sec    1.00    608.4±2.08ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.2±0.01µs        ? ?/sec    1.01      5.3±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.7±0.02µs        ? ?/sec    1.00      6.7±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     73.9±0.09ns        ? ?/sec    1.03     75.7±0.09ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.6±0.00µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     47.5±0.06µs        ? ?/sec    1.00     47.7±0.08µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.4±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     73.8±0.70ns        ? ?/sec    1.04     77.1±0.43ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.2±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      3.1±0.01µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.00    322.9±0.34ns        ? ?/sec    1.01    325.0±0.38ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.04      3.6±0.01µs        ? ?/sec    1.00      3.5±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    387.5±0.49ns        ? ?/sec    1.00    385.5±1.30ns        ? ?/sec
cast dict to string view                                           1.00     49.6±0.14µs        ? ?/sec    1.03     50.9±0.08µs        ? ?/sec
cast f32 to string 512                                             1.00     18.0±0.04µs        ? ?/sec    1.07     19.3±0.04µs        ? ?/sec
cast f64 to string 512                                             1.01     22.0±0.03µs        ? ?/sec    1.00     21.8±0.10µs        ? ?/sec
cast float32 to int32 512                                          1.00   1537.3±2.43ns        ? ?/sec    1.01   1553.6±1.83ns        ? ?/sec
cast float64 to float32 512                                        1.02   1101.2±1.24ns        ? ?/sec    1.00   1081.3±1.98ns        ? ?/sec
cast float64 to uint64 512                                         1.02   1784.1±1.94ns        ? ?/sec    1.00   1757.7±3.30ns        ? ?/sec
cast i64 to string 512                                             1.00     14.6±0.04µs        ? ?/sec    1.00     14.5±0.03µs        ? ?/sec
cast int32 to float32 512                                          1.00   1057.4±1.43ns        ? ?/sec    1.00   1057.5±1.30ns        ? ?/sec
cast int32 to float64 512                                          1.01   1071.2±1.39ns        ? ?/sec    1.00   1055.9±1.31ns        ? ?/sec
cast int32 to int32 512                                            1.00    200.5±0.57ns        ? ?/sec    1.01    202.0±0.28ns        ? ?/sec
cast int32 to int64 512                                            1.06   1150.5±1.75ns        ? ?/sec    1.00   1086.7±2.52ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1507.7±5.27ns        ? ?/sec    1.00   1513.1±4.62ns        ? ?/sec
cast int64 to int32 512                                            1.01   1553.7±2.01ns        ? ?/sec    1.00   1536.3±2.13ns        ? ?/sec
cast string to binary view 512                                     1.00      3.2±0.01µs        ? ?/sec    1.01      3.2±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     96.9±0.11ns        ? ?/sec    1.01     97.8±0.17ns        ? ?/sec
cast string view to dict                                           1.04    178.5±0.36µs        ? ?/sec    1.00    171.0±0.27µs        ? ?/sec
cast string view to string                                         1.00     48.9±0.15µs        ? ?/sec    1.03     50.3±0.19µs        ? ?/sec
cast string view to wide string                                    1.00     48.0±0.14µs        ? ?/sec    1.01     48.6±0.22µs        ? ?/sec
cast time32s to time32ms 512                                       1.14    287.5±0.29ns        ? ?/sec    1.00    251.3±0.60ns        ? ?/sec
cast time32s to time64us 512                                       1.03    297.9±0.46ns        ? ?/sec    1.00    289.5±1.13ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    507.1±0.46ns        ? ?/sec    1.00    500.1±0.51ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    431.7±0.80ns        ? ?/sec    1.02    440.9±0.67ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00      2.2±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    198.3±0.28ns        ? ?/sec    1.02    203.1±0.34ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.01µs        ? ?/sec    1.01     11.4±0.01µs        ? ?/sec
cast utf8 to date64 512                                            1.01     45.2±0.06µs        ? ?/sec    1.00     44.5±0.06µs        ? ?/sec
cast utf8 to f32                                                   1.06     12.0±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.4±0.01µs        ? ?/sec    1.00      5.4±0.01µs        ? ?/sec
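A note on reading the table, assuming it follows the usual critcmp-style layout: the number at the start of each column appears to be that branch's time divided by the faster of the two times, so 1.00 marks the faster side. The small sketch below reproduces that arithmetic; relative_to_fastest is a hypothetical helper, and because the printed times are rounded, recomputing from them may differ in the last digit for some rows.

```rust
/// Returns (branch_ratio, main_ratio): each time divided by the smaller one,
/// so the faster side comes out as 1.0.
fn relative_to_fastest(branch: f64, main: f64) -> (f64, f64) {
    let fastest = branch.min(main);
    (branch / fastest, main / fastest)
}
```

For "cast time32s to time32ms 512" (287.5 ns on the branch, 251.3 ns on main) this gives 1.14 and 1.00, matching the row. For the kernel this PR touches, "cast binary view to string view", both columns show 1.00 (104.8 µs against 105.1 µs).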


@alamb alamb left a comment

Thanks @kosiew

kosiew commented Sep 25, 2025

@alamb
Thanks for your review and feedback

@alamb alamb merged commit 3adccb9 into apache:main Sep 25, 2025
26 checks passed
Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binaryview Utf8 Cast Issue

2 participants