-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Respect CastOptions.safe when casting BinaryView → Utf8View (return null for invalid UTF‑8)
#8415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kosiew -- this is a nice improvement in my mind. I kicked off some benchmarks and as long as they don't show any performance difference (I don't expect that they will) I think this PR is ready to go
|
|
||
| match array.clone().to_string_view() { | ||
| Ok(result) => Ok(Arc::new(result)), | ||
| Err(error) => match cast_options.safe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to avoid the conversion twice if there is non utf8 data, but the nice thing about the current implementation is that I don't think it will regress performance: it only uses the slower path if we know for sure there is non utf8 data.
|
🤖 |
|
🤖: Benchmark completed Details
|
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
… improved safe handling" This reverts commit 34fd0f1.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kosiew
|
@alamb |
Which issue does this PR close?
Closes #8403.
Rationale for this change
Casting from
BinaryViewtoUtf8Viewcurrently attempts a direct conversion usingto_string_view()which returns an error if any value contains invalid UTF‑8. This behavior is inconsistent with other binary array types in Arrow, which honorCastOptions.safe = trueby replacing invalid UTF‑8 sequences withNULLvalues rather than failing the entire cast operation.This PR makes
BinaryView's casting behavior consistent with other binary types and with user expectations: whenCastOptions.safeistrue, invalid UTF‑8 bytes are replaced byNULLin the resultingStringViewArray; whenCastOptions.safeisfalse, the cast retains the existing failure behavior.What changes are included in this PR?
Change
cast_with_optionsto delegate theBinaryView -> Utf8Viewbranch to a new helper functioncast_binary_view_to_string_view(array, cast_options)instead of directly callingto_string_view()and erroring.Add
extend_valid_utf8helper to centralize the logic of mappingOption<&[u8]>toOption<&str>(usingstd::str::from_utf8(...).ok()), and reuse it for bothGenericStringBuilderandStringViewBuilderflows.Implement
cast_binary_view_to_string_viewwhich:Attempts
array.clone().to_string_view()(fast, zero-copy path) and returns it whenOk.On
Err, checkscast_options.safe:true, builds aStringViewArrayby filtering invalid UTF‑8 toNULLusingextend_valid_utf8and returns that array.false, propagates the original error (existing behavior).Add a unit test
test_binary_view_to_string_view_with_invalid_utf8covering bothsafe=false(expect error) andsafe=true(expectNULLwhere invalid UTF‑8 occurred).Files changed (high level):
arrow-cast/src/cast/mod.rs: routeBinaryView -> Utf8Viewcase to the new helper.arrow-cast/src/cast/string.rs: addextend_valid_utf8andcast_binary_view_to_string_view, and useextend_valid_utf8from an existing cast path.Are there any user-facing changes?
Yes — this changes the observable behavior of casting
BinaryViewtoUtf8View:CastOptions.safe = true(the safe mode), invalid UTF‑8 inBinaryViewelements will be converted toNULLin the resultingUtf8Viewarray instead of causing the entire cast to fail.CastOptions.safe = false, an invalid UTF‑8 still causes the cast to fail as before.This is a bug fix aligning
BinaryViewwith the semantics of other binary types and with documented expectations forCastOptions.safe.No public API surface is changed beyond the fixed behavior; the new helpers are crate-private.