Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion datafusion/expr-common/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1589,9 +1589,18 @@ fn ree_coercion(
/// This is a union of string coercion rules and specified rules:
/// 1. At least one side of lhs and rhs should be string type (Utf8 / LargeUtf8)
/// 2. Data type of the other side should be able to cast to string type
/// 3. Binary and string types cannot be mixed
fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change, Binary || Utf8 returned a Utf8 string, and Binary || Binary also returned Utf8. Now the first errors out, and the second returns Binary, wouldn't this be a breaking change for users who are hitting this in a query?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth noting, as we decided to ban mixing according to this analysis.

I'll add api change label and put an upgrade notice shortly.

I intend to put a separate PR to harmonise concat UDF behaviour (it allows mixing) with the pipe operator's behaviour to avoid confusion

use arrow::datatypes::DataType::*;
string_coercion(lhs_type, rhs_type).or_else(|| match (lhs_type, rhs_type) {
// Allow pure binary + binary
(Binary, Binary) => Some(Binary),
(LargeBinary, LargeBinary) => Some(LargeBinary),
(BinaryView, BinaryView) => Some(BinaryView),
// Deny mixed binary + string
(Binary | LargeBinary | BinaryView, Utf8 | LargeUtf8 | Utf8View) => None,
(Utf8 | LargeUtf8 | Utf8View, Binary | LargeBinary | BinaryView) => None,
// Predicate-based coercion rules are following
(Utf8View, from_type) | (from_type, Utf8View) => {
string_concat_internal_coercion(from_type, &Utf8View)
}
Expand All @@ -1604,7 +1613,6 @@ fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<Da
(Dictionary(_, lhs_value_type), Dictionary(_, rhs_value_type)) => {
string_coercion(lhs_value_type, rhs_value_type).or(None)
}
(Binary, Binary) => Some(Utf8),
_ => None,
})
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -897,3 +897,74 @@ fn test_binary_comparison_string_numeric_coercion() -> Result<()> {
}
Ok(())
}

#[test]
fn test_string_concat_coercion() -> Result<()> {
// Binary
test_coercion_binary_rule!(
DataType::Binary,
DataType::Binary,
Operator::StringConcat,
DataType::Binary
);
test_coercion_binary_rule!(
DataType::LargeBinary,
DataType::LargeBinary,
Operator::StringConcat,
DataType::LargeBinary
);
test_coercion_binary_rule!(
DataType::BinaryView,
DataType::BinaryView,
Operator::StringConcat,
DataType::BinaryView
);

// String
test_coercion_binary_rule!(
DataType::Utf8,
DataType::Utf8,
Operator::StringConcat,
DataType::Utf8
);
test_coercion_binary_rule!(
DataType::LargeUtf8,
DataType::LargeUtf8,
Operator::StringConcat,
DataType::LargeUtf8
);
test_coercion_binary_rule!(
DataType::Utf8View,
DataType::Utf8View,
Operator::StringConcat,
DataType::Utf8View
);

// Mixed string-binary
for dt in [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View] {
assert!(
BinaryTypeCoercer::new(&DataType::Binary, &Operator::StringConcat, &dt,)
.get_input_types()
.is_err(),
"{}",
dt
);
assert!(
BinaryTypeCoercer::new(&dt, &Operator::StringConcat, &DataType::Binary,)
.get_input_types()
.is_err(),
"{}",
dt
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only loops over string types paired with DataType::Binary. The coercion rule actually covers 9 combinations (Binary/LargeBinary/BinaryView × Utf8/LargeUtf8/Utf8View). Could you extend the loop to cover all of them? Something like a nested loop over both lists would do it. Right now LargeBinary || Utf8 and BinaryView || LargeUtf8 are not directly tested. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. By the way, it tests only coercion, the actual operation concat, as Jeffrey spotted for LargeBinary, is done via slt

}

// Mixed string-other
test_coercion_binary_rule!(
DataType::Utf8,
DataType::Timestamp(Second, None),
Operator::StringConcat,
DataType::Utf8
);

Ok(())
}
25 changes: 13 additions & 12 deletions datafusion/physical-expr/src/expressions/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ use kernels::{
bitwise_and_dyn, bitwise_and_dyn_scalar, bitwise_or_dyn, bitwise_or_dyn_scalar,
bitwise_shift_left_dyn, bitwise_shift_left_dyn_scalar, bitwise_shift_right_dyn,
bitwise_shift_right_dyn_scalar, bitwise_xor_dyn, bitwise_xor_dyn_scalar,
concat_elements_binary_array, concat_elements_binary_view_array,
concat_elements_utf8view, regex_match_dyn, regex_match_dyn_scalar,
};

Expand Down Expand Up @@ -928,18 +929,6 @@ fn pre_selection_scatter(
}

fn concat_elements(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef> {
if *left.data_type() == DataType::Binary && *right.data_type() == DataType::Binary {
// Cast Binary to Utf8 to validate UTF-8 encoding before concatenation
// Follow widespread approach of PostgreSQL, sqlite, DuckDB, Snowflake
// Spark does it in a different way by making a binary-to-binary concatenation
let left = cast(left.as_ref(), &DataType::Utf8)?;
let right = cast(right.as_ref(), &DataType::Utf8)?;
return Ok(Arc::new(concat_elements_utf8(
left.as_string::<i32>(),
right.as_string::<i32>(),
)?));
}

Ok(match left.data_type() {
DataType::Utf8 => Arc::new(concat_elements_utf8(
left.as_string::<i32>(),
Expand All @@ -953,6 +942,18 @@ fn concat_elements(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef> {
left.as_string_view(),
right.as_string_view(),
)?),
DataType::Binary => Arc::new(concat_elements_binary_array::<i32>(
left.as_binary(),
right.as_binary(),
)?),
DataType::LargeBinary => Arc::new(concat_elements_binary_array::<i64>(
left.as_binary(),
right.as_binary(),
)?),
DataType::BinaryView => Arc::new(concat_elements_binary_view_array(
left.as_binary_view(),
right.as_binary_view(),
)?),
other => {
return internal_err!(
"Data type {other:?} not supported for binary operation 'concat_elements' on string arrays"
Expand Down
95 changes: 92 additions & 3 deletions datafusion/physical-expr/src/expressions/binary/kernels.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
//! This module contains computation kernels that are specific to
//! datafusion and not (yet) targeted to port upstream to arrow
use arrow::array::*;
use arrow::buffer::NullBuffer;
use arrow::buffer::{MutableBuffer, NullBuffer};
use arrow::compute::kernels::bitwise::{
bitwise_and, bitwise_and_scalar, bitwise_or, bitwise_or_scalar, bitwise_shift_left,
bitwise_shift_left_scalar, bitwise_shift_right, bitwise_shift_right_scalar,
Expand Down Expand Up @@ -161,11 +161,11 @@ create_left_integral_dyn_scalar_kernel!(
bitwise_shift_left_scalar
);

/// Concatenates two `StringViewArray`s element-wise.
/// Concatenates two `StringViewArray`s element-wise.
/// If either element is `Null`, the result element is also `Null`.
///
/// # Errors
/// - Returns an error if the input arrays have different lengths.
/// - Returns an error if the input arrays have different lengths.
/// - Returns an error if any concatenated string exceeds `u32::MAX` (≈4 GB) in length.
pub fn concat_elements_utf8view(
left: &StringViewArray,
Expand Down Expand Up @@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
Ok(result.finish())
}

/// Concatenates two `GenericBinaryArray`s element-wise.
/// If either element is `Null`, the result element is also `Null`.
///
/// # Errors
/// - Returns an error if the input arrays have different lengths.
/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shoudl we move it to Panic section? also the kernel below concat_elements_binary_view_array returns an error on overflow while this one panics. Same overflow class, different behavior. Either pick one and apply it to both, or add a comment explaining why they differ.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went away

pub fn concat_elements_binary_array<T: OffsetSizeTrait>(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can follow what we do for strings and just use concat_element_binary

DataType::Utf8 => Arc::new(concat_elements_utf8(
left.as_string::<i32>(),
right.as_string::<i32>(),
)?),
DataType::LargeUtf8 => Arc::new(concat_elements_utf8(
left.as_string::<i64>(),
right.as_string::<i64>(),
)?),

https://docs.rs/arrow/latest/arrow/compute/kernels/concat_elements/fn.concat_element_binary.html

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I am reinventing the wheel. Changed to these implementations, leaving only view concats. Is there any reason we don't have StringViewArray/BinaryViewArray/GenericByteViewArray in arrow-string? PR if so?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason I think, just haven't added support for it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. If you don't mind, I will add it

left: &GenericBinaryArray<T>,
right: &GenericBinaryArray<T>,
) -> std::result::Result<GenericBinaryArray<T>, ArrowError> {
if left.len() != right.len() {
return Err(ArrowError::ComputeError(format!(
"Arrays must have the same length: {} != {}",
left.len(),
right.len()
)));
}
// data capacity is unknown, so pass zero
let mut result = GenericBinaryBuilder::<T>::with_capacity(left.len(), 0);

// Avoid reallocations by writing to a reused buffer (note we could be even
// more efficient by creating the view directly here and avoid the buffer
// but that would be more complex)
let mut buffer = MutableBuffer::new(0);

// Pre-compute combined null bitmap, so the per-row NULL check is more
// efficient
let nulls = NullBuffer::union(left.nulls(), right.nulls());

for i in 0..left.len() {
if nulls.as_ref().is_some_and(|n| n.is_null(i)) {
result.append_null();
} else {
let l = left.value(i);
let r = right.value(i);
buffer.clear();
buffer.extend_from_slice(l);
buffer.extend_from_slice(r);
// No try-version of append_value because it panics on overflow

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could improve the comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went away

result.append_value(&buffer);
}
}
Ok(result.finish())
}

/// Concatenates two `BinaryViewArray`s element-wise.
/// If either element is `Null`, the result element is also `Null`.
///
/// # Errors
/// - Returns an error if the input arrays have different lengths.
/// - Returns an error if any concatenated string exceeds `u32::MAX` in length.
pub fn concat_elements_binary_view_array(
left: &BinaryViewArray,
right: &BinaryViewArray,
) -> std::result::Result<BinaryViewArray, ArrowError> {
if left.len() != right.len() {
return Err(ArrowError::ComputeError(format!(
"Arrays must have the same length: {} != {}",
left.len(),
right.len()
)));
}
let mut result = BinaryViewBuilder::with_capacity(left.len());

// Avoid reallocations by writing to a reused buffer (note we could be even
// more efficient by creating the view directly here and avoid the buffer
// but that would be more complex)
let mut buffer = MutableBuffer::new(0);

// Pre-compute combined null bitmap, so the per-row NULL check is more
// efficient
let nulls = NullBuffer::union(left.nulls(), right.nulls());

for i in 0..left.len() {
if nulls.as_ref().is_some_and(|n| n.is_null(i)) {
result.append_null();
} else {
let l = left.value(i);
let r = right.value(i);
buffer.clear();
buffer.extend_from_slice(l);
buffer.extend_from_slice(r);
// No try-version of append_value
result.try_append_value(&buffer)?;
}
}
Ok(result.finish())
}

/// Invoke a compute kernel on a pair of binary data arrays with flags
macro_rules! regexp_is_match_flag {
($LEFT:expr, $RIGHT:expr, $ARRAYTYPE:ident, $NOT:expr, $FLAG:expr) => {{
Expand Down
17 changes: 17 additions & 0 deletions datafusion/sqllogictest/test_files/binary.slt
Original file line number Diff line number Diff line change
Expand Up @@ -321,3 +321,20 @@ query T
SELECT split_part(CAST(binary AS VARCHAR), 'o', 2) FROM t WHERE binary = X'466f6f';
----
(empty)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test with fixedsizebinary too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added support for this type as well

# Pipe concatenation of binaries always provides a binary
query ?
SELECT x'636166c3a9' || x'68656c6c6f';
----
636166c3a968656c6c6f

# Byte pipe operator is forbidden for mixed binary and text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about mixed binary?

1. query failed: DataFusion error: Error during planning: Cannot infer common string type for string concat operation Binary || LargeBinary
[SQL] SELECT x'636166c3a9' || arrow_cast(x'68656c6c6f', 'LargeBinary');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/binary.slt:331

Is this intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I slipped it, thanks!. Added unit and SLT tests

query error DataFusion error: Error during planning: Cannot infer common string type for string concat operation Binary || Utf8
SELECT x'c3a9' || 'hello';

query error DataFusion error: Error during planning: Cannot infer common string type for string concat operation Utf8 || LargeBinary
SELECT 'hello' || arrow_cast(arrow_cast('hello', 'Binary'), 'LargeBinary');

query error DataFusion error: Error during planning: Cannot infer common string type for string concat operation Utf8 || BinaryView
SELECT 'hello' || arrow_cast(arrow_cast('hello', 'Binary'), 'BinaryView');

18 changes: 3 additions & 15 deletions datafusion/sqllogictest/test_files/scalar.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1739,23 +1739,11 @@ SELECT 'a' || 42 || 23.3
----
a4223.3

# concat of binary and text provides a text output
query T
select arrow_cast('Café', 'Utf8') || arrow_cast('Foobar', 'Binary');
----
CaféFoobar

query T
select arrow_cast('Café', 'Binary') || arrow_cast('Foobar', 'Utf8');
----
CaféFoobar

# Concat of two binaries should cast arguments to text and produce a text output,
# following common behaviour of PostreSQL. However, Spark is providing binary
query T
# Concat operator of two binaries uses their binary representation without text at all
query ?
select arrow_cast('Café', 'Binary') || arrow_cast('Foobar', 'Binary');
----
CaféFoobar
436166c3a9466f6f626172


# test_not_expressions()
Expand Down
22 changes: 22 additions & 0 deletions datafusion/sqllogictest/test_files/spark/string/concat.slt
Original file line number Diff line number Diff line change
Expand Up @@ -135,3 +135,25 @@ query T
SELECT concat(x'636166c3', x'a968656c6c6f');
----
caféhello

# UDF concatenation for valid UTF-8 arguments
Comment thread
Jefffrey marked this conversation as resolved.
Outdated
query T
SELECT concat(x'636166c3a9', x'68656c6c6f');
----
caféhello

# concat UDF allows invalid UTF-8 arguments, so it allows a valid UTF-8 sequence after concatenation
query T
SELECT concat(x'c3', x'a9');
----
é

# concat UDF cannot form a valid UTF-8 sequence
query error Execution error: invalid UTF-8 in binary literal
SELECT concat(x'ff', x'af');

# Mixed binary and text provide actual UTF-8 sequence, not '\xc3a9' as in PostgreSQL
query T
SELECT concat(x'c3a9', 'hello');

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat(x'c3a9', 'hello') works and returns éhello, whereas x'c3a9' || 'hello' would errors, should we track this in an issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat is more capable as in #20787. As per the other comment, concat could be harmonised.

----
éhello
Loading