Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Coalesce casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion #10268

Merged
merged 43 commits into from
May 26, 2024
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
c79156f
remove casting for coalesce
jayzhan211 Apr 27, 2024
a36e6b2
add more test
jayzhan211 Apr 27, 2024
bf16c92
add more test
jayzhan211 Apr 27, 2024
407e3c7
crate only visibility
jayzhan211 Apr 27, 2024
03b9162
polish comment
jayzhan211 Apr 27, 2024
4abf29d
improve test
jayzhan211 Apr 27, 2024
4965e8d
backup
jayzhan211 Apr 28, 2024
81f0235
introduce new signautre for coalesce
jayzhan211 Apr 28, 2024
bae996c
cleanup
jayzhan211 Apr 28, 2024
ddf9b1c
cleanup
jayzhan211 Apr 28, 2024
c2799ea
ignore err msg
jayzhan211 Apr 28, 2024
2574896
fmt
jayzhan211 Apr 28, 2024
6a17e57
fix doc
jayzhan211 Apr 28, 2024
4cba8c5
cleanup
jayzhan211 Apr 28, 2024
f1cfb8d
add more test
jayzhan211 Apr 28, 2024
d2e83d3
switch to type_resolution coercion
jayzhan211 Apr 28, 2024
3a88ad7
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 Apr 29, 2024
03880a3
fix i64 and u64 case
jayzhan211 Apr 29, 2024
481f548
add more tests
jayzhan211 Apr 29, 2024
dfc4176
cleanup
jayzhan211 Apr 29, 2024
46a9060
add null case
jayzhan211 Apr 29, 2024
d656645
fmt
jayzhan211 Apr 29, 2024
5683447
fix
jayzhan211 Apr 29, 2024
b949fae
rename to type_union_resolution
jayzhan211 Apr 29, 2024
5aaeb5b
add comment
jayzhan211 Apr 29, 2024
a968c0e
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 May 1, 2024
cf679c5
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 May 2, 2024
15471ab
cleanup
jayzhan211 May 2, 2024
e5cc46b
fix test
jayzhan211 May 2, 2024
a810e85
add comment
jayzhan211 May 2, 2024
cb16cda
rm test
jayzhan211 May 3, 2024
53bedda
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 May 12, 2024
a37da2d
cleanup since rebase
jayzhan211 May 12, 2024
70239e0
add more test
jayzhan211 May 12, 2024
be116f8
add more test
jayzhan211 May 12, 2024
8f4e991
fix msg
jayzhan211 May 12, 2024
6a8fe6f
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 May 14, 2024
20e618e
Merge remote-tracking branch 'upstream/main' into fix-coelesce
jayzhan211 May 14, 2024
4153593
fmt
jayzhan211 May 14, 2024
030a519
rm pure_string_coercion
jayzhan211 May 14, 2024
5b797d5
rm duplicate
jayzhan211 May 14, 2024
b954479
change type in select.slt
jayzhan211 May 25, 2024
829b5a2
fix slt
jayzhan211 May 25, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 20 additions & 2 deletions datafusion/expr/src/signature.rs
Original file line number Diff line number Diff line change
Expand Up @@ -92,14 +92,22 @@ pub enum TypeSignature {
/// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])`
Variadic(Vec<DataType>),
/// One or more arguments of an arbitrary but equal type.
/// DataFusion attempts to coerce all argument types to match the first argument's type
/// DataFusion attempts to coerce all argument types to match to the common type with comparision coercion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please link to https://docs.rs/datafusion/latest/datafusion/logical_expr/type_coercion/binary/fn.comparison_coercion.html that explains (however limited) what comparison coercion is?

///
/// # Examples
/// Given types in signature should be coercible to the same final type.
/// A function such as `make_array` is `VariadicEqual`.
///
/// `make_array(i32, i64) -> make_array(i64, i64)`
VariadicEqual,

This comment was marked as outdated.

/// One or more arguments of an arbitrary but equal type or Null.
/// Non-comparison coercion is attempted to match the signatures.
///
/// Functions like `coalesce` is `VariadicEqual`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused about what "Non-comparison coercion" means in this situation. Specifically how comparison coercion and non comparision coercion differ 🤔

Does non comparison coercion mean "type union resolution" (aka type_union_resolution)?

Also, I think coalesce is now VariadicEqualOrNull (this says the opposite)

Suggested change
/// One or more arguments of an arbitrary but equal type or Null.
/// Non-comparison coercion is attempted to match the signatures.
///
/// Functions like `coalesce` is `VariadicEqual`.
/// One or more arguments of an arbitrary but equal type or Null.
/// Non-comparison coercion is attempted to match the signatures.
///
/// Functions like `coalesce` is `VariadicEqualOrNull`.

Copy link
Contributor Author

@jayzhan211 jayzhan211 May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Non-comparison coercion is type_union_resolution

Actually, I think type_union_resolution is what we really need for coercion. Current comparison_coercion may not exist in the long term because I think there are many workaround solutions included in it, and those are possible to pull out to their correct places. For example, List coercion in string_coercion or Dict coercion that compare value only could be handled in unwrap cast in comparison

Ideally, we could shape comparison_coercion to more like type_union_resolution. But, adjusting logic inside it is error-prone if the test is not covered.

At least I can tell the dict-coercion is different between these two now.

// TODO: Temporary Signature, to differentiate existing VariadicEqual.
// After we swtich `make_array` to VariadicEqualOrNull,
// we can reuse VariadicEqual.
VariadicEqualOrNull,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a similar signature but an exact args number for nullif and nvl.
Can name it as UniformEqual compare to VariadicEqual 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could do something like this (basically to flavor the type signature) 🤔

pub enum TypeSignature {
...
  /// Rather than the usual coercion rules, special type union rules are applied
  Union(Box<TypeSignature>)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea!

/// One or more arguments with arbitrary types
VariadicAny,
/// Fixed number of arguments of an arbitrary but equal type out of a list of valid types.
Expand Down Expand Up @@ -193,6 +201,9 @@ impl TypeSignature {
TypeSignature::VariadicEqual => {
vec!["CoercibleT, .., CoercibleT".to_string()]
}
TypeSignature::VariadicEqualOrNull => {
vec!["CoercibleT or NULL, .., CoercibleT or NULL".to_string()]
}
TypeSignature::VariadicAny => vec!["Any, .., Any".to_string()],
TypeSignature::OneOf(sigs) => {
sigs.iter().flat_map(|s| s.to_string_repr()).collect()
Expand Down Expand Up @@ -255,13 +266,20 @@ impl Signature {
volatility,
}
}
/// An arbitrary number of arguments of the same type.
/// One or more number of arguments to the same type.
pub fn variadic_equal(volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::VariadicEqual,
volatility,
}
}
/// One or more number of arguments of the same type.
pub fn variadic_equal_or_null(volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::VariadicEqualOrNull,
volatility,
}
}
/// An arbitrary number of arguments of any type.
pub fn variadic_any(volatility: Volatility) -> Self {
Self {
Expand Down
237 changes: 215 additions & 22 deletions datafusion/expr/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

//! Coercion rules for matching argument types for binary operators

use std::collections::HashSet;
use std::sync::Arc;

use crate::Operator;
Expand Down Expand Up @@ -289,15 +290,164 @@ fn bitwise_coercion(left_type: &DataType, right_type: &DataType) -> Option<DataT
}
}

#[derive(Debug, PartialEq, Eq, Hash, Clone)]
enum TypeCategory {
Array,
Boolean,
Numeric,
// String, well-defined type, but are considered as unknown type.
DateTime,
Composite,
Unknown,
NotSupported,
}

fn data_type_category(data_type: &DataType) -> TypeCategory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be something like

impl From<&DataType> for TypeCategory { 
...

And then you create a TypeCategory like

let category = TypeCategory::from(&type)

if data_type.is_numeric() {
return TypeCategory::Numeric;
}

if matches!(data_type, DataType::Boolean) {
return TypeCategory::Boolean;
}

if matches!(
data_type,
DataType::List(_) | DataType::FixedSizeList(_, _) | DataType::LargeList(_)
) {
return TypeCategory::Array;
}

// String literal is possible to cast to many other types like numeric or datetime,
// therefore, it is categorized as a unknown type
if matches!(
data_type,
DataType::Utf8 | DataType::LargeUtf8 | DataType::Null
) {
return TypeCategory::Unknown;
}

if matches!(
data_type,
DataType::Date32
| DataType::Date64
| DataType::Time32(_)
| DataType::Time64(_)
| DataType::Timestamp(_, _)
| DataType::Interval(_)
| DataType::Duration(_)
) {
return TypeCategory::DateTime;
}

if matches!(
data_type,
DataType::Dictionary(_, _) | DataType::Struct(_) | DataType::Union(_, _)
) {
return TypeCategory::Composite;
}

TypeCategory::NotSupported
}

/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of constructs including
/// CASE, ARRAY, VALUES, and the GREATEST and LEAST functions.
/// See <https://www.postgresql.org/docs/current/typeconv-union-case.html> for more information.
/// The actual rules follows the behavior of Postgres and DuckDB
pub fn type_resolution(data_types: &[DataType]) -> Option<DataType> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the biggest difference between comparison coercion is that we categorize types.

Copy link
Contributor

@erratic-pattern erratic-pattern Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc string is a bit confusing here because there are no lhs_type and rhs_type. I assume that case would be type_resolution(&[lhs_type, rhs_type])?

Also, maybe this function name could reflect that it's finding a union type to satisfy a set of input types, for example type_union, type_union_resolution, resolve_type_union. I think type_resolution is too generic.

if data_types.is_empty() {
return None;
}

// if all the data_types is the same return first one
if data_types.iter().all(|t| t == &data_types[0]) {
return Some(data_types[0].clone());
}

// if all the data_types are null, return string
if data_types.iter().all(|t| t == &DataType::Null) {
return Some(DataType::Utf8);
}

// Ignore Nulls, if any data_type category is not the same, return None
let data_types_category: Vec<TypeCategory> = data_types
.iter()
.filter(|&t| t != &DataType::Null)
.map(data_type_category)
.collect();

if data_types_category
.iter()
.any(|t| t == &TypeCategory::NotSupported)
{
return None;
}

// check if there is only one category excluding Unknown
let categories: HashSet<TypeCategory> = HashSet::from_iter(
data_types_category
.iter()
.filter(|&c| c != &TypeCategory::Unknown)
.cloned(),
);
if categories.len() > 1 {
return None;
}

// Ignore Nulls
let mut candidate_type: Option<DataType> = None;
for data_type in data_types.iter() {
if data_type == &DataType::Null {
continue;
}
if let Some(ref candidate_t) = candidate_type {
// Find candidate type that all the data types can be coerced to
// Follows the behavior of Postgres and DuckDB
// Coerced type may be different from the candidate and current data type
// For example,
// i64 and decimal(7, 2) are expect to get coerced type decimal(22, 2)
// numeric string ('1') and numeric (2) are expect to get coerced type numeric (1, 2)
if let Some(t) = type_resolution_coercion(data_type, candidate_t) {
candidate_type = Some(t);
} else {
return None;
}
} else {
candidate_type = Some(data_type.clone());
}
}

candidate_type
}

/// See [type_resolution] for more information.
fn type_resolution_coercion(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same point here about function naming. I think we should highlight that we're creating a union / unifying input types into a common result type

lhs_type: &DataType,
rhs_type: &DataType,
) -> Option<DataType> {
if lhs_type == rhs_type {
return Some(lhs_type.clone());
}

// numeric coercion is the same as comparison coercion, both find the narrowest type
// that can accommodate both types
binary_numeric_coercion(lhs_type, rhs_type)
.or_else(|| pure_string_coercion(lhs_type, rhs_type))
.or_else(|| numeric_string_coercion(lhs_type, rhs_type))
}

/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation
/// Unlike `coerced_from`, usually the coerced type is for comparison only.
/// For example, compare with Dictionary and Dictionary, only value type is what we care about
pub fn comparison_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
if lhs_type == rhs_type {
// same type => equality is possible
return Some(lhs_type.clone());
}
comparison_binary_numeric_coercion(lhs_type, rhs_type)
binary_numeric_coercion(lhs_type, rhs_type)
.or_else(|| dictionary_coercion(lhs_type, rhs_type, true))
.or_else(|| temporal_coercion(lhs_type, rhs_type))
.or_else(|| pure_string_coercion(lhs_type, rhs_type))
jayzhan211 marked this conversation as resolved.
Show resolved Hide resolved
.or_else(|| string_coercion(lhs_type, rhs_type))
.or_else(|| null_coercion(lhs_type, rhs_type))
.or_else(|| string_numeric_coercion(lhs_type, rhs_type))
Expand Down Expand Up @@ -359,9 +509,8 @@ fn string_temporal_coercion(
match_rule(lhs_type, rhs_type).or_else(|| match_rule(rhs_type, lhs_type))
}

/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation
/// where one both are numeric
pub(crate) fn comparison_binary_numeric_coercion(
/// Coerce `lhs_type` and `rhs_type` to a common type where both are numeric
pub(crate) fn binary_numeric_coercion(
lhs_type: &DataType,
rhs_type: &DataType,
) -> Option<DataType> {
Expand All @@ -375,27 +524,25 @@ pub(crate) fn comparison_binary_numeric_coercion(
return Some(lhs_type.clone());
}

if let Some(t) = decimal_coercion(lhs_type, rhs_type) {
return Some(t);
}

// these are ordered from most informative to least informative so
// that the coercion does not lose information via truncation
match (lhs_type, rhs_type) {
// Prefer decimal data type over floating point for comparison operation
(Decimal128(_, _), Decimal128(_, _)) => {
get_wider_decimal_type(lhs_type, rhs_type)
}
(Decimal128(_, _), _) => get_comparison_common_decimal_type(lhs_type, rhs_type),
(_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
(Decimal256(_, _), Decimal256(_, _)) => {
get_wider_decimal_type(lhs_type, rhs_type)
}
(Decimal256(_, _), _) => get_comparison_common_decimal_type(lhs_type, rhs_type),
(_, Decimal256(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
(Float64, _) | (_, Float64) => Some(Float64),
(_, Float32) | (Float32, _) => Some(Float32),
// The following match arms encode the following logic: Given the two
// integral types, we choose the narrowest possible integral type that
// accommodates all values of both types. Note that some information
// loss is inevitable when we have a signed type and a `UInt64`, in
// which case we use `Int64`;i.e. the widest signed integral type.

// TODO: For i64 and u64, we can use decimal or float64
// Postgres has no unsigned type :(
// DuckDB v.0.10.0 has double (double precision floating-point number (8 bytes))
// for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
(Int64, _)
| (_, Int64)
| (UInt64, Int8)
Expand Down Expand Up @@ -426,9 +573,31 @@ pub(crate) fn comparison_binary_numeric_coercion(
}
}

/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of
/// a comparison operation where one is a decimal
fn get_comparison_common_decimal_type(
/// Decimal coercion rules.
pub(crate) fn decimal_coercion(
lhs_type: &DataType,
rhs_type: &DataType,
) -> Option<DataType> {
use arrow::datatypes::DataType::*;

match (lhs_type, rhs_type) {
// Prefer decimal data type over floating point for comparison operation
(Decimal128(_, _), Decimal128(_, _)) => {
get_wider_decimal_type(lhs_type, rhs_type)
}
(Decimal128(_, _), _) => get_common_decimal_type(lhs_type, rhs_type),
(_, Decimal128(_, _)) => get_common_decimal_type(rhs_type, lhs_type),
(Decimal256(_, _), Decimal256(_, _)) => {
get_wider_decimal_type(lhs_type, rhs_type)
}
(Decimal256(_, _), _) => get_common_decimal_type(lhs_type, rhs_type),
(_, Decimal256(_, _)) => get_common_decimal_type(rhs_type, lhs_type),
(_, _) => None,
}
}

/// Coerce `lhs_type` and `rhs_type` to a common type.
fn get_common_decimal_type(
decimal_type: &DataType,
other_type: &DataType,
) -> Option<DataType> {
Expand Down Expand Up @@ -701,11 +870,12 @@ fn string_concat_internal_coercion(
/// to string type.
fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;

if let Some(t) = pure_string_coercion(lhs_type, rhs_type) {
return Some(t);
}

match (lhs_type, rhs_type) {
(Utf8, Utf8) => Some(Utf8),
(LargeUtf8, Utf8) => Some(LargeUtf8),
(Utf8, LargeUtf8) => Some(LargeUtf8),
(LargeUtf8, LargeUtf8) => Some(LargeUtf8),
// TODO: cast between array elements (#6558)
(List(_), List(_)) => Some(lhs_type.clone()),
(List(_), _) => Some(lhs_type.clone()),
Expand All @@ -714,6 +884,29 @@ fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType>
}
}

fn pure_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
(Utf8, Utf8) => Some(Utf8),
(LargeUtf8, Utf8) => Some(LargeUtf8),
(Utf8, LargeUtf8) => Some(LargeUtf8),
(LargeUtf8, LargeUtf8) => Some(LargeUtf8),
_ => None,
}
}

fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
(Utf8 | LargeUtf8, other_type) | (other_type, Utf8 | LargeUtf8)
if other_type.is_numeric() =>
{
Some(other_type.clone())
}
_ => None,
}
}

/// Coercion rules for binary (Binary/LargeBinary) to string (Utf8/LargeUtf8):
/// If one argument is binary and the other is a string then coerce to string
/// (e.g. for `like`)
Expand Down
Loading
Loading