Use coerced type in inlist expr planning #2794
|
After this change, For another failed test case cc @alamb |
Codecov Report
@@ Coverage Diff @@
## master #2794 +/- ##
==========================================
- Coverage 85.14% 85.11% -0.04%
==========================================
Files 273 273
Lines 48248 48243 -5
==========================================
- Hits 41079 41060 -19
- Misses 7169 7183 +14
Continue to review full report at Codecov.
|
|
|
||
| expressions::in_list(value_expr, list_exprs, negated, input_schema) | ||
| let (cast_expr, cast_list_exprs) = | ||
| in_list_cast(value_expr, list_exprs, input_schema)?; |
We should check data type casting for in_list during expression planning.
This is much better in my opinion. Thank you @viirya
@viirya I just went through the data type coercion code for InList.
When I wanted to support decimal in InList, I created PR #2764 and added a coercion rule for InList in the in_list.rs file.
It makes sense to me that you moved the coercion rule from in_list to there.
In my opinion, based on what I know from the codebases of other databases, the coercion should be done before generating the physical plan or physical expr.
We have discussed the coercion rule in #1356 (comment)
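To illustrate the idea of resolving one common type for the value and every list item before the physical expression is built, here is a minimal sketch. It is not the code in this PR: `DataType` is a toy enum standing in for arrow's type, and `comparison_coercion` and `in_list_common_type` are hypothetical helpers written only for this example.

```rust
// Toy stand-in for arrow's DataType; only the variants needed for the sketch.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
    Boolean,
}

// Hypothetical pairwise rule: the type in which both sides can be compared.
fn comparison_coercion(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    use DataType::*;
    match (lhs, rhs) {
        // Identical types need no coercion.
        (a, b) if a == b => Some(a.clone()),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        // Assumption mirroring the discussion here: string vs. numeric gives Utf8.
        (Utf8, Int64) | (Int64, Utf8) | (Utf8, Float64) | (Float64, Utf8) => Some(Utf8),
        // No rule applies, so planning would have to report an error.
        _ => None,
    }
}

// Fold the pairwise rule over the value type and all list item types.
// Returns None as soon as one pair has no common type.
fn in_list_common_type(value: &DataType, list: &[DataType]) -> Option<DataType> {
    list.iter()
        .try_fold(value.clone(), |acc, item| comparison_coercion(&acc, item))
}
```

With a common type in hand, the planner can wrap the value and each list item in a cast to that type, so the physical in_list expression only ever sees one type.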
| // TODO: Can't cast from list type to value type directly | ||
| // We should use the coercion rule to get the common data type |
Use type coercion rule now.
alamb
left a comment
Looks good to me @viirya -- thank you!
The only thing I am worried about is the change of the planner.rs case so it no longer checks c1 = c2 -- but as long as there is a good reason for that change this PR looks really nice to me
cc @Ted-Jiang and @liukun4515
| .or_else(|| null_coercion(lhs_type, rhs_type)) | ||
| } | ||
|
|
||
| fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { |
👍
I have some concerns about the coercion rule between string and number.
I checked how Spark handles this case:
spark-sql> desc t3;
c1 int
spark-sql> explain extended select * from t3 where c1 = cast(123.123 as string);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('c1 = cast(123.123 as string))
+- 'UnresolvedRelation [t3], [], false
== Analyzed Logical Plan ==
c1: int
Project [c1#186]
+- Filter (c1#186 = cast(cast(123.123 as string) as int))
+- SubqueryAlias spark_catalog.default.t3
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
== Physical Plan ==
*(1) Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- Scan hive default.t3 [c1#186], HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]
In the previous case, the result of coercion is Int.
I think we need to create an issue to track this.
@viirya @alamb
| .or_else(|| temporal_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| string_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| null_coercion(lhs_type, rhs_type)) | ||
| .or_else(|| string_numeric_coercion(lhs_type, rhs_type)) |
This makes a lot of sense -- thanks. 👍
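For readers unfamiliar with the `.or_else` chain in the diff above: each rule returns an `Option`, and the first rule that yields `Some` wins, so the order of the chain is the priority order. A minimal sketch of that pattern follows; the `Ty` enum and the three rule functions are simplified, hypothetical stand-ins, not the actual DataFusion rules.

```rust
// Toy type enum; hypothetical, only for illustrating the chaining pattern.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    Boolean,
    Null,
}

// Rule 1: identical types coerce to themselves.
fn same_type_rule(l: Ty, r: Ty) -> Option<Ty> {
    if l == r { Some(l) } else { None }
}

// Rule 2 (simplified assumption): NULL takes the type of the other side.
fn null_rule(l: Ty, r: Ty) -> Option<Ty> {
    match (l, r) {
        (Ty::Null, other) | (other, Ty::Null) => Some(other),
        _ => None,
    }
}

// Rule 3: string vs. numeric resolves to Utf8, as chosen in this PR.
fn string_numeric_rule(l: Ty, r: Ty) -> Option<Ty> {
    match (l, r) {
        (Ty::Utf8, Ty::Int64) | (Ty::Int64, Ty::Utf8) => Some(Ty::Utf8),
        _ => None,
    }
}

// Rules are tried in order; later rules run only if earlier ones returned None.
fn coerce(l: Ty, r: Ty) -> Option<Ty> {
    same_type_rule(l, r)
        .or_else(|| null_rule(l, r))
        .or_else(|| string_numeric_rule(l, r))
}
```

Placing the string/numeric rule last means it only acts as a fallback when no more specific rule has already produced a type.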
| input_schema, | ||
| execution_props, | ||
| ), | ||
| // TODO refactor the logic of coercion the data type |
cc @liukun4515
This PR resolves some of the InList TODOs for me.
It makes sense to me.
| col("c1").and(col("c1")), | ||
| // u8 AND u8 | ||
| col("c3").and(col("c3")), | ||
| // utf8 = u32 |
Why remove this case?
utf8 and u32 now have coerced type (utf8).
| .build()?; | ||
| let execution_plan = plan(&logical_plan).await?; | ||
| let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, CastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: 
Int64(18) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }, CastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8, cast_options: CastOptions { safe: false } }], negated: false, set: Some(InSet { set:"; | ||
| let expected = "expr: [(InListExpr { expr: Column { name: \"c1\", index: 0 }, list: [Literal { value: Utf8(\"a\") }, TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 }, TryCastExpr { expr: 
Literal { value: Int64(28) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }], negated: false, set: None }"; |
Here is this expression formatted for anyone else who is interested:
(InListExpr {
expr: Column { name: \"c1\", index: 0 },
list: [Literal { value: Utf8(\"a\") },
TryCastExpr { expr: Literal { value: Int64(1) },cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(3) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(4) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(5) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(6) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(7) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(8) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(9) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(10) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(11) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(12) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(13) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(14) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(15) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(16) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(17) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(18) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(19) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(20) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(21) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(22) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(23) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(24) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(25) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(26) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(27) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(28) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(29) }, cast_type: Utf8 },
TryCastExpr { expr: Literal { value: Int64(30) }, cast_type: Utf8 }],
negated: false,
set: None
}
Co-authored-by: Andrew Lamb <[email protected]>
liukun4515
left a comment
LGTM
|
|
||
| fn string_numeric_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
| use arrow::datatypes::DataType::*; | ||
| match (lhs_type, rhs_type) { |
I tested on commit 748b6a65a5fa801595fd80a3c7b2728be3c9cdaa (not this commit):
explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment |
| | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) |
| | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | FilterExec: p_partkey@0 IN ([Literal { value: Int64(1) }, Literal { value: Int64(2) }, CastExpr { expr: Literal { value: Utf8("3") }, cast_type: Int64, cast_options: CastOptions { safe: false } }]) |
| | RepartitionExec: partitioning=RoundRobinBatch(16) |
| | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here the list types int, int, utf8 are cast to int, int, int.
In my opinion, after applying this patch we will get int, int, utf8 cast to utf8, utf8, utf8. I think when list_values_size is large, we will construct a hash set (see https://github.com/apache/arrow-datafusion/pull/2156), and changing to int will give better performance when building the hash set. Am I right? 😄
Tested with this patch:
explain select * from part where p_partkey in (1, 2, '3');
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #part.p_name, #part.p_mfgr, #part.p_brand, #part.p_type, #part.p_size, #part.p_container, #part.p_retailprice, #part.p_comment |
| | Filter: #part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")]) |
| | TableScan: part projection=Some([p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment]), partial_filters=[#part.p_partkey IN ([Int64(1), Int64(2), Utf8("3")])] |
| physical_plan | ProjectionExec: expr=[p_partkey@0 as p_partkey, p_name@1 as p_name, p_mfgr@2 as p_mfgr, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size, p_container@6 as p_container, p_retailprice@7 as p_retailprice, p_comment@8 as p_comment] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | FilterExec: CAST(p_partkey@0 AS Utf8) IN ([TryCastExpr { expr: Literal { value: Int64(1) }, cast_type: Utf8 }, TryCastExpr { expr: Literal { value: Int64(2) }, cast_type: Utf8 }, Literal { value: Utf8("3") }]) |
| | RepartitionExec: partitioning=RoundRobinBatch(16) |
| | ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/part/part-00000-3a3c2777-00d3-4c27-b917-4ff2145123dc-c000.snappy.parquet], projection=[p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment] |
| | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I guess you are basically raising the same point as #2794 (comment), right?
Actually, in its first version the string_numeric_coercion rule coerced Utf8 and LargeUtf8 to the numeric type. But a few test cases in sql::expr::test_in_list_scalar failed. For example,
test_expression!("'2' IN ('a','b',1)", "false");
Because 'a' and 'b' cannot be converted to int, they become null. So the result of this in_list expression would be null instead of false. There are other similar cases.
So I changed the coercion rule to use Utf8 and LargeUtf8, to fit better with the existing logic without changing too much of the existing behavior.
I'm fine if we can reach a consensus on whether the numeric type is more correct for such cases. Then I can change the test cases and the coercion rule.
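The null result described above follows from SQL's three-valued logic for IN: if nothing matches but the list contains a NULL, the answer is NULL rather than false. A minimal sketch, assuming a try-cast that turns unparsable strings into NULL; `try_cast_to_int` and `sql_in` are hypothetical helpers for this example, not DataFusion functions.

```rust
// Try-cast: a string that is not a valid integer becomes NULL (None).
fn try_cast_to_int(s: &str) -> Option<i64> {
    s.parse::<i64>().ok()
}

// SQL-style IN over nullable integers:
//   Some(true)  if any item equals the value,
//   None (NULL) if the value is NULL, or nothing matched but a NULL item was seen,
//   Some(false) otherwise.
fn sql_in(value: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    let v = value?;
    let mut saw_null = false;
    for item in list {
        match item {
            Some(x) if *x == v => return Some(true),
            Some(_) => {}
            None => saw_null = true,
        }
    }
    if saw_null { None } else { Some(false) }
}
```

Applied to '2' IN ('a','b',1) under numeric coercion, 'a' and 'b' both try-cast to NULL and 1 does not match 2, so `sql_in` yields `None`. Comparing the same values as strings finds no match and no NULL, which gives false.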
As this more or less follows the current behavior, I can address it in the other issue #2799.
In reply to the question above: "I think when list_values_size is large, we will construct a hash set in https://github.com/apache/arrow-datafusion/pull/2156; changing to int will give better performance when building the hash set. Am I right?"
Yes, the performance is better when comparing integers.
Right now the coercion rules are unstable; we still have much work to do on them.
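To make the performance point concrete without claiming any numbers: when the list is coerced to utf8, both building the set and probing it involve string allocation and string hashing, whereas an integer set works on the raw values. A minimal sketch using only the standard library; the four helper functions are hypothetical and nothing here is measured.

```rust
use std::collections::HashSet;

// Integer path: the set stores the list values directly.
fn build_int_set(list: &[i64]) -> HashSet<i64> {
    list.iter().copied().collect()
}

// Utf8 path: every list value is first formatted into an owned String.
fn build_utf8_set(list: &[i64]) -> HashSet<String> {
    list.iter().map(|v| v.to_string()).collect()
}

// Probing the integer set hashes a fixed-size value.
fn probe_int(set: &HashSet<i64>, key: i64) -> bool {
    set.contains(&key)
}

// Probing the utf8 set must cast the column value to a string for each row.
fn probe_utf8(set: &HashSet<String>, key: i64) -> bool {
    set.contains(key.to_string().as_str())
}
```

Both paths give the same answers for integer keys; the difference is only in the extra formatting and hashing work on the utf8 side.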
I'm fine if we can reach a consensus on whether the numeric type is more correct for such cases. Then I can change the test cases and the coercion rule.
I think what is important here is to have a consistent set of semantics. I don't have any particular preference regarding the automatic coercion of int, int, utf8, as I think there are different tradeoffs.
For example, coercing int, int, utf8 to utf8, utf8, utf8 would allow a predicate like c1 IN (1, 2, 'foo') to run without error (assuming c1 can be coerced to utf8), whereas automatically coercing to int, int, int would result in a runtime error. However, as @Ted-Jiang notes, the performance will be slower for predicates like c1 IN (1, 2, '3').
Best practice would be to explicitly cast all values in the query to the intended integer type if they are supposed to be compared as integers:
c1 in (1::smallint, 2::smallint, '3'::smallint)
but I realize that may not be practical for all users. 🤔
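The tradeoff can be sketched as two evaluation strategies for the same predicate. This is an illustration only: `in_list_as_utf8` and `in_list_as_int` are hypothetical functions, and the strict variant models a cast that fails at runtime rather than producing NULL.

```rust
use std::num::ParseIntError;

// Coerce to utf8: the column value is formatted as a string and compared
// textually, so a non-numeric item such as "foo" is simply a non-match.
fn in_list_as_utf8(value: i64, list: &[&str]) -> bool {
    let v = value.to_string();
    list.iter().any(|item| *item == v.as_str())
}

// Coerce to int with a strict cast: every list item must parse, otherwise
// the whole predicate fails, like a runtime cast error.
fn in_list_as_int(value: i64, list: &[&str]) -> Result<bool, ParseIntError> {
    let ints = list
        .iter()
        .map(|item| item.parse::<i64>())
        .collect::<Result<Vec<i64>, ParseIntError>>()?;
    Ok(ints.contains(&value))
}
```

For the list (1, 2, 'foo'), the utf8 strategy returns a plain boolean while the strict integer strategy returns an error. For (1, 2, '3'), both agree, and the integer strategy avoids string comparison.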
Thanks @alamb @liukun4515 @Ted-Jiang
Which issue does this PR close?
Closes #2793
Closes #2787
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?