[SPARK-41982][SQL] Partitions of type string should not be treated as numeric types #39558

smallzhongfeng · 2023-01-13T19:09:34Z

What changes were proposed in this pull request?

Ensure that values of string-typed partition columns written without quotation marks are not parsed as numeric literals.
For example:

create table if not exists test_90(a string, b string) partitioned by (dt string);
desc formatted test_90;
insert into table test_90 partition (dt=05) values("1","2");
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;

In Spark 3.1 and earlier, both inserts write to a single path: hdfs://test5/user/hive/db1/test_90/dt=05

spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1       2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

After Spark 3.1, the same inserts generate two paths: hdfs://test5/user/hive/db1/test_90/dt=05 and hdfs://test5/user/hive/db1/test_90/dt=5

spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

This causes inconsistent reads. After seeing #30421, I think that if users are unaware of this change and the migration guide does not mention it, data quality will suffer. So I added the configuration spark.sql.legacy.keepPartitionSpecAsStringLiteral, which restores the original behavior when set to true.

Why are the changes needed?

If the partition column is of string type but the partition value is written without quotation marks, this configuration allows the value to still be treated as a string.

Does this PR introduce any user-facing change?

When spark.sql.legacy.keepPartitionSpecAsStringLiteral is enabled, partition (dt=05) and partition (dt='05') generate the same partition path.
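
A minimal sketch of how the flag might be used in a spark-sql session, reusing the test_90 table from the example above. The expected result noted in the comments is taken from this description and has not been re-run here.

```sql
-- Sketch only: enable the legacy behavior for the current session.
SET spark.sql.legacy.keepPartitionSpecAsStringLiteral=true;

create table if not exists test_90(a string, b string) partitioned by (dt string);

-- Unquoted and quoted partition values for the string column dt.
insert into table test_90 partition (dt=05) values("1","2");
insert into table test_90 partition (dt='05') values("1","2");

-- Per the description, both inserts should land in the single partition dt=05.
show partitions test_90;

drop table test_90;
```

With the flag left at its default, the description above indicates that the unquoted value is written to dt=5 instead.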

How was this patch tested?

New unit tests.

smallzhongfeng · 2023-01-14T07:45:57Z

cc @AngersZhuuuu @cloud-fan @maropu @wangyum @dongjoon-hyun @HyukjinKwon Hope to get your reply, thanks :-)

AmplabJenkins · 2023-01-15T10:59:39Z

Can one of the admins verify this patch?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala

smallzhongfeng · 2023-01-16T12:19:31Z

Could you help me review again? @cloud-fan

docs/sql-migration-guide.md

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala

smallzhongfeng · 2023-01-17T03:04:43Z

All comments have been addressed, PTAL, thanks @cloud-fan

cloud-fan · 2023-01-17T04:44:36Z

thanks, merging to master!

smallzhongfeng · 2023-01-17T06:16:07Z

> thanks, merging to master!

Thanks for your review! @cloud-fan

gatorsmile · 2023-02-27T04:10:15Z

@smallzhongfeng Could you help add it to the migration guide?

smallzhongfeng · 2023-02-27T09:41:24Z

Sure, but the outcome of the earlier discussion was that it does not need to be added; see #39558 (comment). I will add it if necessary. @gatorsmile

[SPARK-41982] Partitions of type string should not be converted to numeric types

8c8e281

github-actions bot added the SQL label Jan 13, 2023

HyukjinKwon changed the title ~~[SPARK-41982] Partitions of type string should not be treated as numeric types~~ [SPARK-41982][SQL] Partitions of type string should not be treated as numeric types Jan 13, 2023

fix sth

fdd89b8

smallzhongfeng added 2 commits January 14, 2023 19:31

nit

80ca487

retrigger checks

485f9d1