@smallzhongfeng
Contributor

@smallzhongfeng smallzhongfeng commented Jan 13, 2023

What changes were proposed in this pull request?

Ensure that unquoted partition values for partition columns of type string are not parsed as numeric literals.
For example:

create table if not exists test_90(a string, b string) partitioned by (dt string);
desc formatted test_90;
insert into table test_90 partition (dt=05) values("1","2");
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;

In Spark 3.1 and earlier, both inserts write to a single path: hdfs://test5/user/hive/db1/test_90/dt=05

spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1       2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

After Spark 3.1, the same statements generate two paths: hdfs://test5/user/hive/db1/test_90/dt=05 and hdfs://test5/user/hive/db1/test_90/dt=5

spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

This causes reads to return inconsistent data. After seeing #30421, I think that if users are not aware of this change and the migration document does not mention it, data quality will be affected. So I added the parameter spark.sql.legacy.keepPartitionSpecAsStringLiteral, which preserves the original behavior when it is set to true.

Why are the changes needed?

If the partition column is of type string but the partition value is written without quotation marks, the value can still be treated as a string by enabling this configuration.

Does this PR introduce any user-facing change?

After the parameter spark.sql.legacy.keepPartitionSpecAsStringLiteral is enabled, partition (dt=05) and partition (dt='05') generate the same partition path.
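As a minimal sketch of the intended usage, the example from the description can be rerun with the flag turned on. This assumes a spark-sql session on a build that includes this patch; the table name and values are reused from the example above, and the comments describe the behavior this PR claims, not captured output.

```sql
-- Assumption: spark-sql session on a build that contains this patch.
set spark.sql.legacy.keepPartitionSpecAsStringLiteral=true;

create table if not exists test_90(a string, b string) partitioned by (dt string);

-- With the flag enabled, the unquoted 05 is expected to be kept as the string '05',
-- so both inserts should target the same partition directory (dt=05).
insert into table test_90 partition (dt=05) values("1","2");
insert into table test_90 partition (dt='05') values("1","2");

-- Per the description, only one partition should be listed here.
show partitions test_90;

drop table test_90;
```

Leaving the flag at its default keeps the post-3.1 behavior, where the unquoted value is written to a separate dt=5 partition.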

How was this patch tested?

New unit tests.

@github-actions github-actions bot added the SQL label Jan 13, 2023
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-41982] Partitions of type string should not be treated as numeric types" to "[SPARK-41982][SQL] Partitions of type string should not be treated as numeric types" Jan 13, 2023
@smallzhongfeng
Contributor Author

smallzhongfeng commented Jan 14, 2023

cc @AngersZhuuuu @cloud-fan @maropu @wangyum @dongjoon-hyun @HyukjinKwon Hope to get your reply, thanks :-)

smallzhongfeng added 2 commits January 14, 2023 19:31
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions github-actions bot added the DOCS label Jan 16, 2023
@smallzhongfeng
Contributor Author

Could you help review this again? @cloud-fan

@github-actions github-actions bot removed the DOCS label Jan 16, 2023
@smallzhongfeng
Contributor Author

All comments have been addressed. PTAL, thanks @cloud-fan

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 97a6955 Jan 17, 2023
@smallzhongfeng
Contributor Author

> thanks, merging to master!

Thanks for your review! @cloud-fan

@gatorsmile
Member

@smallzhongfeng Could you help add it to the migration guide?

@smallzhongfeng
Contributor Author

Sure, but the previous discussion concluded that there is no need to add it; see #39558 (comment). I will add it if necessary. @gatorsmile
