-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for ORDER BY in DISTRIBUTE BY #29118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ORDER BY in DISTRIBUTE BY
|
Could you review this, @viirya ? This will protect us from the future regression. This part is tricky. |
|
Also, cc @cloud-fan , @HyukjinKwon , @maropu |
| comparePlans(optimized, correctAnswer) | ||
| } | ||
|
|
||
| test("SPARK-32318 should not remove orderBy in distribute statement") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: in most cases, add : in the prefix, SPARK-32318:?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh.. Right. I missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I know you just forgot it
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see...
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it is hard to know why we cannot remove orderBy in distribute statement from the test case.
I'm okay to add this. At least there is JIRA ticket we can track it.
|
Do you think it is easy to add a test that checks file size like in the description? Or current one is enough? |
|
Actually, the file size check test cases are very
I believe this one is enough because file generations cost us write/read/full execution time in Jenkins and GitHub~ |
|
okay, sounds good. |
|
@dongjoon-hyun Tiny nit: The 2nd example still has the inner ORDER BY clause. We meant to remove it , right ? |
|
Thanks, @dilipbiswal . You're also right. Initially, it was intentional. To be more clear, I will add a comment to refer SPARK-32318. Technically, if the user writes a code |
|
Thank you all. GitHub Action |
…ORDER BY in DISTRIBUTE BY
### What changes were proposed in this pull request?
This PR aims to add a test case to EliminateSortsSuite to protect a valid use case which is using ORDER BY in DISTRIBUTE BY statement.
### Why are the changes needed?
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")
$ ls -al /tmp/master/
total 56
drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:12 ./
drwxrwxrwt 15 root wheel 480 Jul 14 22:12 ../
-rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:12 ._SUCCESS.crc
-rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:12 _SUCCESS
-rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
```
The following was found during SPARK-32276. If Spark optimizer removes the inner `ORDER BY`, the file size increases.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")
$ ls -al /tmp/SPARK-32276/
total 632
drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:08 ./
drwxrwxrwt 14 root wheel 448 Jul 14 22:08 ../
-rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:08 ._SUCCESS.crc
-rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:08 _SUCCESS
-rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
```
### Does this PR introduce _any_ user-facing change?
No. This only improves the test coverage.
### How was this patch tested?
Pass the GitHub Action or Jenkins.
Closes #29118 from dongjoon-hyun/SPARK-32318.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8950dcb)
Signed-off-by: Dongjoon Hyun <[email protected]>
…ORDER BY in DISTRIBUTE BY
This PR aims to add a test case to EliminateSortsSuite to protect a valid use case which is using ORDER BY in DISTRIBUTE BY statement.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")
$ ls -al /tmp/master/
total 56
drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:12 ./
drwxrwxrwt 15 root wheel 480 Jul 14 22:12 ../
-rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:12 ._SUCCESS.crc
-rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:12 _SUCCESS
-rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
```
The following was found during SPARK-32276. If Spark optimizer removes the inner `ORDER BY`, the file size increases.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")
$ ls -al /tmp/SPARK-32276/
total 632
drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:08 ./
drwxrwxrwt 14 root wheel 448 Jul 14 22:08 ../
-rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:08 ._SUCCESS.crc
-rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:08 _SUCCESS
-rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r-- 1 dongjoon wheel 150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
```
No. This only improves the test coverage.
Pass the GitHub Action or Jenkins.
Closes #29118 from dongjoon-hyun/SPARK-32318.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8950dcb)
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR aims to add a test case to EliminateSortsSuite to protect a valid use case which is using ORDER BY in DISTRIBUTE BY statement.
Why are the changes needed?
The following was found during SPARK-32276. If Spark optimizer removes the inner
ORDER BY, the file size increases.Does this PR introduce any user-facing change?
No. This only improves the test coverage.
How was this patch tested?
Pass the GitHub Action or Jenkins.