Skip to content

Conversation

@bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

Change LimitPushDownThroughWindow so that it no longer supports pushing down a limit through a window using ntile.

Why are the changes needed?

In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268

Example

Assume this data:

create table t1 stored as parquet as
select *
from range(101);

Also assume this query:

select id, ntile(10) over (order by id) as nt
from t1
limit 10;

With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function:

+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |2  |
|2  |3  |
|3  |4  |
|4  |5  |
|5  |6  |
|6  |7  |
|7  |8  |
|8  |9  |
|9  |10 |
+---+---+

With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied after ntile.

Spark 3.1.3:

+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |1  |
|2  |1  |
|3  |1  |
|4  |1  |
|5  |1  |
|6  |1  |
|7  |1  |
|8  |1  |
|9  |1  |
+---+---+

Hive 2.3.9:

+-----+-----+
| id  | nt  |
+-----+-----+
| 0   | 1   |
| 1   | 1   |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6   | 1   |
| 7   | 1   |
| 8   | 1   |
| 9   | 1   |
+-----+-----+
10 rows selected (1.72 seconds)

Prestodb 0.268:

 id | nt 
----+----
  0 |  1 
  1 |  1 
  2 |  1 
  3 |  1 
  4 |  1 
  5 |  1 
  6 |  1 
  7 |  1 
  8 |  1 
  9 |  1 
(10 rows)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Two new unit tests.

@github-actions github-actions bot added the SQL label Aug 8, 2022
@bersprockets
Copy link
Contributor Author

@wangyum

I should have fixed this when I was fixing the same issue with percent_rank (SPARK-38614), but I missed it.

@HyukjinKwon
Copy link
Member

Merged to master, branch-3.3, and branch-3.2.

HyukjinKwon pushed a commit that referenced this pull request Aug 9, 2022
### What changes were proposed in this pull request?

Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile.

### Why are the changes needed?

In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268

#### Example

Assume this data:
```
create table t1 stored as parquet as
select *
from range(101);
```
Also assume this query:
```
select id, ntile(10) over (order by id) as nt
from t1
limit 10;
```
With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |2  |
|2  |3  |
|3  |4  |
|4  |5  |
|5  |6  |
|6  |7  |
|7  |8  |
|8  |9  |
|9  |10 |
+---+---+
```
With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile.

Spark 3.1.3:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |1  |
|2  |1  |
|3  |1  |
|4  |1  |
|5  |1  |
|6  |1  |
|7  |1  |
|8  |1  |
|9  |1  |
+---+---+
```
Hive 2.3.9:
```
+-----+-----+
| id  | nt  |
+-----+-----+
| 0   | 1   |
| 1   | 1   |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6   | 1   |
| 7   | 1   |
| 8   | 1   |
| 9   | 1   |
+-----+-----+
10 rows selected (1.72 seconds)
```
Prestodb 0.268:
```
 id | nt
----+----
  0 |  1
  1 |  1
  2 |  1
  3 |  1
  4 |  1
  5 |  1
  6 |  1
  7 |  1
  8 |  1
  9 |  1
(10 rows)

```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Two new unit tests.

Closes #37443 from bersprockets/pushdown_ntile.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c9156e5)
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 9, 2022
### What changes were proposed in this pull request?

Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile.

### Why are the changes needed?

In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268

#### Example

Assume this data:
```
create table t1 stored as parquet as
select *
from range(101);
```
Also assume this query:
```
select id, ntile(10) over (order by id) as nt
from t1
limit 10;
```
With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |2  |
|2  |3  |
|3  |4  |
|4  |5  |
|5  |6  |
|6  |7  |
|7  |8  |
|8  |9  |
|9  |10 |
+---+---+
```
With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile.

Spark 3.1.3:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |1  |
|2  |1  |
|3  |1  |
|4  |1  |
|5  |1  |
|6  |1  |
|7  |1  |
|8  |1  |
|9  |1  |
+---+---+
```
Hive 2.3.9:
```
+-----+-----+
| id  | nt  |
+-----+-----+
| 0   | 1   |
| 1   | 1   |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6   | 1   |
| 7   | 1   |
| 8   | 1   |
| 9   | 1   |
+-----+-----+
10 rows selected (1.72 seconds)
```
Prestodb 0.268:
```
 id | nt
----+----
  0 |  1
  1 |  1
  2 |  1
  3 |  1
  4 |  1
  5 |  1
  6 |  1
  7 |  1
  8 |  1
  9 |  1
(10 rows)

```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Two new unit tests.

Closes #37443 from bersprockets/pushdown_ntile.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c9156e5)
Signed-off-by: Hyukjin Kwon <[email protected]>
Copy link
Member

@wangyum wangyum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late LGTM.

@bersprockets bersprockets deleted the pushdown_ntile branch August 29, 2022 15:23
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?

Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile.

### Why are the changes needed?

In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268

#### Example

Assume this data:
```
create table t1 stored as parquet as
select *
from range(101);
```
Also assume this query:
```
select id, ntile(10) over (order by id) as nt
from t1
limit 10;
```
With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |2  |
|2  |3  |
|3  |4  |
|4  |5  |
|5  |6  |
|6  |7  |
|7  |8  |
|8  |9  |
|9  |10 |
+---+---+
```
With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile.

Spark 3.1.3:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |1  |
|2  |1  |
|3  |1  |
|4  |1  |
|5  |1  |
|6  |1  |
|7  |1  |
|8  |1  |
|9  |1  |
+---+---+
```
Hive 2.3.9:
```
+-----+-----+
| id  | nt  |
+-----+-----+
| 0   | 1   |
| 1   | 1   |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6   | 1   |
| 7   | 1   |
| 8   | 1   |
| 9   | 1   |
+-----+-----+
10 rows selected (1.72 seconds)
```
Prestodb 0.268:
```
 id | nt
----+----
  0 |  1
  1 |  1
  2 |  1
  3 |  1
  4 |  1
  5 |  1
  6 |  1
  7 |  1
  8 |  1
  9 |  1
(10 rows)

```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Two new unit tests.

Closes apache#37443 from bersprockets/pushdown_ntile.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c9156e5)
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants