Skip to content

Conversation

@bersprockets
Copy link
Contributor

Backport #35368 to 3.1.

What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with order by and limit. For example:

create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;

This returns:

NULL
NULL
NULL
1
2
3

Does this PR introduce any user-facing change?

No, other than removing the correctness issue.

How was this patch tested?

New unit test.

…process output iterator

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

No, other than removing the correctness issue.

New unit test.

Closes apache#35368 from bersprockets/script_transformation_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@github-actions github-actions bot added the SQL label Jan 31, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @bersprockets .
Merged to branch-3.1.

dongjoon-hyun pushed a commit that referenced this pull request Feb 1, 2022
…c`'s process output iterator

Backport #35368 to 3.1.

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes #35375 from bersprockets/SPARK-38075_3.1.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Feb 15, 2022
…c`'s process output iterator

Backport apache#35368 to 3.1.

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes apache#35375 from bersprockets/SPARK-38075_3.1.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@bersprockets bersprockets deleted the SPARK-38075_3.1 branch August 10, 2022 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants