Conversation

@gatorsmile
Member

What changes were proposed in this pull request?

CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet

INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))

SELECT custom_fields.id, custom_fields.value FROM tab1

The above query always returns the last struct of the array, because the rule SimplifyCasts incorrectly rewrites the query. The underlying cause is that we always reuse the same GenericInternalRow object when doing the cast.
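
To illustrate the aliasing problem described above, here is a minimal sketch in plain Scala. It does not use Spark: `MutableRow` and `buggyCaster` are hypothetical stand-ins, and the sketch only assumes that the cast writes into one row object that is allocated once and returned on every call.

```scala
// Hypothetical stand-in for a mutable row; this is NOT Spark's GenericInternalRow.
final class MutableRow(val values: Array[Any]) {
  def update(i: Int, v: Any): Unit = { values(i) = v }
  override def toString: String = values.mkString("[", ", ", "]")
}

object SharedRowDemo {
  // Buggy pattern: the output row is allocated once and reused by every call.
  def buggyCaster(numFields: Int): Array[Any] => MutableRow = {
    val shared = new MutableRow(new Array[Any](numFields))
    in => {
      var i = 0
      while (i < numFields) {
        shared.update(i, in(i))
        i += 1
      }
      shared // every caller receives the same object
    }
  }

  def main(args: Array[String]): Unit = {
    val cast = buggyCaster(2)
    val structs = Seq(Array[Any](1L, "a"), Array[Any](2L, "b"))
    // Both elements of the result point at the shared row,
    // so both show the values written by the last call.
    val result = structs.map(cast)
    println(result.mkString(", ")) // prints "[2, b], [2, b]"
  }
}
```

Because each array element is cast by the same function, every slot in the resulting array ends up referring to one row, which matches the "always the last struct" symptom.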

How was this patch tested?

@gatorsmile gatorsmile changed the title from "Fix wrong results of insertion of Array of Struct" to "[SPARK-21203] [SQL] Fix wrong results of insertion of Array of Struct" on Jun 24, 2017
@gatorsmile
Member Author

gatorsmile commented Jun 24, 2017

cc @liancheng @cloud-fan @hvanhovell

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78556 has finished for PR 18412 at commit 3be3475.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, pending test

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78559 has started for PR 18412 at commit 6ad657c.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78562 has finished for PR 18412 at commit 6ad657c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

if (row.isNullAt(i)) null else castFuncs(i)(row.get(i, from.apply(i).dataType)))
i += 1
}
newRow.copy()
Contributor

@hvanhovell hvanhovell Jun 24, 2017

Isn't it better to just fix GenericInternalRow.copy? I think I broke it when I removed MutableRow (see #15333).

Contributor

Instead of keeping a single row instance and copying it every time, I think it makes more sense to create a new row each time.

Besides, I also had a PR to fix this: #15082 . Maybe I should reopen it...
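
The "create a different row every time" idea can be sketched as follows. This is plain Scala rather than Spark code: `FieldCast` and `castStruct` are hypothetical names, and a plain `Array[Any]` stands in for a row. The only point is that the output is allocated inside the cast function, so no copy is needed afterwards.

```scala
object FreshRowDemo {
  // Hypothetical per-field converter type (an assumption for this sketch).
  type FieldCast = Any => Any

  // Allocates a new output array on every call, so results never alias each other.
  def castStruct(fieldCasts: Array[FieldCast]): Array[Any] => Array[Any] = { in =>
    val out = new Array[Any](fieldCasts.length)
    var i = 0
    while (i < fieldCasts.length) {
      // Preserve nulls; otherwise apply the converter for field i.
      out(i) = if (in(i) == null) null else fieldCasts(i)(in(i))
      i += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val cast = castStruct(Array[FieldCast](
      v => v.asInstanceOf[Int].toLong, // Int -> Long
      v => v.toString                  // Any -> String
    ))
    val rows = Seq(Array[Any](1, "a"), Array[Any](2, "b")).map(cast)
    println(rows.map(_.mkString("[", ", ", "]")).mkString(", ")) // prints "[1, a], [2, b]"
  }
}
```

Compared with reusing one row and calling `copy()` on it, this costs the same single allocation per struct and avoids relying on the copy semantics at all.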

Contributor

@hvanhovell I found the semantics of the original GenericMutableRow.copy() pretty dangerous since it didn't properly implement deep copy semantics. In fact, it's impossible to do a proper deep copy there. We only copy the underlying field array but may still share field objects accidentally.

If we do want to "fix" GenericInternalRow.copy(), I'd prefer throwing an exception instead of following the old GenericMutableRow.copy() method.
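
A small sketch of the shallow-copy hazard mentioned above, in plain Scala. The `Row` class here is hypothetical and is not Spark's `GenericInternalRow` or `GenericMutableRow`; it only assumes a `copy()` that clones the top-level field array and nothing deeper.

```scala
object ShallowCopyDemo {
  // Hypothetical row type: copy() clones only the top-level field array.
  final class Row(val values: Array[Any]) {
    def copy(): Row = new Row(values.clone())
  }

  // Returns (field 1 of the original row, nested field as seen through the copy).
  def run(): (Any, Any) = {
    val nested = Array[Any](1L, "a") // a mutable field object
    val original = new Row(Array[Any](nested, "x"))
    val copied = original.copy()

    copied.values(1) = "y"  // top-level write: the original still holds "x"
    nested(1) = "changed"   // nested write: visible through both rows

    val nestedViaCopy = copied.values(0).asInstanceOf[Array[Any]]
    (original.values(1), nestedViaCopy(1))
  }

  def main(args: Array[String]): Unit = {
    println(run()) // prints "(x,changed)"
  }
}
```

The top-level arrays are independent after the copy, but the nested field object is still shared, which is the accidental sharing the comment warns about.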

Contributor

I do agree that there are problems with GenericInternalRow, and we should document the current shallow copy semantics. However, since the contract of InternalRow currently allows update, I do think we need to fix copy in order to make it less broken. The

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78568 has finished for PR 18412 at commit 6ad657c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 24, 2017
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet

INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))

SELECT custom_fields.id, custom_fields.value FROM tab1
```

The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always reuse the same `GenericInternalRow` object when doing the cast.

### How was this patch tested?

Author: gatorsmile

Closes #18412 from gatorsmile/castStruct.

(cherry picked from commit 2e1586f)
Signed-off-by: Wenchen Fan
@cloud-fan
Contributor

the test failure is unrelated, merging to master/2.2/2.1, thanks!

asfgit pushed a commit that referenced this pull request Jun 24, 2017
@asfgit asfgit closed this in 2e1586f Jun 24, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018