-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21203] [SQL] Fix wrong results of insertion of Array of Struct #18412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #78556 has finished for PR 18412 at commit
|
|
LGTM, pending test |
|
Test build #78559 has started for PR 18412 at commit |
|
retest this please |
|
Test build #78562 has finished for PR 18412 at commit
|
|
retest this please |
| if (row.isNullAt(i)) null else castFuncs(i)(row.get(i, from.apply(i).dataType))) | ||
| i += 1 | ||
| } | ||
| newRow.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it better to just fix GenericInternalRow.copy? I think I broke it when I removed MutableRow (see #15333).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of having a same row instance and copy it every time, I think it makes more sense to create a different row everytime.
Besides, I also had a PR to fix this: #15082 . Maybe I should reopen it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell I found the semantics of the original GenericMutableRow.copy() pretty dangerous since it didn't properly implement deep copy semantics. In fact, it's impossible to do a proper deep copy there. We only copy the underlying field array but may still share field objects accidentally.
If we do want to "fix" GenericInternalRow.copy(), I'd prefer throwing an exception instead of following the old GenericMutableRow.copy() method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that there are problems with GenericInternalRow and we should document the current shallow copy semantivs. However, since the contract of InternalRow currently allows update, I do think we need to fix copy in order to make it less broken. The
|
Test build #78568 has finished for PR 18412 at commit
|
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile <[email protected]>
Closes #18412 from gatorsmile/castStruct.
(cherry picked from commit 2e1586f)
Signed-off-by: Wenchen Fan <[email protected]>
|
the test failure is unrelated, merging to master/2.2/2.1, thanks! |
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile <[email protected]>
Closes #18412 from gatorsmile/castStruct.
(cherry picked from commit 2e1586f)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile <[email protected]>
Closes apache#18412 from gatorsmile/castStruct.
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
Author: gatorsmile <[email protected]>
Closes apache#18412 from gatorsmile/castStruct.
(cherry picked from commit 2e1586f)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The above query always return the last struct of the array, because the rule
SimplifyCastsincorrectly rewrites the query. The underlying cause is we always use the sameGenericInternalRowobject when doing the cast.How was this patch tested?