[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one #19017
jmchung wants to merge 4 commits into apache:master
|
cc @viirya |
|
ok to test |
|
Thanks for triggering the test @HyukjinKwon |
|
Test build #80971 has finished for PR 19017 at commit
|
@jmchung Could we replace the functional transformations with a while loop here? I think they should be avoided, particularly on a hot path. This should be a valid suggestion per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
I was thinking of an additional while loop with an if, because all we need is to set the same value on multiple fields (the while) when the field name is the same (the if).
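To make the trade-off in this comment concrete, here is a minimal sketch of the two styles being compared. The object and method names (`RepeatedFieldsSketch`, `fillFunctional`, `fillWhile`) and the use of a plain `Array[Any]` for the row are assumptions for illustration only; they are not the code in `jsonExpressions.scala`.

```scala
object RepeatedFieldsSketch {
  // Functional style: builds an intermediate collection of (name, index)
  // pairs before writing anything into the row.
  def fillFunctional(fieldNames: IndexedSeq[String], jsonField: String,
      jsonValue: String, row: Array[Any]): Unit = {
    fieldNames.zipWithIndex
      .filter { case (name, _) => name == jsonField }
      .foreach { case (_, i) => row(i) = jsonValue }
  }

  // Imperative style: a while loop with an if, no intermediate collection.
  def fillWhile(fieldNames: IndexedSeq[String], jsonField: String,
      jsonValue: String, row: Array[Any]): Unit = {
    var idx = 0
    while (idx < fieldNames.length) {
      if (fieldNames(idx) == jsonField) {
        row(idx) = jsonValue
      }
      idx += 1
    }
  }
}
```

Both methods write the same value into every position whose field name matches; they differ only in how the positions are found.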
|
@HyukjinKwon That's a good point, thanks. |
|
@HyukjinKwon @viirya I replaced the functional transformations with a while loop. |
|
Test build #81016 has finished for PR 19017 at commit
|
|
retest this please |
|
Test build #81025 has finished for PR 19017 at commit
|
|
Please edit the PR title as |
|
@viirya PR title fixed, thanks. |
|
LGTM |
|
LGTM too. |
        row(idx) = jsonValue
      }
      idx = idx + 1
    }
|
Could you rewrite it using fewer lines, in a more idiomatic Scala way?
|
We have followed @HyukjinKwon's suggestion #19017 (review) to avoid functional transformation with a while-loop, since this is a hot path. It makes sense to me.
|
You can still simplify the code a lot without functional transformations.
|
If I comment out L451-452, the repeated fields still get the same jsonValue because fieldNames(idx) == jsonField, but the first comparison is unnecessary, since idx >= 0 already means a match.
Could you please give me some advice?
|
Would you maybe have a suggestion? The current status looks fine.
|
row(idx) = jsonValue
idx = idx + 1
// SPARK-21804: json_tuple returns null values within repeated columns
// except the first one; so that we need to check the remaining fields.
while (idx < fieldNames.length) {
  if (fieldNames(idx) == jsonField) {
    row(idx) = jsonValue
  }
  idx = idx + 1
}
->
do {
  row(idx) = jsonValue
  idx = fieldNames.indexOf(jsonField, idx + 1)
} while (idx >= 0)
|
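As a self-contained illustration of the `indexOf`-based suggestion, the sketch below wraps the same idea in a standalone method. It uses a plain `while` loop that starts from the first match rather than the `do ... while` form, so that it also compiles on Scala 3; the object name, the method name and the `Array[Any]` row are assumptions for illustration, not Spark's actual code.

```scala
object IndexOfLoopSketch {
  // Writes jsonValue into every position of row whose field name equals
  // jsonField, jumping from one match to the next via indexOf(elem, from).
  def fillAll(fieldNames: IndexedSeq[String], jsonField: String,
      jsonValue: String, row: Array[Any]): Unit = {
    var idx = fieldNames.indexOf(jsonField)
    while (idx >= 0) {
      row(idx) = jsonValue
      // indexOf returns -1 once there is no further match, which ends the loop.
      idx = fieldNames.indexOf(jsonField, idx + 1)
    }
  }
}
```

In the reviewed code the first match is already known when the loop is entered, which is why the suggestion can use `do ... while` and skip the initial lookup.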
I am also wondering whether we should use a hash table. However, the number of columns is not large, so it might not bring a noticeable benefit.
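For comparison, a hash-table variant might look roughly like the sketch below: the positions of each field name are computed once, and each parsed JSON field is then resolved with a single lookup. Everything here (`FieldIndexMapSketch`, `buildIndex`, `fillAll`, the `Array[Any]` row) is hypothetical and only illustrates the idea raised in this comment; it is not what the PR implemented.

```scala
object FieldIndexMapSketch {
  // Built once per set of field names: field name -> all positions where
  // that name occurs, in ascending order.
  def buildIndex(fieldNames: IndexedSeq[String]): Map[String, Array[Int]] =
    fieldNames.indices
      .groupBy(i => fieldNames(i))
      .map { case (name, positions) => name -> positions.toArray }

  // Per parsed JSON field: one map lookup, then a write per matching position.
  def fillAll(index: Map[String, Array[Int]], jsonField: String,
      jsonValue: String, row: Array[Any]): Unit = {
    index.get(jsonField) match {
      case Some(positions) =>
        var i = 0
        while (i < positions.length) {
          row(positions(i)) = jsonValue
          i += 1
        }
      case None => // field was not requested; nothing to write
    }
  }
}
```

Whether the extra map pays off depends on how many fields are requested; with only a handful of columns, a linear scan is likely just as fast.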
|
Test build #81066 has finished for PR 19017 at commit
|
|
Is it better than functional transformation?
…On Aug 24, 2017 1:46 PM, "Xiao Li" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
@@ -447,7 +448,18 @@ case class JsonTuple(children: Seq[Expression])
generator => copyCurrentStructure(generator, parser)
}
- row(idx) = UTF8String.fromBytes(output.toByteArray)
+ val jsonValue = UTF8String.fromBytes(output.toByteArray)
+ row(idx) = jsonValue
+ idx = idx + 1
+
+ // SPARK-21804: json_tuple returns null values within repeated columns
+ // except the first one; so that we need to check the remaining fields.
+ while (idx < fieldNames.length) {
+ if (fieldNames(idx) == jsonField) {
+ row(idx) = jsonValue
+ }
+ idx = idx + 1
+ }
|
|
A hash table is overkill here.
|
|
I really doubt we can see a measurable performance difference among these solutions. I just did not want to challenge it. Maybe you can write an end-to-end test and see the difference. For that reason, I prefer the simplest one. |
|
If we assume the performance difference is negligible, a functional transformation is actually more concise. @HyukjinKwon What do you think? |
|
The current status looks fine. I don't think we should prefer simplicity on a hot path. This obviously follows the guidelines and should be safe enough to go in. It does not hurt my eyes. |
|
Sorry, I do not think the current code is ready to merge. |
Could you explain why? |
|
I think my suggestion is better: #19017 (comment). If you think mine is slower, please provide an end-to-end test to show the performance numbers. If this really impacts performance, I think using a hash table might be better. |
|
Either way is fine for me, but I personally prefer the current way because it follows the guide exactly. BTW, I think you should do the perf tests if you think your suggestion is better. |
|
OK. @jmchung Please change it based on my comment. |
|
@gatorsmile OK, and thank you for all the helpful comments. |
|
Test build #81072 has finished for PR 19017 at commit
|
|
LGTM |
|
Merged to master. |
|
Thanks @HyukjinKwon @gatorsmile |
|
Thanks @viirya, @HyukjinKwon and @gatorsmile. |
What changes were proposed in this pull request?
When json_tuple extracts values from JSON, it returns null for repeated columns except the first one, as below:
I think this should be consistent with Hive's implementation:
In this PR, we locate all the matched indices in fieldNames instead of returning only the first matched index, i.e., indexOf.
How was this patch tested?
Added a test in JsonExpressionsSuite.