-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31009][SQL] Support json_object_keys function #27836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iRakson Could you show the use case of the function.
|
We can get the same result using existing functions: scala> val df = Seq("""{"a":1, "b":2}""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.select(map_keys(from_json($"json", MapType(StringType, StringType)))).show
+-----------------+
|map_keys(entries)|
+-----------------+
| [a, b]|
+-----------------+What's the benefit of having the function in public API? |
|
|
@HyukjinKwon @cloud-fan @maropu |
|
ok to test |
|
Do you know how much faster it is?
|
|
Test build #119637 has finished for PR 27836 at commit
|
|
Test build #119648 has finished for PR 27836 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
@MaxGekk does it look good to you? |
Has the @dongjoon-hyun comment above been already adderssed? If this func has performance benefits, I think its better to show the perf. numbers in the PR description. |
I have updated the time required for |
|
Test build #120032 has finished for PR 27836 at commit
|
| var arrayBufferOfKeys = ArrayBuffer.empty[UTF8String] | ||
| // this handles `NULL` case | ||
| if (parser.nextToken() == null) { | ||
| throw new AnalysisException(s"$prettyName expect a JSON object but nothing is provided.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the exception thrown at analysis phase? I think AnalysisException should be replaced by RuntimeException.
|
The function works for literals but I got NPE on a column with JSON objects: test("json_object_keys") {
spark.sql("""select json_object_keys('{"a": 1, "b": 2, "c": 3}')""").show(false)
val df = Seq(
"""{"a":1}""",
"""{"a":1, "b":2}""",
"""{"a": 1, "b": 2, "c": 3}""").toDF("json")
df.show(false)
val dfKeys = df.selectExpr("json_object_keys(json)")
dfKeys.show(false)
} |
I fixed this issue and also added a new test case for this particular issue. |
|
Test build #120162 has finished for PR 27836 at commit
|
|
|
||
| override def eval(input: InternalRow): Any = { | ||
| try { | ||
| lazy val json = child.eval(input).asInstanceOf[UTF8String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need lazy here?
| arguments = """ | ||
| Arguments: | ||
| * json_object - A JSON object is required as argument. `Null` is returned, if an invalid JSON | ||
| string is given. `Runtime Exception` is thrown, if null string or JSON array is given. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
* json_object - A JSON object. If it is an invalid string, the function returns null.
If it is a JSON array or null, a runtime exception will be thrown.
btw, the error handling is consistent with the other JSON functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the other json functions. IllegalArgumentException is thrown by from_json for invalid inputs and it is better than RuntimeException.
|
|
||
| test("json_object_keys") { | ||
| val df = Seq("""{"a": 1, "b": 2, "c": 3}""".stripMargin) | ||
| .toDF("json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you don't need the line break here.
|
Test build #120213 has finished for PR 27836 at commit
|
|
Test build #120215 has finished for PR 27836 at commit
|
|
retest this please |
|
Test build #120264 has finished for PR 27836 at commit
|
|
gentle ping @HyukjinKwon @maropu @MaxGekk |
| -- !query schema | ||
| struct<> | ||
| -- !query output | ||
| java.lang.IllegalArgumentException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you throw IllegalArgumentException for '[1, 2, 3]' but NULL for '{"key": 45, "random_string"}'? It looks slightly inconsistent though both input are invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For invalid JSON strings NULL is thrown. '[1, 2, 3]' is a valid JSON string. But not a JSON object. So invalid argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this issue. I am thinking of implementing a function ISJSON(). It will return NULL for null, 1 for valid JSON string and 0 for invalid JSON string.
This function is supported by most of the major DBMSs.
Currently we return NULL for invalid JSON.
In this PR as well as in #27759 , NULL is returned for null JSON strings. So behaviour might be confusing as we are returning NULL for invalid JSON strings as well.
|
Test build #120874 has finished for PR 27836 at commit
|
|
Hi, @iRakson . |
| ("""{"key": 45, "random_string"}""", null), | ||
| ("", null), | ||
| ("[]", null), | ||
| ("""[{"key": "JSON"}]""", null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you sort the test cases a little meaningfully, please?
| ("", null), | ||
| ("[]", null), | ||
| ("""[{"key": "JSON"}]""", null) | ||
| ).foreach{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. ).foreach{ -> ).foreach {
| select json_object_keys(200); | ||
| select json_object_keys(''); | ||
| select json_object_keys('{"key": 1}'); | ||
| select json_object_keys('{}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we switch line 67 and 68?
| val df = Seq("""{"a": 1, "b": 2, "c": 3}""").toDF("json") | ||
| val dfKeys = df.selectExpr("json_object_keys(json)") | ||
| checkAnswer(dfKeys, Row(Array("a", "b", "c"))) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can revert this change from this PR since we have a superset already.
| * A function which returns all the keys of the outmost JSON object. | ||
| */ | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(json_object) - returns all the keys of the outmost JSON object as an array.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. returns -> Returns?
| parser => getJsonKeys(parser, input) | ||
| } | ||
| } catch { | ||
| case _: JsonProcessingException => null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question: There is no need to handle IOException here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nextToken() throws IOException. I will update it with other changes.
| // return null if an empty string or any other valid JSON string is encountered | ||
| if (parser.nextToken() == null || parser.currentToken() != JsonToken.START_OBJECT) { | ||
| return null | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we move this exception handling part (line 849 ~ 851) from this getJsonKey into line 839 (outside of this function)?
|
@dongjoon-hyun I resolved the conflicts and also have tried to handle all the review comments. |
|
Test build #120963 has finished for PR 27836 at commit
|
|
Test build #120964 has finished for PR 27836 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @iRakson and all.
Merged to master for Apache Spark 3.1.0.
### What changes were proposed in this pull request?
A new function `json_object_keys` is proposed in this PR. This function will return all the keys of the outmost json object. It takes Json Object as an argument.
- If invalid json expression is given, `NULL` will be returned.
- If an empty string or json array is given, `NULL` will be returned.
- If valid json object is given, all the keys of the outmost object will be returned as an array.
- For empty json object, empty array is returned.
We can also get JSON object keys using `map_keys+from_json`. But `json_object_keys` is more efficient.
```
Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys 11666 12361 673 0.9 1166.6 1.0X
from_json+map_keys 15309 15973 701 0.7 1530.9 0.8X
```
### Why are the changes needed?
This function will help naive users in directly extracting the keys from json string and its fairly intuitive as well. Also its extends the functionality of spark-sql for json strings.
Some of the most popular DBMSs supports this function.
- PostgreSQL
- MySQL
- MariaDB
### Does this PR introduce any user-facing change?
Yes. Now users can extract keys of json objects using this function.
### How was this patch tested?
UTs added.
Closes apache#27836 from iRakson/jsonKeys.
Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
We don't need to define the schema in Json_object_keys |
What changes were proposed in this pull request?
A new function
json_object_keysis proposed in this PR. This function will return all the keys of the outmost json object. It takes Json Object as an argument.NULLwill be returned.NULLwill be returned.We can also get JSON object keys using
map_keys+from_json. Butjson_object_keysis more efficient.Why are the changes needed?
This function will help naive users in directly extracting the keys from json string and its fairly intuitive as well. Also its extends the functionality of spark-sql for json strings.
Some of the most popular DBMSs supports this function.
Does this PR introduce any user-facing change?
Yes. Now users can extract keys of json objects using this function.
How was this patch tested?
UTs added.