[SPARK-31008][SQL] Support json_array_length function #27759
Conversation
MaxGekk left a comment
@iRakson What is the use case when you need to know length w/o full parsing? Whereas you can apply size after from_json().
There are cases where we need JSON arrays whose size is greater than a certain threshold. At the moment we have to parse all the JSON arrays fully, and only then can we decide whether a given array is needed or not.
You can avoid deep parsing by specifying string as the element type. For example:
```scala
scala> val df = Seq("""[{"a":1}, {"a": 2}]""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.select(size(from_json($"json", ArrayType(StringType)))).show
+---------------------+
|size(from_json(json))|
+---------------------+
|                    2|
+---------------------+
```
It does actually the same as your expression. Maybe it is less optimal because
I believe a public API might serve better, as users are more familiar with
Could you check more databases, e.g., oracle, sql server, snowflake, ...?
After checking databases, I found out that these DBMSs support
While these DBMSs support
ok to test
```scala
      > SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
       5
  """,
  since = "3.0.0"
```
-> 3.1.0
```scala
 * A function that returns number of elements in outer Json Array.
 */
@ExpressionDescription(
  usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
```
Can you also add arguments description?
arguments description is added.
nit: I like a consistent word: length of or number of?
Test build #119263 has finished for PR 27759 at commit
```scala
  """,
  since = "3.1.0"
)
case class LengthOfJsonArray(child: Expression)
```
The semantics of LengthOfJsonArray are similar to Size + JsonToStructs. What about extending RuntimeReplaceable and mapping LengthOfJsonArray to a combination of existing expressions?
I was thinking about this, but it seems JsonToStructs requires a schema, which is unknown in this expression.
```scala
override def prettyName: String = "json_array_length"

override def eval(input: InternalRow): Any = {

@transient
```
Is this annotation really needed?
```scala
private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  // Counter for length of array
  var array_length: Int = 0;
```
variable name should be in camelCase, see https://github.com/databricks/scala-style-guide#variable-naming-convention
I forgot to change the name. I will rename it, and the comment will be removed as well.
```scala
}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  // Counter for length of array
```
I think the comment can be removed. I would rename array_length to counter or length
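The token-counting approach under discussion can be sketched as a standalone function. This is illustrative only, not the PR's actual code: `JsonArrayLengthSketch` and `jsonArrayLength` are made-up names, and the sketch assumes only Jackson's streaming `JsonParser` API, mirroring the expression's NULL-on-invalid-input behaviour with `Option`:

```scala
import java.io.IOException
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Sketch only: counts elements of the outermost JSON array without fully
// materializing nested values. Names are illustrative, not the PR's code.
object JsonArrayLengthSketch {
  private val factory = new JsonFactory()

  // Returns Some(length) for a valid JSON array, None otherwise.
  def jsonArrayLength(json: String): Option[Int] = {
    val parser = factory.createParser(json)
    try {
      if (parser.nextToken() != JsonToken.START_ARRAY) return None
      var counter = 0
      var token = parser.nextToken()
      while (token != JsonToken.END_ARRAY) {
        if (token == null) return None // input ended before the array closed
        counter += 1
        parser.skipChildren() // skips a whole nested array/object in one call
        token = parser.nextToken()
      }
      Some(counter)
    } catch {
      case _: IOException => None // covers JsonProcessingException as well
    } finally {
      parser.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(jsonArrayLength("""[1,2,3,{"f1":1,"f2":[5,6]},4]""")) // Some(5)
    println(jsonArrayLength("[1,2,3,4,5"))                        // None
    println(jsonArrayLength("not json"))                          // None
  }
}
```

Because `skipChildren()` jumps over an entire nested structure, only direct children of the outermost array are counted, which is the behaviour the examples in this PR describe.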
```scala
    }
  }
} catch {
  case _: JsonProcessingException => null
```
nextToken can throw IOException, see:
```java
public abstract JsonToken nextToken() throws IOException;
```
Just in case, the exceptions are handled by JacksonParser (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala, line 433 in c198620):
```scala
case e @ (_: RuntimeException | _: JsonProcessingException | _: MalformedInputException) =>
```
I will handle IOException. Missed that. Thanks for pointing out.
```scala
  Examples:
    > SELECT _FUNC_('[1,2,3,4]');
     4
    > SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
```
Should the expression support array of different element types? For instance, from_json() can parse arrays only of particular type. So, you can get length but cannot parse it by from_json.
Actually I wanted it to work like other json_array_length functions, which take any input and return its length. I can change the implementation if required.
```scala
  throw new AnalysisException(s"$prettyName can only be called on Json Array.")
}
// Keep traversing until the end of Json Array
while(parser.nextToken() != JsonToken.END_ARRAY) {
```
Can nextToken return null? Looks like it can (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala, lines 28 to 33 in 8e280ce):
```scala
def nextUntil(parser: JsonParser, stopOn: JsonToken): Boolean = {
  parser.nextToken() match {
    case null => false
    case x => x != stopOn
  }
}
```
It returns null when the end of input is reached.
If it returns null before returning END_ARRAY, then our JSON is invalid. Invalid input was already handled.
Anyway, I will now add one more check for null.
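This point can be checked with a small standalone sketch (not the PR's code; `TruncatedArrayCheck` and `reachesEndArray` are made-up names, and only Jackson's streaming API is assumed). Note that, depending on the Jackson version, a truncated array may yield a null token at end of input or raise a parse exception, so the loop guards against both:

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonProcessingException, JsonToken}

// Illustrative check: for a truncated array the parser never yields
// END_ARRAY -- it either returns null at end of input or throws a
// parse exception, so a null guard (or a catch) is needed either way.
object TruncatedArrayCheck {
  def reachesEndArray(json: String): Boolean = {
    val parser = new JsonFactory().createParser(json)
    try {
      var token = parser.nextToken() // START_ARRAY for array input
      while (token != null && token != JsonToken.END_ARRAY) {
        token = parser.nextToken()
      }
      token == JsonToken.END_ARRAY
    } catch {
      case _: JsonProcessingException => false
    } finally {
      parser.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(reachesEndArray("[1,2,3]")) // true
    println(reachesEndArray("[1,2,3"))  // false: the array is never closed
  }
}
```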
Test build #119279 has finished for PR 27759 at commit
Test build #119303 has finished for PR 27759 at commit
```scala
}

/**
 * A function that returns number of elements in outer Json Array.
```
nit: the number of?
```scala
usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
arguments = """
  jsonArray - A JSON array is required as argument. `Analysis Exception` is thrown if any other
  valid JSON expression is passed. `NULL` is returned in case of invalid JSON.
```
ok. I will make the changes
Test build #119365 has finished for PR 27759 at commit
Test build #119367 has finished for PR 27759 at commit
retest this please
@HyukjinKwon @maropu @MaxGekk I have handled the review comments from my side. Please review once.
Resolved review threads (outdated): sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
```scala
arguments = """
  Arguments:
    * jsonArray - A JSON array. An exception is thrown if any other valid JSON strings are passed.
      `NULL` is returned in case of an invalid JSON.
```
Shall we mention NULL input explicitly?
```diff
- `NULL` is returned in case of an invalid JSON.
+ `NULL` is returned in case of `NULL` or an invalid JSON.
```
```scala
}

/**
 * A function that returns the number of elements in outer JSON array.
```
Since JSON can be a nested structure, there might be multiple inner and outer. Can we use the outmost instead of outer?
```scala
}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  var length = 0;
```
Ur, shall we remove ; since this is Scala? Please check the other code together.
```scala
    df.selectExpr("json_array_length(json)")
  }.getMessage
  assert(errMsg.contains("due to data type mismatch"))
}
```
Shall we remove this because this is already covered at json-functions.sql more extensively.
Please update the PR description, too. It's important because it will be a commit log.
Updated. :)
Test build #120817 has finished for PR 27759 at commit
```scala
    options.asJava)),
  Seq(Row("string")))
}
```
Thank you for the deletion, but to be complete, you need to recover to the master branch version. You may need something like the following:
```shell
git checkout -f master sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
```
```scala
  ("""[1,2,3,[33,44],{"key":[2,3,4]}]""", 5),
  ("""[1,2,3,4,5""", null),
  ("""Random String""", null)
).foreach{
```
nit. ).foreach{ -> ).foreach {
```scala
  ("""""", null),
  ("""[1,2,3]""", 3),
  ("""[]""", 0),
  ("""[[1],[2,3],[]]""", 3),
```
This is different from the example I gave you. Do you have any reason to prefer """ here? Usually, Apache Spark prefers the simple form " over """.
```scala
Seq(
  ("", null),
  ("[]", 0),
  ("[1,2,3]", 3),
  ("[[1],[2,3],[]]", 3),
```
```scala
while(parser.nextToken() != JsonToken.END_ARRAY) {
  // Null indicates end of input.
  if (parser.currentToken == null) {
    throw new IllegalArgumentException("Please provide a valid JSON array.")
```
I'm wondering if we can have a test coverage for this code path. Otherwise, this code path can be considered as a dead code.
Yeah, it is unreachable code: if we encounter null before END_ARRAY, then our JSON is invalid, and invalid input was already handled. I will remove this check.
Hi, @iRakson. Thank you for updating. I left only a few comments. The other things look okay to me.
@dongjoon-hyun Done.
Test build #120860 has finished for PR 27759 at commit
retest this please
Test build #120862 has finished for PR 27759 at commit
@cloud-fan I made the changes which you asked for in #27836.
Test build #120872 has finished for PR 27759 at commit
dongjoon-hyun left a comment
+1, LGTM. Thank you, @iRakson and all!
Merged to master for Apache Spark 3.1.0.
Thank you all for patiently reviewing the PR.
### What changes were proposed in this pull request?
At the moment we do not have any function to compute the length of a JSON array directly.
I propose a `json_array_length` function which will return the length of the outermost JSON array.
- This function will return the length of the outermost JSON array, if the JSON array is valid.
```
scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
|                                                 5|
+--------------------------------------------------+

scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
|                             2|
+------------------------------+
```
- In case of any other valid JSON string, an invalid JSON string, a null array, or `NULL` input, `NULL` will be returned.
```
scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
|               null|
+-------------------+
```
### Why are the changes needed?
- As mentioned in JIRA, this function is supported by Presto, PostgreSQL, Redshift, SQLite, MySQL, MariaDB, and IBM DB2.
- For a better user experience and ease of use.
```
Performance Result for Json array - [1, 2, 3, 4]
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length 7728 7762 53 1.3 772.8 1.0X
size+from_json 12739 12895 199 0.8 1273.9 0.6X
```
### Does this PR introduce any user-facing change?
Yes, now users can get the length of a JSON array by using `json_array_length`.
### How was this patch tested?
Added UT.
Closes apache#27759 from iRakson/jsonArrayLength.
Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>