@iRakson
Contributor

@iRakson iRakson commented Mar 2, 2020

What changes were proposed in this pull request?

At the moment we do not have any function to compute the length of a JSON array directly.
I propose a json_array_length function which returns the length of the outermost JSON array.

  • This function returns the length of the outermost JSON array if the input is a valid JSON array.

scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
|                                                 5|
+--------------------------------------------------+


scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
|                             2|
+------------------------------+

  • For any other valid JSON string, an invalid JSON string, a null array, or NULL input, NULL is returned.

scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
|               null|
+-------------------+
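
The behaviour described above can be sketched outside Spark with Jackson's streaming API, which is the parser type that appears in the diff under review. This is a minimal sketch under stated assumptions, not the code merged in this PR: the object name `JsonArrayLengthSketch` and the `Option[Int]` return type (with `None` standing in for SQL `NULL`) are chosen only for illustration.

```scala
import java.io.IOException
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Hypothetical helper, not part of Spark: counts the elements of the
// outermost JSON array without materializing any of them.
object JsonArrayLengthSketch {
  private val factory = new JsonFactory()

  /** Returns Some(length) for a valid JSON array, None (SQL NULL) otherwise. */
  def jsonArrayLength(json: String): Option[Int] = {
    if (json == null) {
      None
    } else {
      val parser = factory.createParser(json)
      try {
        if (parser.nextToken() != JsonToken.START_ARRAY) {
          // Empty input (null token) or a valid JSON value that is not an array.
          None
        } else {
          var length = 0
          var token = parser.nextToken()
          while (token != null && token != JsonToken.END_ARRAY) {
            length += 1
            // For nested arrays/objects, jump to the matching close token;
            // for scalar values this is a no-op.
            parser.skipChildren()
            token = parser.nextToken()
          }
          if (token == JsonToken.END_ARRAY) Some(length) else None
        }
      } catch {
        // JsonProcessingException is a subclass of IOException in Jackson 2.x,
        // so malformed input lands here as well.
        case _: IOException => None
      } finally {
        parser.close()
      }
    }
  }
}
```

Only top-level elements are counted, so nested arrays and objects each contribute one to the result regardless of their own size.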

Why are the changes needed?

  • As mentioned in JIRA, this function is supported by Presto, PostgreSQL, Redshift, SQLite, MySQL, MariaDB, and IBM DB2.

  • For a better user experience and ease of use.

Performance Result for Json array - [1, 2, 3, 4]

Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length                                  7728           7762          53          1.3         772.8       1.0X
size+from_json                                    12739          12895         199          0.8        1273.9       0.6X
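
The derived columns of the table above follow from the best time by simple arithmetic. The sketch below reproduces them. The row count of 10,000,000 is not stated in the PR; it is inferred from 7728 ms / 772.8 ns, and the object and method names are illustrative only.

```scala
// Hypothetical helper for re-deriving the benchmark columns.
object BenchmarkArithmetic {
  // Assumption: inferred from Best Time(ms) / Per Row(ns) = 7728 ms / 772.8 ns.
  val rows: Long = 10000000L

  /** Nanoseconds spent per row, given the best time in milliseconds. */
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / rows

  /** Throughput in millions of rows per second. */
  def rateMPerSec(bestMs: Double): Double = rows / (bestMs * 1e3)

  /** Speed relative to the baseline (the first row of the table). */
  def relative(baselineMs: Double, bestMs: Double): Double = baselineMs / bestMs
}
```

With these definitions, `relative(7728, 12739)` is about 0.61, which the table rounds to 0.6X: on this input, size+from_json runs at roughly 60% of the speed of json_array_length.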
 

Does this PR introduce any user-facing change?

Yes, users can now get the length of a JSON array by using json_array_length.

How was this patch tested?

Added UT.

@iRakson
Contributor Author

iRakson commented Mar 2, 2020

Member

@MaxGekk MaxGekk left a comment

@iRakson What is the use case when you need to know length w/o full parsing? Whereas you can apply size after from_json().

@iRakson
Contributor Author

iRakson commented Mar 2, 2020

> @iRakson What is the use case when you need to know length w/o full parsing? Whereas you can apply size after from_json().

There are cases where we need only the JSON arrays whose size is greater than a certain threshold. At the moment we have to parse all the JSON arrays, and only then can we decide whether a given array is needed.
This problem can be solved by using this function.
There can be many other scenarios where users first want to know the size of a JSON array.
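
The threshold use case can be sketched as follows. This assumes a Spark build that already includes `json_array_length` as a SQL function; the object and method names (`ThresholdFilterSketch`, `keepLongArrays`) are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Hypothetical helper: keeps only the JSON strings whose outermost array
// has more than `threshold` elements.
object ThresholdFilterSketch {
  def keepLongArrays(spark: SparkSession, rows: Seq[String], threshold: Int): Seq[String] = {
    import spark.implicits._
    rows.toDF("json")
      // Rows where json_array_length returns NULL make the predicate NULL,
      // so they are dropped by the filter as well.
      .where(expr(s"json_array_length(json) > $threshold"))
      .as[String]
      .collect()
      .toSeq
  }
}
```

The function is called through `expr` so that the sketch depends only on the SQL function name and not on any Scala wrapper in `org.apache.spark.sql.functions`.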

@MaxGekk
Member

MaxGekk commented Mar 2, 2020

> At the moment we have to parse all ...

You can avoid deep parsing by specifying string as element type. For example:

scala> val df = Seq("""[{"a":1}, {"a": 2}]""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.select(size(from_json($"json", ArrayType(StringType)))).show
+---------------------+
|size(from_json(json))|
+---------------------+
|                    2|
+---------------------+

It actually does the same as your expression. It may be less optimal because from_json() materializes arrays, but how to optimize the combination of size + from_json on an array of strings is a separate question. I would add an optimization rule instead of extending public API.

@iRakson
Contributor Author

iRakson commented Mar 2, 2020

> I would add an optimization rule instead of extending public API.

I believe a public API might serve users better, since they are more familiar with json_array_length: this function is supported by most database engines. Also, it seems more intuitive than size+from_json.

@maropu
Member

maropu commented Mar 3, 2020

Could you check more databases, e.g., oracle, sql server, snowflake, ...?

@iRakson
Contributor Author

iRakson commented Mar 3, 2020

> Could you check more databases, e.g., oracle, sql server, snowflake, ...?

After checking, I found that these DBMSs support a json_array_length function:

  • PostgreSQL
  • IBM DB2
  • SQLite
  • Teradata
  • Redshift
  • Presto
  • Amazon Athena

These DBMSs support a json_length function instead:

  • MySQL
  • MariaDB

@HyukjinKwon
Member

ok to test

> SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
5
""",
since = "3.0.0"
Member

-> 3.1.0

* A function that returns number of elements in outer Json Array.
*/
@ExpressionDescription(
usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
Member

Can you also add arguments description?

Contributor Author

arguments description is added.

Member

nit: I like a consistent word: length of or number of?

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119263 has finished for PR 27759 at commit d8ec950.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class LengthOfJsonArray(child: Expression)

@iRakson iRakson requested a review from HyukjinKwon March 4, 2020 05:37
""",
since = "3.1.0"
)
case class LengthOfJsonArray(child: Expression)
Member

The semantics of LengthOfJsonArray are similar to Size + JsonToStructs. How about extending RuntimeReplaceable and mapping LengthOfJsonArray to a combination of existing expressions?

Member

I was thinking about this, but it seems JsonToStructs requires a schema, which is unknown in this expression.

override def prettyName: String = "json_array_length"

override def eval(input: InternalRow): Any = {
@transient
Member

Is this annotation really needed?


private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
// Counter for length of array
var array_length: Int = 0;
Member

Contributor Author

I forgot to change the name. I will rename it and remove the comment as well.

}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
// Counter for length of array
Member

I think the comment can be removed. I would rename array_length to counter or length

}
}
} catch {
case _: JsonProcessingException => null
Member

nextToken can throw IOException. see:

    public abstract JsonToken nextToken() throws IOException;

Member

Just in case: the exceptions are handled by JacksonParser:

case e @ (_: RuntimeException | _: JsonProcessingException | _: MalformedInputException) =>

Contributor Author

I will handle IOException. I missed that. Thanks for pointing it out.

Examples:
> SELECT _FUNC_('[1,2,3,4]');
4
> SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
Member

Should the expression support arrays with mixed element types? For instance, from_json() can only parse arrays of one particular element type, so you could get the length of such an array but not parse it with from_json.

Contributor Author

Actually, I wanted it to work like other json_array_length functions, which take any input and return its length. I can change its implementation if required.

throw new AnalysisException(s"$prettyName can only be called on Json Array.")
}
// Keep traversing until the end of Json Array
while(parser.nextToken() != JsonToken.END_ARRAY) {
Member

Can nextToken return null? Looks like it can:

def nextUntil(parser: JsonParser, stopOn: JsonToken): Boolean = {
  parser.nextToken() match {
    case null => false
    case x => x != stopOn
  }
}

Contributor Author

It returns null when the end of input is reached.
If it returns null before returning END_ARRAY, then our JSON is invalid, and invalid input is already handled.
Anyway, I will add one more check for null.
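
The end-of-input behaviour discussed in this thread can be checked directly against Jackson. The sketch below is an illustration under stated assumptions rather than code from this PR; `EndOfInputSketch` and `drain` are hypothetical names. It drains every token and reports whether the parser ended cleanly with a null token or raised a parse error first.

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonProcessingException}

// Hypothetical helper for observing how nextToken() behaves at end of input.
object EndOfInputSketch {
  private val factory = new JsonFactory()

  /** Right(tokenCount) if nextToken() eventually returns null,
    * Left(exceptionClassName) if parsing fails before that. */
  def drain(json: String): Either[String, Int] = {
    val parser = factory.createParser(json)
    try {
      var count = 0
      while (parser.nextToken() != null) count += 1
      Right(count)
    } catch {
      case e: JsonProcessingException => Left(e.getClass.getSimpleName)
    } finally {
      parser.close()
    }
  }
}
```

An empty string and a complete array such as `[1,2,3]` both end with a null token (`Right`), whereas a truncated array such as `[1,2,3` fails with an exception (`Left`) before null is ever returned. That is why a null check inside a loop that waits for END_ARRAY is not reached for truncated input.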

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119279 has finished for PR 27759 at commit 58a9e0d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119303 has finished for PR 27759 at commit 34d915d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* A function that returns number of elements in outer Json Array.
Member

nit: the number of?

usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
arguments = """
jsonArray - A JSON array is required as argument. `Analysis Exception` is thrown if any other
valid JSON expression is passed. `NULL` is returned in case of invalid JSON.
Member

Contributor Author

ok. I will make the changes

@iRakson iRakson requested a review from maropu March 5, 2020 05:34
@SparkQA

SparkQA commented Mar 5, 2020

Test build #119365 has finished for PR 27759 at commit d00fe19.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2020

Test build #119367 has finished for PR 27759 at commit 0cb4f84.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Mar 5, 2020

retest this please

@iRakson
Contributor Author

iRakson commented Mar 6, 2020

@HyukjinKwon @maropu @MaxGekk I have addressed the review comments from my side. Please review again.

arguments = """
Arguments:
* jsonArray - A JSON array. An exception is thrown if any other valid JSON strings are passed.
`NULL` is returned in case of an invalid JSON.
Member

@dongjoon-hyun dongjoon-hyun Apr 4, 2020

Shall we mention NULL input explicitly?

- `NULL` is returned in case of an invalid JSON.
+ `NULL` is returned in case of `NULL` or an invalid JSON

}

/**
* A function that returns the number of elements in outer JSON array.
Member

Since JSON can be a nested structure, there might be multiple inner and outer. Can we use the outmost instead of outer?

}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
var length = 0;
Member

Ur, shall we remove ; since this is Scala? Please check the other code together.

df.selectExpr("json_array_length(json)")
}.getMessage
assert(errMsg.contains("due to data type mismatch"))
}
Member

Shall we remove this, since it is already covered more extensively in json-functions.sql?

@iRakson iRakson requested a review from dongjoon-hyun April 4, 2020 20:57
@dongjoon-hyun
Member

Please update the PR description, too. It's important because it will be a commit log.

@iRakson
Contributor Author

iRakson commented Apr 4, 2020

> Please update the PR description, too. It's important because it will be a commit log.

Updated. :)

@SparkQA

SparkQA commented Apr 5, 2020

Test build #120817 has finished for PR 27759 at commit 391f33d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

options.asJava)),
Seq(Row("string")))
}

Member

Thank you for deleting it, but to be complete, you need to restore the file to its master branch version. You may need something like the following.

git checkout -f master sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala

("""[1,2,3,[33,44],{"key":[2,3,4]}]""", 5),
("""[1,2,3,4,5""", null),
("""Random String""", null)
).foreach{
Member

nit. ).foreach{ -> ).foreach {

("""""", null),
("""[1,2,3]""", 3),
("""[]""", 0),
("""[[1],[2,3],[]]""", 3),
Member

This is different from the example I gave you. Do you have any reason to prefer """ here? Usually, Apache Spark prefers the simple form " over """.

Seq(
  ("", null),
  ("[]", 0),
  ("[1,2,3]", 3),
  ("[[1],[2,3],[]]", 3),

while(parser.nextToken() != JsonToken.END_ARRAY) {
// Null indicates end of input.
if (parser.currentToken == null) {
throw new IllegalArgumentException("Please provide a valid JSON array.")
Member

I'm wondering if we can have test coverage for this code path. Otherwise, this code path can be considered dead code.

Contributor Author

Yeah. It is unreachable code. Because if we encounter null before END_ARRAY then our JSON is invalid. I will remove this check.

@dongjoon-hyun
Member

Hi, @iRakson . Thank you for updating. I left only a few comments. The other things look okay to me.

@iRakson
Contributor Author

iRakson commented Apr 6, 2020

@dongjoon-hyun Done.

@iRakson iRakson requested a review from dongjoon-hyun April 6, 2020 06:23
@SparkQA

SparkQA commented Apr 6, 2020

Test build #120860 has finished for PR 27759 at commit 313151f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 6, 2020

retest this please

@SparkQA

SparkQA commented Apr 6, 2020

Test build #120862 has finished for PR 27759 at commit 313151f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Apr 6, 2020

@cloud-fan I made the changes which you asked for in #27836.
Kindly review. I will also update the PR description accordingly.

@SparkQA

SparkQA commented Apr 6, 2020

Test build #120872 has finished for PR 27759 at commit f44e24e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @iRakson and all!
Merged to master for Apache Spark 3.1.0.

@iRakson
Contributor Author

iRakson commented Apr 8, 2020

Thank you all for patiently reviewing the PR.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27759 from iRakson/jsonArrayLength.

Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
wangyum added a commit that referenced this pull request May 26, 2023
…1284)

* [SPARK-31008][SQL] Support json_array_length function

Closes #27759 from iRakson/jsonArrayLength.

Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit 71022d7)