[SPARK-19828][R] Support array type in from_json in R #17178
schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
                "struct<age:integer,height:double>")
for (schema in schemas) {
I re-used the existing test code with the loop below:
- df <- as.DataFrame(j)
- schema <- structType(structField("age", "integer"),
- structField("height", "double"))
+ schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
+ "struct<age:integer,height:double>")
+ for (schema in schemas) {
+ df <- as.DataFrame(j)
...
+ }
R/pkg/R/functions.R
Outdated
column(jc)
})

setClassUnion("characterOrstructType", c("character", "structType"))
Can I use a class union here in this file?
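For readers unfamiliar with class unions, here is a minimal base-R sketch of what the `setClassUnion` line is meant to achieve: one method signature that accepts either a type string or a schema object. Everything named `demo...` or `describeSchema` is a hypothetical stand-in for illustration and is not part of SparkR; the sketch only assumes the schema class is an S3 class registered with `setOldClass`.

```r
library(methods)

# Hypothetical stand-in for an S3 schema class, registered so that S4
# dispatch can see it. This is NOT SparkR's structType.
setOldClass("demoStructType")

# The union lets a single method signature accept either representation.
setClassUnion("characterOrDemoStructType", c("character", "demoStructType"))

# Hypothetical generic, used only to show the dispatch.
setGeneric("describeSchema", function(schema) standardGeneric("describeSchema"))

setMethod("describeSchema", signature(schema = "characterOrDemoStructType"),
          function(schema) {
            if (is.character(schema)) {
              # A type string was passed.
              paste("type string:", schema)
            } else {
              # A schema object was passed.
              paste("struct with fields:",
                    paste(names(schema$fields), collapse = ", "))
            }
          })

demoSchema <- structure(list(fields = list(age = "integer", height = "double")),
                        class = "demoStructType")

fromString <- describeSchema("struct<age:integer,height:double>")
fromObject <- describeSchema(demoSchema)
```

Both calls reach the same method body, which then branches on the actual type of `schema`.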
@felixcheung, this is a bit different from what we talked about, but I opened this because I thought you might like it more. This can take the type string given to Could you check if this makes sense to you?
Test build #73997 has finished for PR 17178 at commit
# check if array type in string is correctly supported.
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
arr <- collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
(Just in case: from_json takes both a DataType and the JSON string produced by DataType.json. In that sense, I thought it would be nicer if it took what structField takes in R.)
expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)

schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
                "struct<age:integer,height:double>")
I wonder if this is too loosely typed, and the format too hard to explain or illustrate to R users:
struct<age:integer,height:double>
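To make the shape of these type strings concrete, here is a small base-R sketch that assembles them from their parts. The helpers `structTypeString` and `arrayTypeString` are hypothetical names invented for this illustration; they are not SparkR functions and do no validation of the element type names.

```r
# Hypothetical helper (not part of SparkR): builds "struct<name:type,...>"
# from named arguments, e.g. age = "integer".
structTypeString <- function(...) {
  fields <- c(...)  # named character vector: field name -> type string
  stopifnot(is.character(fields),
            !is.null(names(fields)),
            all(nzchar(names(fields))))
  paste0("struct<", paste0(names(fields), ":", fields, collapse = ","), ">")
}

# Hypothetical helper (not part of SparkR): wraps an element type string
# as "array<elementType>".
arrayTypeString <- function(elementType) {
  stopifnot(is.character(elementType), length(elementType) == 1)
  paste0("array<", elementType, ">")
}

person <- structTypeString(age = "integer", height = "double")
people <- arrayTypeString(structTypeString(name = "string"))
```

Here `person` is the struct string used in the test above, and `people` is the nested array-of-struct string used in the array test.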
Yes.. I persuaded myself that it is okay because it's a valid type string for structField. If you prefer the optional-parameter approach, I could try that. Otherwise, let me close this for now if you are not sure about either way :).
it might be ok if this is well documented and well checked.
checkType in schema.R is strangely largely unused or untested though.
Let me give it my best shot.
oh to clarify
"it might be ok if this is well documented and well checked." -> if it's something we could formally document it might be ok
"checkType in schema.R is strangely largely unused or untested though." -> I don't know if the string specification is actually accidental - I see no use of it outside of checkType and it's not referenced anywhere. It's definitely not being tested either
Actually, on 2nd thought, I recall there's a JIRA on accepting the JSON string as schema that is supported on Scala side. That might be a better way to go if we are to take a string, instead of inventing our own format. But that could be a bigger change.
Hm, @felixcheung, I think this resembles a catalog string, so maybe we could reuse CatalystSqlParser.parseDataType to make this more formal and to avoid duplicating the effort of defining and documenting a format. This might be a big change, but if this is what we want in the future, I would like to argue that we should keep it this way.
For the JSON string schema, there is an overloaded version of from_json that takes that schema string. If we are going to expose it, I think it can be easily done.
However, I think you meant it is a bigger change because we need to provide a way to produce this JSON string from types. To my knowledge, we can only manually specify the type via this catalog string in R. Is this true? If so, I don't have a good idea for supporting this for now, and I would rather close this if you think so as well.
If it is a format that is documented and globally defined in Spark, it shouldn't be a problem to use that. If a variant of from_json is already taking a schema string, we should be able to call it directly from R without having to make it public, although it might make sense to do so if, say, read.format("json") supports it too.
I'm not super worried about having a way to produce the JSON schema string for the R user - we would just accept such a JSON schema string and assume the user can create it.
But yea, overall this feels very heavyweight just to support multiple JSON objects in a column value. How about just taking a bool flag like we started with? :)
Doh, I meant there is a public from_json that takes JSON string schema -
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Lines 3061 to 3062 in 369a148
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
  from_json(e, DataType.fromJson(schema), options)
Yup, let me take a shot at providing an option for it first, to see if it looks okay to you.
yap that's what I thought. sounds good!
@felixcheung, I tried to add an optional parameter,
Test build #74326 has finished for PR 17178 at commit
a couple of thoughts
function(x, schema, asArray = FALSE, ...) {
  if (asArray) {
    jschema <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                           "createArrayType",
do we need a wrapper, actually? can't we call newJObject?
Ah, sure. Let me try.
It seems we'd better use
DataTypes.createArrayType
per
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala
Lines 49 to 56 in 04ee8cf
 * The data type for collections of multiple values.
 * Internally these are represented as columns that contain a ``scala.collection.Seq``.
 *
 * Please use `DataTypes.createArrayType()` to create a specific instance.
 *
 * An [[ArrayType]] object comprises two fields, `elementType: [[DataType]]` and
 * `containsNull: Boolean`. The field of `elementType` is used to specify the type of
 * array elements. The field of `containsNull` is used to specify if the array has `null` values.
Let me remove that new wrapper and use the original one.
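As a rough illustration of the flag-based design discussed here, the sketch below shows the intended behaviour in plain R: the caller keeps passing the element schema, and the flag decides whether it is wrapped in an array type. The function `resolveSchema` and the list-based type representation are hypothetical stand-ins; the real code goes through JVM objects rather than R lists. The `containsNull = TRUE` default mirrors what the single-argument `DataTypes.createArrayType` does on the JVM side.

```r
# Hypothetical stand-in: types are plain R lists here, not JVM objects.
# The flag name asArray follows the diff hunk under review.
resolveSchema <- function(schema, asArray = FALSE) {
  if (asArray) {
    # Wrap the element schema, as DataTypes.createArrayType(elementType) would.
    list(type = "array", elementType = schema, containsNull = TRUE)
  } else {
    # A single JSON object per value: use the schema unchanged.
    schema
  }
}

personType <- list(type = "struct", fields = list(name = "string"))

single <- resolveSchema(personType)
many <- resolveSchema(personType, asArray = TRUE)
```

With this shape the user never has to construct an array type themselves, which is the point of preferring a flag over a new schema format.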
Thank you @felixcheung. Let me try to handle the comments soon.
Hmmm.. I am not too sure. Maybe,
Yup, it requires that the input is a JSON array (the multi-line case should already be fine regardless of this option, because that limitation came from initially reading the data from files before parsing the JSON. If these multi-line JSON strings are already read into a column, parsing them should be fine.).
How about
seems like
Let me go for
Doh! It seems it failed lint (surprising) ... Let me go for
#' @note from_json since 2.2.0
setMethod("from_json", signature(x = "Column", schema = "structType"),
          function(x, schema, ...) {
          function(x, schema, asJsonArray = FALSE, ...) {
@felixcheung, I am still not fully sure of this name, asJsonArray. I am okay if you have a better one.
Test build #74507 has finished for PR 17178 at commit
that's ridiculous ... dot in names is the norm https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html anyway
felixcheung left a comment
I think it's fine then.
one comment
R/pkg/R/functions.R
Outdated
#'
#' @param x Column containing the JSON string.
#' @param schema a structType object to use as the schema to use when parsing the JSON string.
#' @param asJsonArray indicating if input string is JSON array or object.
maybe clarifying
JSON array or object. -> JSON array of objects or a single object.
@felixcheung, thank you for your close look and asking my opinion.
Test build #74559 has finished for PR 17178 at commit
thanks!
What changes were proposed in this pull request?
Since we could not directly define the array type in R, this PR proposes to support array types in R as the string types that are used in structField.
How was this patch tested?
Unit tests in test_sparkSQL.R.