Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 6, 2017

What changes were proposed in this pull request?

Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in structField as below:

jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, schema, asJsonArray = TRUE), "arrcol")))

prints

      arrcol
1 Bob, Alice

How was this patch tested?

Unit tests in test_sparkSQL.R.


schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
"struct<age:integer,height:double>")
for (schema in schemas) {
Copy link
Member Author

@HyukjinKwon HyukjinKwon Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-used the existing tests codes with the loop below:

- df <- as.DataFrame(j)
- schema <- structType(structField("age", "integer"),
-                      structField("height", "double"))
+ schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
+                 "struct<age:integer,height:double>")
+ for (schema in schemas) {
+   df <- as.DataFrame(j)
  ...
+ }

column(jc)
})

setClassUnion("characterOrstructType", c("character", "structType"))
Copy link
Member Author

@HyukjinKwon HyukjinKwon Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I use a class union here in this file?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 6, 2017

@felixcheung, this is a bit different with what we talked but I opened this because I thought you might like this more. This can take the type string given to structField now. If you are worried of this, let me change it back to the optional parameter one.

Could you check if this makes sense to you?

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73997 has finished for PR 17178 at commit 10feb9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# check if array type in string is correctly supported.
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
arr <- collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just in case, from_json takes both DataType and json string from DataType.json. In that sense, I thought it'd be nicer if it takes what structField takes in R)

expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)

schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
"struct<age:integer,height:double>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is too loosely-typed and format hard to explain/illustrate to R user
struct<age:integer,height:double>

Copy link
Member Author

@HyukjinKwon HyukjinKwon Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.. I persuaded myself that it is okay because it's a valid type string for structField. If you prefer optional parameter one, I could try. Otherwise, let me close this for now If you are not sure of both ways :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be ok if this is well documented and well checked.
checkType in schema.R is strangely largely unused or untested though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give a shot at my best.

Copy link
Member

@felixcheung felixcheung Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh to clarify

"it might be ok if this is well documented and well checked. "-> if it's something we could formally document it might be ok

"checkType in schema.R is strangely largely unused or untested though. "-> I don't know the string specification is actually accidental - I see no use of it outside of checkType and it's not referenced anywhere. It's definitely not being tested as well

Actually, on 2nd thought, I recall there's a JIRA on accepting the JSON string as schema that is supported on Scala side. That might be a better way to go if we are to take a string, instead of inventing our own format. But that could be a bigger change.

Copy link
Member Author

@HyukjinKwon HyukjinKwon Mar 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, @felixcheung, I think this resembles catalog string, maybe we could reuse CatalystSqlParser.parseDataType to make this more formal and to do not duplicate the efforts for defining a format or documentation. This might be a big change but if this is what we want in the future, I would like to argue that we should keep this way.

For JSON string schema, there is an overloaded version of from_json that takes that schema string. If we are going to expose it, I think it can be easily done.

However, I think you meant it is a bigger change because we need to provide a way to produce this JSON string from types. Up to my knowledge, we can only manually specify the type via this calalog string in R. Is this true? If so, I don't have a good idea for now to support this and I would rather close this if you so as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a format that is documented and globally defined in Spark, it shouldn't be a problem to use that. If a variant of from_json is already taking a schema string, we should be able call it directly from R without having to make it public, although it might make sense to do so, if, say, read.format("json") supports it too.

I'm not super worry about having a way to produce JSON schema string for the R user - we would just accept such JSON schema string and assume the user can create it.

But yea, overall this feels very heavy-weighed to support multiple JSON objects in a column value. How about just take a bool flag like we started with? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh, I meant there is a public from_json that takes JSON string schema -

def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
from_json(e, DataType.fromJson(schema), options)

Yup, let me give a shot to provide an option for it first to show if it looks okay to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yap that's what I thought. sounds good!

@HyukjinKwon
Copy link
Member Author

@felixcheung, I tried to add an optional parameter, asArray. Could you take a look and see if it makes sense?

@SparkQA
Copy link

SparkQA commented Mar 10, 2017

Test build #74326 has finished for PR 17178 at commit 5c2450d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

a couple of thoughts

  • a default FALSE parameter might be less intuitive - do you think we could come up with a default TRUE parameter (ok, single?)
  • if it's asX, typically it would be more R-like to name it as.X, eg. as.array
  • array might be a bit vague - does that require the column value to be a JSON array, or just multiple JSON object spanning multiple line?
[
    {
        "title": "1",
        edition: 3,
        year: 2011
    },
    {
        "title": "2",
        edition: 2,
        year: 2009
    }
]

function(x, schema, asArray = FALSE, ...) {
if (asArray) {
jschema <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
"createArrayType",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a wrapper, actually? can't we call newJObject?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sure. Let me try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we'd better use

DataTypes.createArrayType

per

* The data type for collections of multiple values.
* Internally these are represented as columns that contain a ``scala.collection.Seq``.
*
* Please use `DataTypes.createArrayType()` to create a specific instance.
*
* An [[ArrayType]] object comprises two fields, `elementType: [[DataType]]` and
* `containsNull: Boolean`. The field of `elementType` is used to specify the type of
* array elements. The field of `containsNull` is used to specify if the array has `null` values.

Let me remove that new wrapper and use the original one.

@HyukjinKwon
Copy link
Member Author

Thank you @felixcheung. Let me try to handle the comments soon.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 13, 2017

a default FALSE parameter might be less intuitive - do you think we could come up with a default TRUE parameter (ok, single?)
if it's asX, typically it would be more R-like to name it as.X, eg. as.array

Hmmm.. I am not too sure. Maybe, as.jsonobj.......? jsonobj?.. object?...as.single... struct... ? Not sure. Let me leave it for a day and revisit. If I can't think of a better one, then let me go for single.

array might be a bit vague - does that require the column value to be a JSON array, or just multiple JSON object spanning multiple line?

Yup, it requires that input is a JSON array (multiple line case should be already fine regardless of this option because that limitation came from initially reading the data from files before parsing JSONs. If these multiple line JSON strings are already read in a column, it should be fine for parsing.).

@HyukjinKwon
Copy link
Member Author

How about as.struct?

@felixcheung
Copy link
Member

seems like object is the right term in JSON lango - how about as.json.object? otherwise as.json.array (the opposite) or as.json.object.array might be good option

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 14, 2017

Let me go for as.json.array actually..

@HyukjinKwon
Copy link
Member Author

Doh! it seems it goes failed for lint (surprising) ...

R/functions.R:2456:31: style: Words within variable and function names should be separated by '_' rather than '.'.
          function(x, schema, as.json.array = TRUE, ...) {
                              ^~~~~~~~~~~~~
lintr checks failed.

Let me go for asJsonArray.

#' @note from_json since 2.2.0
setMethod("from_json", signature(x = "Column", schema = "structType"),
function(x, schema, ...) {
function(x, schema, asJsonArray = FALSE, ...) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung, I am still not fully sure of this name, asJsonArray. I am okay if you have a better one.

@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74507 has finished for PR 17178 at commit d01f952.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

that's ridiculous ... dot in names is the norm https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

anyway

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine then.
one comment

#'
#' @param x Column containing the JSON string.
#' @param schema a structType object to use as the schema to use when parsing the JSON string.
#' @param asJsonArray indicating if input string is JSON array or object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe clarifying
JSON array or object. -> JSON array of objects or a single object.

@HyukjinKwon
Copy link
Member Author

@felixcheung, thank you for your close look and asking my opinion.

@SparkQA
Copy link

SparkQA commented Mar 15, 2017

Test build #74559 has finished for PR 17178 at commit 8ec7cd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

thanks!
merged to master.

@asfgit asfgit closed this in d1f6c64 Mar 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants