Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR targets to add the ability for dealing with an array (JSON array) in tablePropertyValue rule.

SQL

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue [2012, 1.1, 'null'], ...)

How was this patch tested?

Manually tested and test cases added in DDLParserSuite.scala.

@HyukjinKwon
Copy link
Member Author

cc @gatorsmile could you take a look please?

@viirya
Copy link
Member

viirya commented Dec 31, 2017

Is this a special feature for SparkSQL only? Seems Hive doesn't have such support.

Btw, is this any difference than using string? Like:

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue "[2012, 1.1, 'null']", ...)

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Dec 31, 2017

Yup, I was thinking of SparkSQL only feature.

For more details, the original intention was to support multiple values for nullValue but I realised such option support can be generallised - there have been several issues about this since CSV was thirdparty library (I will find and give some links if requested). Also, there is one reference in R too:

> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
  col1 col2
1    1   NA
2   NA    4

For more context, original proposal (Scala/SQL/Python/Java) here - #16611 touched many files and I received an advice to make this smaller, which I liked.

@HyukjinKwon
Copy link
Member Author

Btw, is this any difference than using string? Like:

Nope, they will be the same but I was thinking this is a simplest fix.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Dec 31, 2017

I actually think #20125 (comment) are good points and I was hesitant about it. Although IMHO I think it might be fine in a way but let me cc @hvanhovell and @rxin too, who reviewed my related PRs before.

@SparkQA
Copy link

SparkQA commented Dec 31, 2017

Test build #85555 has finished for PR 20125 at commit 5cae64b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

I am sorry it's been inactive. Let me update this one within a week.

@gatorsmile
Copy link
Member

Two general questions:

  • Any behavior change in the parser?
  • What is the corresponding interface for DataFrameReader and DataFrameWriter APIs?

@HyukjinKwon
Copy link
Member Author

1.. Any behavior change in the parser?

I believe there's no behaviour changes since option clause itself does not support [ and ] tokens:

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue [2012, 1.1, 'null'], ...)

Currently, option value takes, string`, integer, decimals, and bools. I believe it's not ambiguous or it doesn't introduce a behaviour change in our parser.

2.. What is the corresponding interface for DataFrameReader and DataFrameWriter APIs?

I wsa thinking about the interfaces as below:

Scala - Seq[String]

spark.read.format("csv")
  .option("nullValue", Seq("2012", "Tesla", "null"))
  ...

Java - String[]

spark.read().format("csv")
  .option("nullValue", new String[]{"", "null", "NA"})
  ...

Previous PR includes that APIs https://github.com/apache/spark/pull/16611/files

One concern is that:

OPTIONS (nullValue "[2012, 1.1, 'null']", ...)
option("[2012, 1.1, 'null']")

could work in the same way .. which is a bit ugly.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Dec 21, 2018

I was thinking about @cloud-fan's suggestion at #21192 (comment), and it looks feasible that we can additionally add an util that parses JSON array string to return an array but I wonder if that's possible in Data Source V1.

We can do that DataSourceOptions.getArrays for Data source V2. DataSourceOptions.paths uses a similar approach.

However, I don't think it's quite possible to Data Source V1, since we can't change the signatures from Map[String, String] to another.

Anyway I think we still need this change to follow @cloud-fan's suggestion.

@HyukjinKwon HyukjinKwon deleted the SPARK-17967-sql branch March 3, 2020 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants