
Conversation

@cjuexuan

What changes were proposed in this pull request?

Add CSV write charset support (see JIRA).


@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Do you mind if I ask whether it writes the line separator correctly in the encoding specified in the option?

@srowen
Member

srowen commented Dec 30, 2016

I can understand not hard-coding UTF-8 as the output encoding -- that's the core problem, right? But how about just using the existing encoding parameter to control this? It's conceivable, but pretty obscure, that someone would want to output a different encoding.

@cjuexuan
Author

Because the writer can't set the encoding, we have to convert UTF-8 to GB18030 ourselves for Chinese text, so I think we should provide a setting for it.

@cjuexuan
Author

@srowen Microsoft Office can't open a CSV file correctly with UTF-8 encoding when it contains Chinese characters.

@srowen
Member

srowen commented Dec 30, 2016

Let's start by just not hard-coding UTF-8. If you're saying the output is correctly encoded as UTF-8 and MS Office doesn't open it, I'd be really surprised -- that would be an Office bug, though.
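
For context, the usage this change aims to enable would look roughly like the following (a sketch; `df` is any DataFrame, the output path is hypothetical, and `encoding` is the option name the discussion below settles on):

// Write a DataFrame as CSV in GB18030, e.g. so spreadsheet tools on a
// Chinese locale can open it correctly.
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv("/tmp/output")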

@HyukjinKwon
Member

HyukjinKwon commented Dec 30, 2016

BTW, the reason I asked that in #16428 (comment) is that I remember checking the reading/writing paths related to encodings before, and the encoding had to be set on the line record reader.

I just now double-checked that newlines were \n for each batch due to TextOutputFormat's record writer, but it seems that was changed in a recent commit. So now the newlines seem to be fully dependent on the univocity library.

We should definitely add some tests for this in CSVSuite, to verify this behaviour and prevent regressions.

As a small side note, we don't currently support non-ASCII-compatible encodings in the reading path, if I haven't missed any changes there.

@cjuexuan
Author

@HyukjinKwon, I see. Because my version is 2.0.2, we use ByteArrayOutputStream and call its toString method, which uses Charset.defaultCharset() and is therefore bound to the environment. In the master branch this is already fixed, so I agree with @srowen: we should just stop hard-coding UTF-8 and let users set the encoding for their writer.
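
A minimal illustration of the 2.0.2 behaviour described above, using only the JDK: ByteArrayOutputStream.toString() with no argument decodes the buffer with the platform default charset, so the result depends on the environment.

import java.io.ByteArrayOutputStream

val out = new ByteArrayOutputStream()
out.write("中文".getBytes("GB18030"))
val platformDependent = out.toString   // uses Charset.defaultCharset(), environment-dependent
val explicit = out.toString("GB18030") // decodes correctly regardless of environment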

@cjuexuan
Author

@HyukjinKwon, I already ran CSVSuite and all tests passed.

@HyukjinKwon
Member

HyukjinKwon commented Dec 30, 2016

Ah, I meant to add a test here in this PR.

-val charset = parameters.getOrElse("encoding",
+val readCharSet = parameters.getOrElse("encoding",
   parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
+val writeCharSet = parameters.getOrElse("writeEncoding",
Member

I don't think we necessarily need to introduce an additional option. We could just use the charset variable, because other options such as nullValue already apply to both reading and writing.
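
A self-contained sketch of what that suggestion amounts to (here `parameters` stands in for the options map in CSVOptions):

import java.nio.charset.StandardCharsets

// One charset option applied to both the read and write paths,
// mirroring how options such as nullValue already behave.
val parameters: Map[String, String] = Map("encoding" -> "GB18030")
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))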

Author

@HyukjinKwon I think so

* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add the same documentation in readwriter.py.

Author

OK, I will write my unit test and modify this pull request.


 @since(2.0)
-def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
+def csv(self, path, mode=None, compression=None, sep=None, encoding=None, quote=None, escape=None,
Member

We need to place this new option at the end. Otherwise, it will break existing code that passes these options positionally (not as keyword arguments), because every argument after sep would silently shift by one, e.g. a caller's quote would now be bound to encoding.

import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
import org.apache.spark.sql.types._

//noinspection ScalaStyle
Member

We can restrict the style-check suppression to only the lines that need it, with a block like the one below, if you need this for non-ASCII characters:

// scalastyle:off
...
// scalastyle:on

* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
Member

Could we just match the documentation in DataFrameReader, for consistency?

 <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
   * type.</li>

Author

@HyukjinKwon OK, I'll refine it.

Author

@HyukjinKwon what about

<li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
  * type.</li>  

Member

looks good.

Member

HyukjinKwon commented Dec 31, 2016

Oh, also, it seems the newly added option should sit between the existing <ul> and </ul> so that it renders correctly in the Scala/Java API documentation.
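
Concretely, the suggested placement would look roughly like this in the Scaladoc (a sketch; the surrounding list items are elided):

 * <ul>
 *   ...
 *   <li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
 *   type.</li>
 * </ul>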

:param sep: sets the single character as a separator for each field and value. If None is
set, it uses the default value, ``,``.
:param encoding: sets writer CSV files by the given encoding type. If None is set,
it uses the default value, ``UTF-8``.
Member

Here too, let's match the one in DataFrameReader above in this file.

}

test("save data with gb18030") {
withTempPath{ path =>
Member

nit: it should be withTempPath { path =>.

.option("encoding", "GB18030")
.csv(path.getAbsolutePath)

checkAnswer(df, Row("1", "中文"))
Member

Could we write this like the example below?

// scalastyle:off
val df = Seq(("1", "中文")).toDF("num", "language")
// scalastyle:on
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)

val readBack = spark.read
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)
 
checkAnswer(df, readBack)

Author

sounds good

@cjuexuan
Author

@HyukjinKwon I've modified it in this commit, please review it, thanks.
