
Conversation

@cjuexuan

What changes were proposed in this pull request?

Add CSV write charset support (see JIRA).


@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Do you mind if I ask whether it writes the line separator correctly in the encoding specified in the option?

@srowen
Member

srowen commented Dec 30, 2016

I can understand not hard-coding UTF-8 as the output encoding -- that's the core problem, right? But how about just using the existing encoding parameter to control this? It's conceivable, but pretty obscure, that someone would want to output a different encoding.

@cjuexuan
Author

Because the writer can't set the encoding, we have to convert UTF-8 to GB18030 ourselves for Chinese text, so I think we should provide a setting for it.

@cjuexuan
Author

@srowen Microsoft Office can't open a CSV file correctly with UTF-8 encoding when it contains Chinese characters.

@srowen
Member

srowen commented Dec 30, 2016

Let's start by just not hard-coding UTF-8. If you're saying the output is correctly encoded as UTF-8 and MS Office doesn't open it, I'd be really surprised -- that would be an Office bug, though.
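
For context, the usage this change aims to enable would look roughly like the following (a sketch; `df` is any DataFrame, the output path is hypothetical, and `encoding` is the option name the discussion below settles on):

// Write a DataFrame as CSV in GB18030, e.g. so spreadsheet tools on a
// Chinese locale can open it correctly.
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv("/tmp/output")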

@HyukjinKwon
Member

HyukjinKwon commented Dec 30, 2016

BTW, the reason I asked that in #16428 (comment) is that I remember checking the reading/writing paths related to encodings before, and the encoding had to be set on the line record reader.

I just now double-checked that newlines were \n for each batch due to TextOutputFormat's record writer, but it seems that was changed in a recent commit. So now the newlines seem to be fully dependent on the univocity library.

We should definitely add some tests for this in CSVSuite, to verify this behaviour and prevent regressions.

As a small side note, we don't currently support non-ASCII-compatible encodings in the reading path, if I haven't missed any changes there.

@cjuexuan
Author

@HyukjinKwon, I see. Because my version is 2.0.2, we use ByteArrayOutputStream and call its toString method, which uses Charset.defaultCharset() and is therefore bound to the environment. In the master branch this is already fixed, so I agree with @srowen: we should just stop hard-coding UTF-8 and let users set the encoding for their writer.
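
A minimal illustration of the 2.0.2 behaviour described above, using only the JDK: ByteArrayOutputStream.toString() with no argument decodes the buffer with the platform default charset, so the result depends on the environment.

import java.io.ByteArrayOutputStream

val out = new ByteArrayOutputStream()
out.write("中文".getBytes("GB18030"))
val platformDependent = out.toString   // uses Charset.defaultCharset(), environment-dependent
val explicit = out.toString("GB18030") // decodes correctly regardless of environment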

@cjuexuan
Author

@HyukjinKwon, I already ran CSVSuite and all tests passed.

@HyukjinKwon
Member

HyukjinKwon commented Dec 30, 2016

Ah, I meant to add a test here in this PR.

-val charset = parameters.getOrElse("encoding",
+val readCharSet = parameters.getOrElse("encoding",
   parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
+val writeCharSet = parameters.getOrElse("writeEncoding",
Member

I don't think we necessarily need to introduce an additional option. We could just use the charset variable, because other options such as nullValue already apply to both reading and writing.
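
A self-contained sketch of what that suggestion amounts to (here `parameters` stands in for the options map in CSVOptions):

import java.nio.charset.StandardCharsets

// One charset option applied to both the read and write paths,
// mirroring how options such as nullValue already behave.
val parameters: Map[String, String] = Map("encoding" -> "GB18030")
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))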

Author

@HyukjinKwon I think so

* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add the same documentation in readwriter.py.

Author

OK, I will write my unit test and modify this pull request.


 @since(2.0)
-def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
+def csv(self, path, mode=None, compression=None, sep=None, encoding=None, quote=None, escape=None,
Member

We need to place this new option at the end. Otherwise, it will break existing code that passes these options positionally (not as keyword arguments), because every argument after sep would silently shift by one, e.g. a caller's quote would now be bound to encoding.

import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
import org.apache.spark.sql.types._

//noinspection ScalaStyle
Member

We can restrict the style-check suppression to only the lines that need it, with a block like the one below, if you need this for non-ASCII characters:

// scalastyle:off
...
// scalastyle:on

* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
Member

Could we just match the documentation in DataFrameReader, for consistency?

 <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
   * type.</li>

Author

@HyukjinKwon OK, I'll refine it.

Author

@HyukjinKwon what about

<li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
  * type.</li>  

Member

looks good.

Member

HyukjinKwon commented Dec 31, 2016

Oh, also, it seems the newly added option should sit between the existing <ul> and </ul> so that it renders correctly in the Scala/Java API documentation.
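
Concretely, the suggested placement would look roughly like this in the Scaladoc (a sketch; the surrounding list items are elided):

 * <ul>
 *   ...
 *   <li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
 *   type.</li>
 * </ul>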

:param sep: sets the single character as a separator for each field and value. If None is
set, it uses the default value, ``,``.
:param encoding: sets writer CSV files by the given encoding type. If None is set,
it uses the default value, ``UTF-8``.
Member

Here too, let's match the one in DataFrameReader above in this file.

}

test("save data with gb18030") {
withTempPath{ path =>
Member

nit: it should be withTempPath { path =>.

.option("encoding", "GB18030")
.csv(path.getAbsolutePath)

checkAnswer(df, Row("1", "中文"))
Member

Could we write this like the example below?

// scalastyle:off
val df = Seq(("1", "中文")).toDF("num", "language")
// scalastyle:on
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)

val readBack = spark.read
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)
 
checkAnswer(df, readBack)

Author

sounds good

@cjuexuan
Author

@HyukjinKwon I've modified it in this commit, please review it, thanks.
