[SPARK-19018][SQL] ADD csv write charset param #16428
Conversation
Can one of the admins verify this patch?

Do you mind if I ask whether it writes the line separator correctly in the encoding specified in the option?

I can understand not hard-coding UTF-8 as the output encoding -- that's the core problem, right? But how about just using the existing encoding parameter to control this? It's conceivable, but pretty obscure, that someone would want to output a different encoding.

Because the writer can't set an encoding, we have to convert UTF-8 to GB18030 ourselves for Chinese text, so I think we should provide a setting for it.
@srowen Microsoft Office can't open a CSV file correctly with UTF-8 encoding when it contains Chinese characters.

Let's start by not hard-coding UTF-8, only. If you're saying that the output is correctly rendered as UTF-8 and MS Office doesn't open that, I'd be really surprised. That's an Office bug, though.

BTW, the reason I asked that in #16428 (comment) is that I remember checking the reading/writing paths related to encodings before, and the encoding has to be set on the line record reader. I just now double-checked that newlines were … We should add some tests for this for sure, in … As a small side note, we don't currently support non-ASCII-compatible encodings in the reading path, if I haven't missed some changes there.
@HyukjinKwon, I see, because my version is …

@HyukjinKwon, I already ran …
Ah, I meant to add a test here in this PR. |
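For context, here is a minimal usage sketch of what this change is aiming at, assuming the option is ultimately exposed as `encoding` on the CSV writer as discussed below; the output path is illustrative only.

import org.apache.spark.sql.SparkSession

// Works the same way in spark-shell, where `spark` and the implicits already exist.
val spark = SparkSession.builder().master("local[*]").appName("csv-write-encoding").getOrCreate()
import spark.implicits._

val df = Seq(("1", "中文")).toDF("num", "language")

// With the proposed option, the bytes on disk would be GB18030 rather than the
// hard-coded UTF-8, so tools expecting GB18030 (the reporter's MS Office case) can open the file.
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv("/tmp/csv-gb18030")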
val charset = parameters.getOrElse("encoding",
val readCharSet = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
val writeCharSet = parameters.getOrElse("writeEncoding",
I think we should not necessarily introduce an additional option. We could just use the charset variable, because other options such as nullValue are already applied to both reading and writing.
@HyukjinKwon I think so
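As a rough sketch of that suggestion (the `parameters` map below is an illustrative stand-in for the option map CSVOptions actually receives), a single resolved charset could serve both paths:

import java.nio.charset.StandardCharsets

// Illustrative stand-in for the CSV option map.
val parameters: Map[String, String] = Map("encoding" -> "GB18030")

// One charset, resolved once and used for both the read and write paths,
// mirroring how options such as nullValue already apply to both.
val charset: String = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))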
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
We should also add the same documentation in readwriter.py.
OK, I will write my unit test and modify this pull request.
python/pyspark/sql/readwriter.py
@since(2.0)
def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
def csv(self, path, mode=None, compression=None, sep=None, encoding=None, quote=None, escape=None,
We need to place this new option at the end. Otherwise, it will break existing code that uses these options as positional arguments (not keyword arguments).
import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
import org.apache.spark.sql.types._

//noinspection ScalaStyle
We can disable scalastyle for only the lines that need it with a block as below, if you need this for non-ASCII characters:
// scalastyle:off
...
// scalastyle:on

* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
* <li>`encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding</li>
Could we resemble the documentation in DataFrameReader, just for consistency?
<li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
* type.</li>
@HyukjinKwon OK, I'll refine it.
@HyukjinKwon What about:
<li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
* type.</li>
looks good.
Oh, also, it seems the newly added option here should go between the existing <ul> and </ul> so that it renders fine in the Scala/Java API documentation.
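For illustration only, the placement being asked for would look roughly like this in the Scaladoc, with the surrounding option entries abbreviated:

/**
 * ...
 * <ul>
 * ... existing option entries ...
 * <li>`encoding` (default `UTF-8`): encodes the CSV files by the given encoding
 * type.</li>
 * </ul>
 */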
python/pyspark/sql/readwriter.py
:param sep: sets the single character as a separator for each field and value. If None is
    set, it uses the default value, ``,``.
:param encoding: sets writer CSV files by the given encoding type. If None is set,
    it uses the default value, ``UTF-8``.
Here too, let's resemble the one in DataFrameReader above in this file.
}

test("save data with gb18030") {
withTempPath{ path =>
nit: it should be withTempPath { path =>.
.option("encoding", "GB18030")
.csv(path.getAbsolutePath)

checkAnswer(df, Row("1", "中文"))
Could we write this something like the below?
// scalastyle:off
val df = Seq(("1", "中文")).toDF("num", "language")
// scalastyle:on
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)
val readBack = spark.read
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)
checkAnswer(df, readBack)
sounds good
@HyukjinKwon I modified it in this commit. Please review it, thanks.
What changes were proposed in this pull request?
Add CSV write charset support (SPARK-19018).
How was this patch tested?