
Use spark.write.csv in to_csv of Series and DataFrame #749

Merged
HyukjinKwon merged 2 commits into databricks:master from to_csv
Sep 10, 2019

Conversation

HyukjinKwon
Member

This PR proposes to use the spark.write.csv API to enable distributed computation when a path is specified. If no path is specified, it falls back to pandas' to_csv as before.

Closes #677
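The dispatch the description outlines — write via Spark when a path is given, return a pandas-style CSV string otherwise — can be sketched with the stdlib standing in for both engines. `to_csv_sketch`, its row-list input, and the plain-file write are illustrative stand-ins, not the Koalas API:

```python
import csv
import io


def to_csv_sketch(rows, path=None, sep=","):
    """Illustrative dispatch: with no path, return the CSV text directly
    (mirroring pandas' to_csv); with a path, persist to that path and
    return None (a plain-file stand-in for spark.write.csv)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=sep, lineterminator="\n").writerows(rows)
    if path is None:
        # pandas-style behaviour: hand the CSV string back to the caller
        return buf.getvalue()
    # distributed-write stand-in: write to the target and return nothing
    with open(path, "w", encoding="utf-8") as f:
        f.write(buf.getvalue())
```

The point of the sketch is only the branch on `path`: the same call site either returns a string or produces files, which is exactly the behavioural split this PR introduces.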

@codecov-io

codecov-io commented Sep 5, 2019

Codecov Report

Merging #749 into master will decrease coverage by 0.07%.
The diff coverage is 86.36%.


@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
- Coverage   93.91%   93.84%   -0.08%     
==========================================
  Files          32       32              
  Lines        5669     5683      +14     
==========================================
+ Hits         5324     5333       +9     
- Misses        345      350       +5
Impacted Files                    Coverage Δ
databricks/koalas/generic.py      95.23% <86.36%> (-0.77%) ⬇️
databricks/koalas/namespace.py    81% <0%> (-1%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b12990a...7db2d08.

@@ -43,32 +43,6 @@ def strip_all_whitespace(str):
"""A helper function to remove all whitespace from a string."""
return str.translate({ord(c): None for c in string.whitespace})

def test_csv(self):
HyukjinKwon (Member, Author):

Moved to test_csv.py

@HyukjinKwon HyukjinKwon force-pushed the to_csv branch 2 times, most recently from 4205fb9 to b41a9c3 Compare September 6, 2019 02:33
"""
def to_csv(self, path=None, sep=',', na_rep='', columns=None, header=True,
quotechar='"', date_format=None, escapechar=None, num_files=None,
**kwargs):
HyukjinKwon (Member, Author) commented Sep 6, 2019:

Hmm .. should I maybe not add these kwargs that directly allow PySpark's options for now, @ueshin?

ueshin (Collaborator):

I think it's okay to have it. Actually I added options for read_sql()/_table()/_query().
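Forwarding free-form kwargs to the underlying writer, as discussed here, could look roughly like this; `build_options` and its defaults are an illustrative sketch, not PySpark's or Koalas' actual code:

```python
def build_options(sep=",", na_rep="", header=True, **options):
    """Illustrative: fold the named to_csv parameters and any
    pass-through kwargs into one option dict, the shape a writer's
    options(**options) call could then consume."""
    merged = {"sep": sep, "nullValue": na_rep, "header": header}
    merged.update(options)  # caller-supplied kwargs take precedence
    return merged
```

The design question in the thread is exactly this trade-off: accepting `**kwargs` keeps the signature open to backend options without enumerating them all up front.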

@HyukjinKwon HyukjinKwon force-pushed the to_csv branch 6 times, most recently from 33e82e6 to e67a757 Compare September 6, 2019 04:03
@softagram-bot

Softagram Impact Report for pull/749 (head commit: 7db2d08)


@HyukjinKwon
Member Author

Okie .. build passed.

ueshin (Collaborator) left a comment:

Otherwise, LGTM.

... date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
... country=['KR', 'US', 'JP'],
... code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

+ELLIPSIS is not needed?

HyukjinKwon (Member, Author):

oh that's needed — the ... is for the index.

HyukjinKwon (Member, Author):

here:

Expected:
                       date
    ... 2012-01-31 12:00:00
    ... 2012-02-29 12:00:00
    ... 2012-03-31 12:00:00
Got:
                     date
    0 2012-01-31 12:00:00
    1 2012-02-29 12:00:00
    2 2012-03-31 12:00:00
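The effect can be reproduced with the stdlib doctest module alone — the leading index digit below plays the part of the value that +ELLIPSIS absorbs (the docstring is illustrative, not the Koalas one):

```python
import doctest

# Expected output uses "..." where the index "0" appears in the
# actual output; +ELLIPSIS lets it match, +NORMALIZE_WHITESPACE
# smooths the column alignment.
docstring = '''
>>> print("  date\\n0 2012-01-31 12:00:00")  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
date
... 2012-01-31 12:00:00
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(docstring, {}, "ellipsis_demo", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed)  # 0: the "..." absorbed the index digit
```

Without +ELLIPSIS, the literal "..." in the expected output fails to match the index column, which is the Expected/Got mismatch shown above.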

>>> df.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
... path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

ditto.

>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
... path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

ditto.

@HyukjinKwon
Member Author

Thanks, @ueshin. Merged to master.

@HyukjinKwon HyukjinKwon merged commit 5e476ed into databricks:master Sep 10, 2019
@HyukjinKwon HyukjinKwon deleted the to_csv branch November 6, 2019 02:20