
Use spark.write.csv in to_csv of Series and DataFrame #749

Merged
HyukjinKwon merged 2 commits into databricks:master from to_csv
Sep 10, 2019

Conversation

HyukjinKwon
Member

This PR proposes to use the spark.write.csv API to enable distributed computation when a path is specified. If no path is specified, it falls back to pandas' to_csv as before.

Closes #677
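The dispatch the description outlines — write via Spark when a path is given, return a pandas-style CSV string otherwise — can be sketched with the stdlib standing in for both engines. `to_csv_sketch`, its row-list input, and the plain-file write are illustrative stand-ins, not the Koalas API:

```python
import csv
import io


def to_csv_sketch(rows, path=None, sep=","):
    """Illustrative dispatch: with no path, return the CSV text directly
    (mirroring pandas' to_csv); with a path, persist to that path and
    return None (a plain-file stand-in for spark.write.csv)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=sep, lineterminator="\n").writerows(rows)
    if path is None:
        # pandas-style behaviour: hand the CSV string back to the caller
        return buf.getvalue()
    # distributed-write stand-in: write to the target and return nothing
    with open(path, "w", encoding="utf-8") as f:
        f.write(buf.getvalue())
```

The point of the sketch is only the branch on `path`: the same call site either returns a string or produces files, which is exactly the behavioural split this PR introduces.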

@codecov-io

codecov-io commented Sep 5, 2019

Codecov Report

Merging #749 into master will decrease coverage by 0.07%.
The diff coverage is 86.36%.


@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
- Coverage   93.91%   93.84%   -0.08%     
==========================================
  Files          32       32              
  Lines        5669     5683      +14     
==========================================
+ Hits         5324     5333       +9     
- Misses        345      350       +5
Impacted Files                    Coverage Δ
databricks/koalas/generic.py      95.23% <86.36%> (-0.77%) ⬇️
databricks/koalas/namespace.py    81% <0%> (-1%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b12990a...7db2d08.

@@ -43,32 +43,6 @@ def strip_all_whitespace(str):
"""A helper function to remove all whitespace from a string."""
return str.translate({ord(c): None for c in string.whitespace})

def test_csv(self):
HyukjinKwon (Member, Author):

Moved to test_csv.py

@HyukjinKwon HyukjinKwon force-pushed the to_csv branch 2 times, most recently from 4205fb9 to b41a9c3 Compare September 6, 2019 02:33
"""
def to_csv(self, path=None, sep=',', na_rep='', columns=None, header=True,
quotechar='"', date_format=None, escapechar=None, num_files=None,
**kwargs):
HyukjinKwon (Member, Author) commented Sep 6, 2019:

Hmm .. should I maybe not add these kwargs that directly allow PySpark's options for now, @ueshin?

ueshin (Collaborator):

I think it's okay to have it. Actually I added options for read_sql()/_table()/_query().
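Forwarding free-form kwargs to the underlying writer, as discussed here, could look roughly like this; `build_options` and its defaults are an illustrative sketch, not PySpark's or Koalas' actual code:

```python
def build_options(sep=",", na_rep="", header=True, **options):
    """Illustrative: fold the named to_csv parameters and any
    pass-through kwargs into one option dict, the shape a writer's
    options(**options) call could then consume."""
    merged = {"sep": sep, "nullValue": na_rep, "header": header}
    merged.update(options)  # caller-supplied kwargs take precedence
    return merged
```

The design question in the thread is exactly this trade-off: accepting `**kwargs` keeps the signature open to backend options without enumerating them all up front.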

@HyukjinKwon HyukjinKwon force-pushed the to_csv branch 6 times, most recently from 33e82e6 to e67a757 Compare September 6, 2019 04:03
@softagram-bot

Softagram Impact Report for pull/749 (head commit: 7db2d08)


@HyukjinKwon
Member Author

Okie .. build passed.

ueshin (Collaborator) left a comment:

Otherwise, LGTM.

... date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
... country=['KR', 'US', 'JP'],
... code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

+ELLIPSIS is not needed?

HyukjinKwon (Member, Author):

oh that's needed — the ... is for the index.

HyukjinKwon (Member, Author):

here:

Expected:
                       date
    ... 2012-01-31 12:00:00
    ... 2012-02-29 12:00:00
    ... 2012-03-31 12:00:00
Got:
                     date
    0 2012-01-31 12:00:00
    1 2012-02-29 12:00:00
    2 2012-03-31 12:00:00
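The effect can be reproduced with the stdlib doctest module alone — the leading index digit below plays the part of the value that +ELLIPSIS absorbs (the docstring is illustrative, not the Koalas one):

```python
import doctest

# Expected output uses "..." where the index "0" appears in the
# actual output; +ELLIPSIS lets it match, +NORMALIZE_WHITESPACE
# smooths the column alignment.
docstring = '''
>>> print("  date\\n0 2012-01-31 12:00:00")  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
date
... 2012-01-31 12:00:00
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(docstring, {}, "ellipsis_demo", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed)  # 0: the "..." absorbed the index digit
```

Without +ELLIPSIS, the literal "..." in the expected output fails to match the index column, which is the Expected/Got mismatch shown above.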

>>> df.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
... path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

ditto.

>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
... path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date") # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ueshin (Collaborator):

ditto.

@HyukjinKwon
Member Author

Thanks, @ueshin. Merged to master.

@HyukjinKwon HyukjinKwon merged commit 5e476ed into databricks:master Sep 10, 2019
@HyukjinKwon HyukjinKwon deleted the to_csv branch November 6, 2019 02:20