Use spark.write.csv in to_csv of Series and DataFrame #749
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
- Coverage   93.91%   93.84%   -0.08%
==========================================
  Files          32       32
  Lines        5669     5683      +14
==========================================
+ Hits         5324     5333       +9
- Misses        345      350       +5

Continue to review full report at Codecov.
@@ -43,32 +43,6 @@ def strip_all_whitespace(str):
     """A helper function to remove all whitespace from a string."""
     return str.translate({ord(c): None for c in string.whitespace})

-    def test_csv(self):
Moved to test_csv.py
Force-pushed from 4205fb9 to b41a9c3
databricks/koalas/generic.py (outdated)

    """
    def to_csv(self, path=None, sep=',', na_rep='', columns=None, header=True,
               quotechar='"', date_format=None, escapechar=None, num_files=None,
               **kwargs):
Hmm .. should I maybe not add this **kwargs to directly allow PySpark's options for now, @ueshin?
I think it's okay to have it. Actually, I added options for read_sql()/_table()/_query().
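For context, here is a minimal sketch of the idea under discussion, i.e. forwarding extra keyword arguments straight to PySpark's CSV writer as options. The helper name _to_spark_csv and the overwrite mode are assumptions for illustration, not the PR's actual code:

def _to_spark_csv(sdf, path, num_files=None, **options):
    """Hypothetical helper: write the Spark DataFrame sdf as CSV at path,
    passing any extra keyword arguments through as writer options."""
    if num_files is not None:
        # num_files controls how many part files Spark produces.
        sdf = sdf.repartition(num_files)
    writer = sdf.write.mode("overwrite")  # overwrite mode is an assumption here
    for key, value in options.items():
        # Each extra kwarg becomes a DataFrameWriter option,
        # e.g. compression='gzip' or quote='"'.
        writer = writer.option(key, value)
    writer.csv(path)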
Force-pushed from 33e82e6 to e67a757
Okie .. build passed.
Otherwise, LGTM.
... date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
... country=['KR', 'US', 'JP'],
... code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date")  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
+ELLIPSIS is not needed?
oh, that's needed for the "..." in the index.
here:
Expected:
date
... 2012-01-31 12:00:00
... 2012-02-29 12:00:00
... 2012-03-31 12:00:00
Got:
date
0 2012-01-31 12:00:00
1 2012-02-29 12:00:00
2 2012-03-31 12:00:00
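To see the directive in isolation, here is a tiny self-contained doctest with made-up data: with +ELLIPSIS, the literal "..." in the expected output matches the varying row index, which is exactly the mismatch shown in the Expected/Got diff above.

import doctest

def demo():
    """
    >>> print("     date\\n0  2012-01-31")  # doctest: +ELLIPSIS
         date
    ...  2012-01-31
    """

# The '...' in the expected output matches the row index '0' only
# because +ELLIPSIS is enabled; without it, doctest compares the
# text literally and reports the mismatch quoted above.
doctest.run_docstring_examples(demo, {}, verbose=True)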
>>> df.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
...     path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ditto.
>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ks.read_csv(
...     path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ditto.
Thanks, @ueshin. Merged to master.
This PR proposes to use the spark.write.csv API to enable distributed computation when path is specified. If path is not specified, it just calls pandas' to_csv as before.

Closes #677
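As a quick usage sketch of the behavior described above (the output path is a placeholder, and defaults may differ slightly from the merged code):

import databricks.koalas as ks

df = ks.DataFrame({"date": ["2012-01-31", "2012-02-29", "2012-03-31"],
                   "country": ["KR", "US", "JP"],
                   "code": [1, 2, 3]})

# With path, the write runs distributed through spark.write.csv;
# num_files=1 merges the output into a single part file.
df.to_csv(path="/tmp/to_csv/foo.csv", num_files=1)

# Without path, it falls back to pandas' to_csv and returns a string.
print(df.to_csv())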