Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794
Codecov Report
@@            Coverage Diff            @@
##           master    #794    +/-   ##
=========================================
+ Coverage   94.28%   94.3%   +0.02%
=========================================
  Files          32      32
  Lines        5770    5828     +58
=========================================
+ Hits         5440    5496     +56
- Misses        330     332      +2
databricks/koalas/frame.py
Outdated
# make column string list to drop internal spark dataframe fields
columns = [col[0] for col in columns]
internal = self._internal.copy(
    sdf=self._sdf.drop(*columns),
Actually, this doesn't seem to cover the multi-index case.
I think you can just select the existing data_columns:
sdf=self._sdf.select(
self._internal.index_scols + [self._internal.scol_for(col) for col in cols])
Thanks for the comment! 😃
I just tested it for the multi-index case, as shown below:
>>> import databricks.koalas as ks
>>> import pandas as pd
>>> import numpy as np
>>>
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> idx = pd.MultiIndex.from_arrays(arrays, names=('Index', 'color'))
>>> pdf = pd.DataFrame(np.random.randn(4, 5), idx)
>>> kdf = ks.from_pandas(pdf)
>>>
>>> kdf
0 1 2 3 4
Index color
1 red 0.304819 1.373173 -0.095708 -0.165494 -0.922387
blue -0.538924 0.623598 0.705721 -0.006320 1.173270
2 red 1.397902 -1.870591 -0.294745 -0.100288 0.802501
blue 1.922724 -0.314832 1.279700 0.414461 -0.010711
>>>
>>> kdf = kdf.drop(['0', '1', '2'])
>>> kdf
3 4
Index color
1 red -0.165494 -0.922387
blue -0.006320 1.173270
2 red -0.100288 0.802501
blue 0.414461 -0.010711
>>>
>>> kdf._sdf.show()
+-----+-----+--------------------+--------------------+
|Index|color| 3| 4|
+-----+-----+--------------------+--------------------+
| 1| red| -0.1654939967509364| -0.922387275283718|
| 1| blue|-0.00632027805463...| 1.1732703308776347|
| 2| red|-0.10028785194871741| 0.8025013695853971|
| 2| blue| 0.41446113311041816|-0.01071114630144753|
+-----+-----+--------------------+--------------------+
I think it also works for the multi-index case.
Or is there something wrong with how I tested the multi-index case?
I think @HyukjinKwon meant multi-index columns.
Example:
import pandas as pd
import databricks.koalas as ks
pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})
columns = [('e', 'a'), ('e', 'b'), ('f', 'c'), ('f', 'd')]
pdf.columns = pd.MultiIndex.from_tuples(columns)
kdf = ks.DataFrame(pdf)
>>> kdf
e f
a b c d
0 1 3 5 7
1 2 4 6 8
>>> kdf._sdf.show()
+-----------------+----------+----------+----------+----------+
|__index_level_0__|('e', 'a')|('e', 'b')|('f', 'c')|('f', 'd')|
+-----------------+----------+----------+----------+----------+
| 0| 1| 3| 5| 7|
| 1| 2| 4| 6| 8|
+-----------------+----------+----------+----------+----------+
@harupy,
You're the best! This is really helpful for me. Thanks! 😃
For example, kdf.drop('e') should remove both "('e', 'a')" and "('e', 'b')" from _sdf, but the current implementation just does _sdf.drop('e'), which has no effect at all because there is no column named 'e' in _sdf.
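The matching described above can be sketched in plain Python, without Spark. This is only an illustration, not the code from this PR: `columns_to_keep` and `spark_column_name` are hypothetical helpers, and the assumption that the Spark column name is the string form of the column tuple is taken from the `kdf._sdf.show()` output earlier in this thread.

```python
from typing import List, Tuple, Union

Label = Union[str, Tuple[str, ...]]


def columns_to_keep(column_index: List[Tuple[str, ...]],
                    labels: List[Label]) -> List[Tuple[str, ...]]:
    """Return the column-index tuples that survive dropping `labels`.

    A label matches a column when it equals the leading levels of that
    column's tuple, so 'e' matches both ('e', 'a') and ('e', 'b').
    """
    # Normalize plain strings to 1-tuples so every label is a tuple prefix.
    prefixes = [lbl if isinstance(lbl, tuple) else (lbl,) for lbl in labels]
    return [idx for idx in column_index
            if not any(idx[:len(p)] == p for p in prefixes)]


def spark_column_name(idx: Tuple[str, ...]) -> str:
    # Assumption: mirrors the names shown by kdf._sdf.show(), e.g. "('e', 'a')".
    return str(idx)


column_index = [('e', 'a'), ('e', 'b'), ('f', 'c'), ('f', 'd')]
kept = columns_to_keep(column_index, ['e'])

# Select the index column plus the surviving data columns, instead of
# calling drop('e'), which matches no Spark column name.
selected = ['__index_level_0__'] + [spark_column_name(idx) for idx in kept]
print(selected)
```

Selecting the surviving columns, rather than dropping by label, avoids having to translate each dropped label into every Spark column name it covers.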
@harupy, I really appreciate your help. I think it's finished now. Could you check this when you're available?
@harupy Could I use your multi-index example as an additional doctest for DataFrame.drop(), if you don't mind?
@itholic
Yes of course you can 😃😃😃
@harupy Thanks. I just added it. 😸
Let me leave it to @ueshin
databricks/koalas/frame.py
Outdated
@@ -4344,24 +4344,6 @@ def drop(self, labels=None, axis=1,
0 1 7
1 2 8

>>> pdf = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]})
Let's keep these doctests.
We need to add columns=[...] for py3.5:
>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]}, columns=['x', 'y', 'z', 'w'])
or we can do:
>>> df = ks.DataFrame({('a', 'x'): [1, 2], ('a', 'y'): [3, 4], ('b', 'z'): [5, 6], ('b', 'w'): [7, 8]},
... columns=[('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')])
@ueshin I really appreciate your comment, Takuya!! I was struggling with this failure all night long 😭
@itholic
In Python < 3.6, pandas.DataFrame sorts the keys when it takes a dict without columns, using the function below, because key insertion order is NOT preserved in Python < 3.6.
def dict_keys_to_ordered_list(mapping):
    # when pandas drops support for Python < 3.6, this function
    # can be replaced by a simple list(mapping.keys())
    if PY36 or isinstance(mapping, OrderedDict):
        keys = list(mapping.keys())
    else:
        keys = try_sort(mapping)
    return keys
Example:
>>> sys.version
'3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)]'
>>> data = {
... 'x': [1, 2],
... 'y': [3, 4],
... 'z': [5, 6],
... 'w': [7, 8]
... }
>>> data
{'y': [3, 4], 'z': [5, 6], 'x': [1, 2], 'w': [7, 8]}
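The effect on doctests can be sketched without pandas. This is a simplified stand-in, not pandas code: `resolve_column_order` is a hypothetical helper, its `dict_is_ordered` flag plays the role of the `PY36` check above, and plain `sorted` stands in for `try_sort`.

```python
from collections import OrderedDict


def resolve_column_order(mapping, columns=None, dict_is_ordered=True):
    """Hypothetical helper: decide the column order for a dict input.

    An explicit `columns` list always wins, which is why adding
    columns=[...] makes the doctest output stable on every Python version.
    """
    if columns is not None:
        return list(columns)
    if dict_is_ordered or isinstance(mapping, OrderedDict):
        return list(mapping.keys())
    # Fallback for Python < 3.6: the keys are sorted, so 'w' comes first.
    return sorted(mapping)


data = {'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]}

# What a Python < 3.6 interpreter would effectively produce without columns.
print(resolve_column_order(data, dict_is_ordered=False))

# Passing columns explicitly pins the order regardless of the interpreter.
print(resolve_column_order(data, columns=['x', 'y', 'z', 'w'], dict_is_ordered=False))
```

With the sorted fallback, the doctest would print the columns as w, x, y, z on Python 3.5 but x, y, z, w on newer versions, which is the mismatch the explicit `columns` argument removes.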
@harupy I love this comment! Finally, all my questions are solved. Thanks for sharing your knowledge. :)
databricks/koalas/frame.py
Outdated
internal = self._internal.copy(data_columns=list(cols), column_index=list(idx))
internal = self._internal.copy(
    sdf=self._sdf.select(
        self._internal.index_scols + [self._internal.scol_for(col) for col in cols]),
We should use idx rather than cols.
Maybe we should rename idx as idxes or something, then:
[self._internal.scol_for(idx) for idx in idxes]
@ueshin That makes sense. I fixed them. Thanks!! :)
Otherwise, LGTM.
databricks/koalas/frame.py
Outdated
>>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')]
>>> pdf.columns = pd.MultiIndex.from_tuples(columns)
>>> kdf = ks.DataFrame(pdf)
>>> kdf # doctest: +NORMALIZE_WHITESPACE
Shall we skip the pandas part?
>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]},
... columns=['x', 'y', 'z', 'w'])
>>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')]
>>> df.columns = pd.MultiIndex.from_tuples(columns)
>>> df # doctest: +NORMALIZE_WHITESPACE
@ueshin Sure, it looks better! Thanks for the review :)
Merged.
@HyukjinKwon Thanks! I'm going to move on to #791 again.
When we drop columns from a DataFrame with DataFrame.drop(), we get a DataFrame whose columns are dropped properly, like below.
But when we then look at the internal Spark DataFrame, it still shows the original one, with the columns not deleted, like below.
(Although I dropped the column 'name' in the example above, it is still shown in the internal Spark DataFrame.)
So I think we need to drop them from the internal Spark DataFrame, too,
like: