Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

itholic · 2019-09-18T06:53:49Z

When we drop columns from dataframe with DataFrame.drop(),

We can get a dataframe which columns are dropped properly like below.

>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>>
>>> df = df.drop('name')
>>> df
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

But when we try to get an internal spark dataframe after then,

it shows us original one which is not delete columns like below.

>>> df._sdf.show()
+-----------------+------+------+---------+
|__index_level_0__|  name| class|max_speed|
+-----------------+------+------+---------+
|                0|falcon|  bird|    389.0|
|                1|parrot|  bird|     24.0|
|                2|  lion|mammal|     80.5|
|                3|monkey|mammal|     null|
+-----------------+------+------+---------+

(Although I dropped a column 'name' above example, it still shown in internal spark dataframe)

so i think maybe we need to drop them, too.

like:

>>> df._sdf.show()
+-----------------+------+---------+
|__index_level_0__| class|max_speed|
+-----------------+------+---------+
|                0|  bird|    389.0|
|                1|  bird|     24.0|
|                2|mammal|     80.5|
|                3|mammal|     null|
+-----------------+------+---------+

codecov-io · 2019-09-18T07:18:06Z

Codecov Report

Merging #794 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master    #794      +/-   ##
=========================================
+ Coverage   94.28%   94.3%   +0.02%     
=========================================
  Files          32      32              
  Lines        5770    5828      +58     
=========================================
+ Hits         5440    5496      +56     
- Misses        330     332       +2

Impacted Files	Coverage Δ
databricks/koalas/frame.py	`96.78% <ø> (-0.1%)`	⬇️
databricks/koalas/base.py	`94.8% <0%> (-0.07%)`	⬇️
databricks/koalas/internal.py	`95.85% <0%> (-0.04%)`	⬇️
databricks/koalas/missing/frame.py	`100% <0%> (ø)`	⬆️
databricks/koalas/namespace.py	`86.13% <0%> (ø)`	⬆️
databricks/koalas/missing/indexes.py	`100% <0%> (ø)`	⬆️
databricks/koalas/series.py	`95.16% <0%> (+0.02%)`	⬆️
databricks/koalas/groupby.py	`89.92% <0%> (+0.07%)`	⬆️
databricks/koalas/indexes.py	`99.22% <0%> (+0.09%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e992200...e23f9ee. Read the comment docs.

databricks/koalas/frame.py

HyukjinKwon · 2019-09-18T11:23:24Z

databricks/koalas/frame.py

+            # make column string list to drop internal spark dataframe fields
+            columns = [col[0] for col in columns]
+            internal = self._internal.copy(
+                sdf=self._sdf.drop(*columns),


Actually, this seems not covering multi-index case.

I think you can just select existing data_columns:

sdf=self._sdf.select( self._internal.index_scols + [self._internal.scol_for(col) for col in cols])

Thanks for comment! 😃

I just tested it for multi-index case like below,

>>> import databricks.koalas as ks >>> import pandas as pd >>> import numpy as np >>> >>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']] >>> idx = pd.MultiIndex.from_arrays(arrays, names=('Index', 'color')) >>> pdf = pd.DataFrame(np.random.randn(4, 5), idx) >>> kdf = ks.from_pandas(pdf) >>> >>> kdf 0 1 2 3 4 Index color 1 red 0.304819 1.373173 -0.095708 -0.165494 -0.922387 blue -0.538924 0.623598 0.705721 -0.006320 1.173270 2 red 1.397902 -1.870591 -0.294745 -0.100288 0.802501 blue 1.922724 -0.314832 1.279700 0.414461 -0.010711 >>> >>> kdf = kdf.drop(['0', '1', '2']) >>> kdf 3 4 Index color 1 red -0.165494 -0.922387 blue -0.006320 1.173270 2 red -0.100288 0.802501 blue 0.414461 -0.010711 >>> >>> kdf._sdf.show() +-----+-----+--------------------+--------------------+ |Index|color| 3| 4| +-----+-----+--------------------+--------------------+ | 1| red| -0.1654939967509364| -0.922387275283718| | 1| blue|-0.00632027805463...| 1.1732703308776347| | 2| red|-0.10028785194871741| 0.8025013695853971| | 2| blue| 0.41446113311041816|-0.01071114630144753| +-----+-----+--------------------+--------------------+

I think it also works for multi-index case,

But maybe is there a something wrong that i had tested about multi-index?

I think @HyukjinKwon meant multi-index columns.

Example:

import pandas as pd pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]}) columns = [('e', 'a'), ('e', 'b'), ('f', 'c'), ('f', 'd')] pdf.columns = pd.MultiIndex.from_tuples(columns) kdf = ks.DataFrame(pdf) >>> kdf e f a b c d 0 1 3 5 7 1 2 4 6 8 >>> kdf._sdf.show() +-----------------+----------+----------+----------+----------+ |__index_level_0__|('e', 'a')|('e', 'b')|('f', 'c')|('f', 'd')| +-----------------+----------+----------+----------+----------+ | 0| 1| 3| 5| 7| | 1| 2| 4| 6| 8| +-----------------+----------+----------+----------+----------+

@harupy ,
You're the best! it is really helpful for me. Thanks ! 😃

For example, kdf.drop('e') should remove both "('e', 'a')" and "('e', 'b')" from _sdf, but the current implementation just does _sdf.drop('e') which has no effect at all because there is no column named 'e' in _sdf.

@harupy ,Really appreciate for your helping. Now maybe i really finished. could you check this out if when you available??

@harupy Could i use your multi-index example as an additional doctest for DataFrame.drop() maybe if you don't mind??

@itholic
Yes of course you can 😃😃😃

@harupy Thanks. i just added it. 😸

HyukjinKwon · 2019-09-18T11:24:06Z

Let me leave it to @ueshin

ueshin · 2019-09-18T19:46:27Z

databricks/koalas/frame.py

@@ -4344,24 +4344,6 @@ def drop(self, labels=None, axis=1,
        0  1  7
        1  2  8

-        >>> pdf = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]})


Let's have this doctests.

Need to add columns=[...] for py3.5.

>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]}, columns=['x', 'y', 'z', 'w'])

or we can do:

>>> df = ks.DataFrame({('a', 'x'): [1, 2], ('a', 'y'): [3, 4], ('b', 'z'): [5, 6], ('b', 'w'): [7, 8]}, ... columns=[('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')])

@ueshin Really appreciate your comment Takuya!! I was really suffered from this failed for all night long 😭

@itholic
In python < 3.6, pandas.DataFrame sorts the key order when it takes dict without columns using the function below because the key insertion order is NOT preserved in python < 3.6.

def dict_keys_to_ordered_list(mapping): # when pandas drops support for Python < 3.6, this function # can be replaced by a simple list(mapping.keys()) if PY36 or isinstance(mapping, OrderedDict): keys = list(mapping.keys()) else: keys = try_sort(mapping) return keys

https://github.com/pandas-dev/pandas/blob/0a082d40c4a750655a7f1ce127c58ca26cd5905e/pandas/core/common.py#L218-L225

Example:

>>> sys.version '3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)]' >>> data = { ... 'x': [1, 2], ... 'y': [3, 4], ... 'z': [5, 6], ... 'w': [7, 8] ... } >>> data {'y': [3, 4], 'z': [5, 6], 'x': [1, 2], 'w': [7, 8]}

@harupy I love this comment! finally my question all solved. Thanks for sharing your knowledge. :)

ueshin · 2019-09-18T19:49:09Z

databricks/koalas/frame.py

-            internal = self._internal.copy(data_columns=list(cols), column_index=list(idx))
+            internal = self._internal.copy(
+                sdf=self._sdf.select(
+                    self._internal.index_scols + [self._internal.scol_for(col) for col in cols]),


We should use idx rather than cols.
Maybe we should rename idx as idxes or something, then:

[self._internal.scol_for(idx) for idx in idxes]

@ueshin it makes sense. i fixed them. Thanks!! :)

ueshin

Otherwise, LGTM.

ueshin · 2019-09-19T17:02:27Z

databricks/koalas/frame.py

+        >>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')]
+        >>> pdf.columns = pd.MultiIndex.from_tuples(columns)
+        >>> kdf = ks.DataFrame(pdf)
+        >>> kdf  # doctest: +NORMALIZE_WHITESPACE


Shall we skip pandas part?

>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]}, ... columns=['x', 'y', 'z', 'w']) >>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')] >>> df.columns = pd.MultiIndex.from_tuples(columns) >>> df # doctest: +NORMALIZE_WHITESPACE

@ueshin Sure, It looks better! Thanks for review :)

softagram-bot · 2019-09-19T23:58:45Z

Softagram Impact Report for pull/794 (head commit: `e23f9ee`)

⭐ Change Overview

(Open in Softagram Desktop for full details)

📄 Full report

Permalink: Full report for pull/794

Impact Report explained. Give feedback on this report to [email protected]

HyukjinKwon · 2019-09-20T05:27:30Z

Merged.

itholic · 2019-09-20T07:18:57Z

@HyukjinKwon Thanks! i'm going to move on to #791 again.

Drop internal spark dataframe columns also

7c9538e

itholic mentioned this pull request Sep 18, 2019

Implement DataFrame.pop #791

Merged

itholic changed the title ~~Drop internal dataframe also when drop dataframe.~~ Fix DataFrame.drop() to remove fields from Spark DataFrame also. Sep 18, 2019

HyukjinKwon reviewed Sep 18, 2019

View reviewed changes

databricks/koalas/frame.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Sep 18, 2019

View reviewed changes

databricks/koalas/frame.py Outdated Show resolved Hide resolved

Fix to not modifying internal

811aed0

HyukjinKwon reviewed Sep 18, 2019

View reviewed changes

itholic added 6 commits September 18, 2019 23:42

Support for MultiIndex column

43d3fbe

Add doctest for multiindex

bca770d

Add doctest NORMALIZE_WHITESPACE

fed4d46

Remove trailing whitespace

3f304ff

Add NORMALIZE

99f16ec

Remove multi-index test since lint failed

a50b104

ueshin reviewed Sep 18, 2019

View reviewed changes

itholic added 2 commits September 19, 2019 10:24

Fix & Readd doctest for python 3.5 compaitability

4792fed

Fix doctest

382987e

ueshin approved these changes Sep 19, 2019

View reviewed changes

Remove pandas part from doctest

e23f9ee

HyukjinKwon merged commit 275463a into databricks:master Sep 20, 2019

itholic deleted the fix_frame_drop branch October 14, 2019 04:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

itholic commented Sep 18, 2019 •

edited

Loading

codecov-io commented Sep 18, 2019 •

edited

Loading

HyukjinKwon Sep 18, 2019 •

edited

Loading

itholic Sep 18, 2019

harupy Sep 18, 2019 •

edited

Loading

itholic Sep 18, 2019

harupy Sep 18, 2019 •

edited

Loading

itholic Sep 18, 2019

itholic Sep 18, 2019 •

edited

Loading

harupy Sep 18, 2019 •

edited

Loading

itholic Sep 18, 2019

HyukjinKwon commented Sep 18, 2019

ueshin Sep 18, 2019

itholic Sep 19, 2019 •

edited

Loading

harupy Sep 19, 2019 •

edited

Loading

itholic Sep 19, 2019 •

edited

Loading

ueshin Sep 18, 2019

itholic Sep 19, 2019

ueshin left a comment

ueshin Sep 19, 2019

itholic Sep 19, 2019

softagram-bot commented Sep 19, 2019

HyukjinKwon commented Sep 20, 2019

itholic commented Sep 20, 2019

Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

Conversation

itholic commented Sep 18, 2019 • edited Loading

codecov-io commented Sep 18, 2019 • edited Loading

Codecov Report

HyukjinKwon Sep 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harupy Sep 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harupy Sep 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

itholic Sep 18, 2019 • edited Loading

Choose a reason for hiding this comment

harupy Sep 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HyukjinKwon commented Sep 18, 2019

Choose a reason for hiding this comment

itholic Sep 19, 2019 • edited Loading

Choose a reason for hiding this comment

harupy Sep 19, 2019 • edited Loading

Choose a reason for hiding this comment

itholic Sep 19, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ueshin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

softagram-bot commented Sep 19, 2019

Softagram Impact Report for pull/794 (head commit: e23f9ee)

⭐ Change Overview

📄 Full report

HyukjinKwon commented Sep 20, 2019

itholic commented Sep 20, 2019

itholic commented Sep 18, 2019 •

edited

Loading

codecov-io commented Sep 18, 2019 •

edited

Loading

HyukjinKwon Sep 18, 2019 •

edited

Loading

harupy Sep 18, 2019 •

edited

Loading

harupy Sep 18, 2019 •

edited

Loading

itholic Sep 18, 2019 •

edited

Loading

harupy Sep 18, 2019 •

edited

Loading

itholic Sep 19, 2019 •

edited

Loading

harupy Sep 19, 2019 •

edited

Loading

itholic Sep 19, 2019 •

edited

Loading

Softagram Impact Report for pull/794 (head commit: `e23f9ee`)