Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DataFrame.drop() to remove fields from Spark DataFrame also. #794

Merged
merged 11 commits into from
Sep 20, 2019

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Sep 18, 2019

When we drop columns from dataframe with DataFrame.drop(),

We can get a dataframe which columns are dropped properly like below.

>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>>
>>> df = df.drop('name')
>>> df
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

But when we try to get an internal spark dataframe after then,

it shows us original one which is not delete columns like below.

>>> df._sdf.show()
+-----------------+------+------+---------+
|__index_level_0__|  name| class|max_speed|
+-----------------+------+------+---------+
|                0|falcon|  bird|    389.0|
|                1|parrot|  bird|     24.0|
|                2|  lion|mammal|     80.5|
|                3|monkey|mammal|     null|
+-----------------+------+------+---------+

(Although I dropped a column 'name' above example, it still shown in internal spark dataframe)

so i think maybe we need to drop them, too.

like:

>>> df._sdf.show()
+-----------------+------+---------+
|__index_level_0__| class|max_speed|
+-----------------+------+---------+
|                0|  bird|    389.0|
|                1|  bird|     24.0|
|                2|mammal|     80.5|
|                3|mammal|     null|
+-----------------+------+---------+

@itholic itholic mentioned this pull request Sep 18, 2019
@itholic itholic changed the title Drop internal dataframe also when drop dataframe. Fix DataFrame.drop() to remove fields from Spark DataFrame also. Sep 18, 2019
@codecov-io
Copy link

codecov-io commented Sep 18, 2019

Codecov Report

Merging #794 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #794      +/-   ##
=========================================
+ Coverage   94.28%   94.3%   +0.02%     
=========================================
  Files          32      32              
  Lines        5770    5828      +58     
=========================================
+ Hits         5440    5496      +56     
- Misses        330     332       +2
Impacted Files Coverage Δ
databricks/koalas/frame.py 96.78% <ø> (-0.1%) ⬇️
databricks/koalas/base.py 94.8% <0%> (-0.07%) ⬇️
databricks/koalas/internal.py 95.85% <0%> (-0.04%) ⬇️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
databricks/koalas/namespace.py 86.13% <0%> (ø) ⬆️
databricks/koalas/missing/indexes.py 100% <0%> (ø) ⬆️
databricks/koalas/series.py 95.16% <0%> (+0.02%) ⬆️
databricks/koalas/groupby.py 89.92% <0%> (+0.07%) ⬆️
databricks/koalas/indexes.py 99.22% <0%> (+0.09%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e992200...e23f9ee. Read the comment docs.

# make column string list to drop internal spark dataframe fields
columns = [col[0] for col in columns]
internal = self._internal.copy(
sdf=self._sdf.drop(*columns),
Copy link
Member

@HyukjinKwon HyukjinKwon Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this seems not covering multi-index case.

I think you can just select existing data_columns:

sdf=self._sdf.select(
    self._internal.index_scols + [self._internal.scol_for(col) for col in cols])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comment! 😃

I just tested it for multi-index case like below,

>>> import databricks.koalas as ks
>>> import pandas as pd
>>> import numpy as np
>>>
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> idx = pd.MultiIndex.from_arrays(arrays, names=('Index', 'color'))
>>> pdf = pd.DataFrame(np.random.randn(4, 5), idx)
>>> kdf = ks.from_pandas(pdf)
>>>
>>> kdf
                    0         1         2         3         4
Index color
1     red    0.304819  1.373173 -0.095708 -0.165494 -0.922387
      blue  -0.538924  0.623598  0.705721 -0.006320  1.173270
2     red    1.397902 -1.870591 -0.294745 -0.100288  0.802501
      blue   1.922724 -0.314832  1.279700  0.414461 -0.010711
>>>
>>> kdf = kdf.drop(['0', '1', '2'])
>>> kdf
                    3         4
Index color
1     red   -0.165494 -0.922387
      blue  -0.006320  1.173270
2     red   -0.100288  0.802501
      blue   0.414461 -0.010711
>>>
>>> kdf._sdf.show()
+-----+-----+--------------------+--------------------+
|Index|color|                   3|                   4|
+-----+-----+--------------------+--------------------+
|    1|  red| -0.1654939967509364|  -0.922387275283718|
|    1| blue|-0.00632027805463...|  1.1732703308776347|
|    2|  red|-0.10028785194871741|  0.8025013695853971|
|    2| blue| 0.41446113311041816|-0.01071114630144753|
+-----+-----+--------------------+--------------------+

I think it also works for multi-index case,

But maybe is there a something wrong that i had tested about multi-index?

Copy link
Contributor

@harupy harupy Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @HyukjinKwon meant multi-index columns.

Example:

import pandas as pd

pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})
columns = [('e', 'a'), ('e', 'b'), ('f', 'c'), ('f', 'd')]
pdf.columns = pd.MultiIndex.from_tuples(columns)
kdf = ks.DataFrame(pdf)

>>> kdf
   e     f
   a  b  c  d
0  1  3  5  7
1  2  4  6  8

>>> kdf._sdf.show()
+-----------------+----------+----------+----------+----------+
|__index_level_0__|('e', 'a')|('e', 'b')|('f', 'c')|('f', 'd')|
+-----------------+----------+----------+----------+----------+
|                0|         1|         3|         5|         7|
|                1|         2|         4|         6|         8|
+-----------------+----------+----------+----------+----------+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy ,
You're the best! it is really helpful for me. Thanks ! 😃

Copy link
Contributor

@harupy harupy Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, kdf.drop('e') should remove both "('e', 'a')" and "('e', 'b')" from _sdf, but the current implementation just does _sdf.drop('e') which has no effect at all because there is no column named 'e' in _sdf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy ,Really appreciate for your helping. Now maybe i really finished. could you check this out if when you available??

Copy link
Contributor Author

@itholic itholic Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy Could i use your multi-index example as an additional doctest for DataFrame.drop() maybe if you don't mind??

Copy link
Contributor

@harupy harupy Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic
Yes of course you can 😃😃😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy Thanks. i just added it. 😸

@HyukjinKwon
Copy link
Member

Let me leave it to @ueshin

@@ -4344,24 +4344,6 @@ def drop(self, labels=None, axis=1,
0 1 7
1 2 8

>>> pdf = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have this doctests.

Need to add columns=[...] for py3.5.

>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]}, columns=['x', 'y', 'z', 'w'])

or we can do:

>>> df = ks.DataFrame({('a', 'x'): [1, 2], ('a', 'y'): [3, 4], ('b', 'z'): [5, 6], ('b', 'w'): [7, 8]},
...                   columns=[('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')])

Copy link
Contributor Author

@itholic itholic Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Really appreciate your comment Takuya!! I was really suffered from this failed for all night long 😭

Copy link
Contributor

@harupy harupy Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic
In python < 3.6, pandas.DataFrame sorts the key order when it takes dict without columns using the function below because the key insertion order is NOT preserved in python < 3.6.

def dict_keys_to_ordered_list(mapping):
    # when pandas drops support for Python < 3.6, this function
    # can be replaced by a simple list(mapping.keys())
    if PY36 or isinstance(mapping, OrderedDict):
        keys = list(mapping.keys())
    else:
        keys = try_sort(mapping)
    return keys

https://github.com/pandas-dev/pandas/blob/0a082d40c4a750655a7f1ce127c58ca26cd5905e/pandas/core/common.py#L218-L225

Example:

>>> sys.version
'3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)]'

>>> data = {
...   'x': [1, 2],
...   'y': [3, 4],
...   'z': [5, 6],
...   'w': [7, 8]
... }
>>> data
{'y': [3, 4], 'z': [5, 6], 'x': [1, 2], 'w': [7, 8]}

Copy link
Contributor Author

@itholic itholic Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy I love this comment! finally my question all solved. Thanks for sharing your knowledge. :)

internal = self._internal.copy(data_columns=list(cols), column_index=list(idx))
internal = self._internal.copy(
sdf=self._sdf.select(
self._internal.index_scols + [self._internal.scol_for(col) for col in cols]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use idx rather than cols.
Maybe we should rename idx as idxes or something, then:

[self._internal.scol_for(idx) for idx in idxes]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin it makes sense. i fixed them. Thanks!! :)

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.

>>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')]
>>> pdf.columns = pd.MultiIndex.from_tuples(columns)
>>> kdf = ks.DataFrame(pdf)
>>> kdf # doctest: +NORMALIZE_WHITESPACE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we skip pandas part?

>>> df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'w': [7, 8]},
...                   columns=['x', 'y', 'z', 'w'])
>>> columns = [('a', 'x'), ('a', 'y'), ('b', 'z'), ('b', 'w')]
>>> df.columns = pd.MultiIndex.from_tuples(columns)
>>> df  # doctest: +NORMALIZE_WHITESPACE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Sure, It looks better! Thanks for review :)

@softagram-bot
Copy link

Softagram Impact Report for pull/794 (head commit: e23f9ee)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

@HyukjinKwon HyukjinKwon merged commit 275463a into databricks:master Sep 20, 2019
@HyukjinKwon
Copy link
Member

Merged.

@itholic
Copy link
Contributor Author

itholic commented Sep 20, 2019

@HyukjinKwon Thanks! i'm going to move on to #791 again.

@itholic itholic deleted the fix_frame_drop branch October 14, 2019 04:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants