Allow Series.__getitem__ to take boolean Series #1075

Merged 4 commits on Dec 3, 2019

harupy
Contributor

@harupy harupy commented Nov 25, 2019

Resolves #1073

@harupy harupy closed this Nov 25, 2019
@harupy harupy reopened this Nov 25, 2019
@codecov-io
codecov-io commented Nov 25, 2019

Codecov Report

Merging #1075 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1075      +/-   ##
==========================================
+ Coverage    95.2%   95.21%   +<.01%     
==========================================
  Files          34       34              
  Lines        6889     6893       +4     
==========================================
+ Hits         6559     6563       +4     
  Misses        330      330
Impacted Files Coverage Δ
databricks/koalas/series.py 96.44% <100%> (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8a7a640...f8e4272.

@harupy
Contributor Author

harupy commented Dec 2, 2019

ping @HyukjinKwon @ueshin

@@ -4383,6 +4383,12 @@ def __len__(self):
        return len(self.to_dataframe())

    def __getitem__(self, key):
        if isinstance(key, Series):
Member

@harupy, can we also check if the given Series's type is bool to prevent such cases:

>>> pd.DataFrame({'a': [1,2,3]})[pd.Series(['a'])]
   a
0  1
1  2
2  3
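
The ambiguity raised in this comment can be reproduced in plain pandas. The sketch below is only an illustration, not the Koalas patch: a Series of labels selects columns, while a boolean Series filters rows. The helper is_boolean_key is a hypothetical name, introduced here to show the kind of dtype check being requested.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

label_key = pd.Series(["a"])                # a Series of column labels
mask_key = pd.Series([True, False, True])   # a boolean mask

by_label = df[label_key]   # selects column "a"; all three rows are kept
by_mask = df[mask_key]     # keeps only the rows where the mask is True


def is_boolean_key(key):
    """Hypothetical helper: True only for a Series with a boolean dtype."""
    return isinstance(key, pd.Series) and pd.api.types.is_bool_dtype(key)
```

Without a check of this kind, a non-boolean Series would be sent down the row-filtering path even though pandas treats it as a list of labels.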

Contributor Author

will fix it

@@ -4383,6 +4383,12 @@ def __len__(self):
        return len(self.to_dataframe())

    def __getitem__(self, key):
        if isinstance(key, Series):
            bcol = key._scol.cast(BooleanType())
            sdf = self._internal.sdf.filter(bcol)
Member

@HyukjinKwon HyukjinKwon Dec 2, 2019

Also, ideally we should handle a key Series that comes from a different DataFrame.

We can do that by reusing the existing logic that handles Series from different DataFrames:

# To allow 'compute.ops_on_diff_frames', assign the given column to the current frame.
self._kdf["__temp_col__"] = key
sdf = self._internal.sdf.filter(F.col("__temp_col__"))
internal = self._internal.copy(sdf=sdf)
...

^^ This is untested, so it should be double-checked.
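
As a rough analogy for the temporary-column idea in this comment, here is the same pattern in plain pandas rather than Spark. It is a sketch under that assumption, not the Koalas code: the key is assigned as a helper column, which aligns it with the frame by index label, then the frame is filtered on that column and the helper is dropped. The variable names are made up for illustration.

```python
import pandas as pd

values = pd.Series([10, 20, 30], index=[0, 1, 2], name="x")
# The key comes from elsewhere and lists the same labels in a different order.
key = pd.Series([True, False, False], index=[2, 1, 0])

frame = values.to_frame()
frame["__temp_col__"] = key  # assignment aligns the key by index label, not by position
filtered = frame[frame["__temp_col__"]].drop(columns="__temp_col__")["x"]
```

Only the row labelled 2 survives, because the mask is matched by label. A purely positional filter would have kept the row labelled 0 instead.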

Contributor Author

Thanks for the comment. Does the code above work whether the key comes from the same DataFrame or from a different one?

Member

@HyukjinKwon HyukjinKwon Dec 2, 2019

Yeah, I think so, because I suggested a similar trick before and it seemed to work (e.g., 7070bc6#diff-5ba644c40e914732959b9b4867f30c53R2158)

Member

We could add one test under OpsOnDiffFramesEnabledTest (e.g., kser[ks.Series([True])])

Member

@HyukjinKwon HyukjinKwon Dec 2, 2019

Oh! wait, can we reuse self.to_dataframe().where(...)?

Contributor Author

Let me try

Contributor Author

self.to_dataframe().where(...) didn't work. Instead of dropping the rows where the condition is False, it keeps them and fills them with NaN.
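
The difference described here is easy to see in plain pandas. This is a small sketch, assuming Koalas mirrors pandas semantics for where as the comment suggests: where keeps the original shape and masks values, whereas boolean indexing removes rows.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="x")
mask = pd.Series([True, False, True])

masked = ser.where(mask)   # same length as ser; the False position becomes NaN
filtered = ser[mask]       # the False position is dropped entirely
```

Because where preserves the length of the Series, it cannot serve as a drop-in for __getitem__ with a boolean key, which has to return fewer rows.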

@softagram-bot

Softagram Impact Report for pull/1075 (head commit: f8e4272)

⚠️ Copy paste found

ℹ️ series.py: Copy paste fragment on line 1250 shared with ../frame.py:


    def to_latex(self, buf=None, columns=None, col_space=None, header=True, index=True,
                 na_rep='NaN',...(truncated 256 chars)

ℹ️ series.py: Copy paste fragment on line 1937 shared with ../frame.py:

                   level: Optional[Union[int, List[int]]] = None, ascending: bool = True,
                   inplace: bool = False, kind: str = None, na_positio...(truncated 56 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 3091, 3183:

        results = sdf.select([scol] + index_scols).take(1)
        if len(results) == 0:
            raise ValueError(\"attempt to get...(truncated 376 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 4073, 4232:

        sdf = self._internal.sdf \
            .select(cols) \
            .where(reduce(lambda x, y: x & y, rows))

        if len(self._inter...(truncated 255 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 3350, 4245:

            internal = _InternalFrame(sdf=sdf, index_map=[(SPARK_INDEX_NAME_FORMAT(0), None)])
            return ...(truncated 148 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 508 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

    @propert...(truncated 20 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37, 65, 508 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37, 65, 508 shared with ../test_indexing.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 439, 547 shared with ../test_indexing.py:

        pdf = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                           index=['cobra', 'viper', 'sidewinder'],
                           columns=['max_speed', 'sh...(truncated 41 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 38, 509 shared with ../test_dataframe.py, ../test_indexes.py:

            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 458, 565:

                       repr(kdf1.where(kdf2 > 100).sort_index()))

        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, ...(truncated 174 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 475, 584:

                       repr(kdf1.mask(kdf2 < 100).sort_index()))

        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -4...(truncated 172 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 466, 574:

                       repr(kdf1.where(kdf2 < -250).sort_index()))

    def test_mask(self):
        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B...(truncated 195 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 566, 585:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [...(truncated 275 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 459, 476:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10,...(truncated 176 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 558, 577:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 223 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 459, 585:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10,...(truncated 121 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 476, 566:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10, ...(truncated 120 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 452, 469:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 128 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 220, 452, 469:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, ...(truncated 137 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 452, 558:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 74 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 220, 452, 469, 558, 577:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, ...(truncated 91 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 34, 505:


    @property
    def pdf1(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6,...(truncated 67 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 41, 512:


    @property
    def pdf2(self):
        return pd.DataFrame({
            'a': [9, 8, 7, 6, 5, 4, 3, 2, 1],
            'b': [0, 0, 0, 4, 5, 6, 1, 2,...(truncated 72 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 439, 547:

        pdf = pd.DataFrame(
            [[1, 2], [4, 5], [7, 8]],
            index=['cobra', 'viper', 'sidewinder'],
            columns=['max_speed', 'shield'])
     ...(truncated 66 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 44, 73, 515:

        return pd.DataFrame({
            'a': [9, 8, 7, 6, 5, 4, 3, 2, 1],
            'b': [0, 0, 0, 4, 5, 6, 1, 2, 3],

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 231, 239:

        kser1 = ks.from_pandas(pser1)
        kser2 = ks.from_pandas(pser2)

        self.assert_eq(pser1 | p...(truncated 103 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 138, 190:


        # Multi-index columns
        columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')])
        kdf1.columns = columns
        kdf2.co...(truncated 77 chars)

ℹ️ test_series.py: Copy paste fragment on line 553 shared with ../test_series_plot.py:

            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9, 10, 10])

ℹ️ test_series.py: Copy paste fragment on line 559 shared with ../test_frame_plot.py, ../test_series_plot.py:

        bytes_data = BytesIO()
        ax.figure.savefig(bytes_data, format='png')
        bytes_data.seek(0)
        b64_data = base64.b64encode(bytes_data.read())
       ...(truncated 45 chars)

ℹ️ test_series.py: Copy paste fragment on line 50 shared with ../test_series_conversion.py:

    def pser(self):
        return pd.Series([1, 2, 3, 4, 5, 6, 7], name='x')

    @property
    def kser(self):
        return ks.from_pandas(self.pser)

    def test_series(self):

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 312, 329:

                     pd.Series([True, False], name='x'),
                     pd.Series([0, 1], name='x'),
                     pd.Series([1, 2,...(truncated 330 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 744, 905:

        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2]...(truncated 280 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 719, 745, 906:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 117 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 827, 844:


        pser1 = pd.Series([-1, -2, -3, -4, -5], name=0)
        pser2 = pd.Series([-100, -200, -300, -400, -500], name=0)
        k...(truncated 123 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 180, 190:

        pdf = pd.DataFrame({
            'left':  [True, False, True, False, np.nan, np.nan, True, False, np.nan],
            'right': [True, False, False, True, True, False, n...(truncated 119 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 747, 788, 908:

                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
        kser = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
             ...(truncated 137 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 649, 668:


        index = pd.MultiIndex.from_arrays([
            ['a', 'a', 'b', 'b'], ['c', 'd', 'e', 'f']], names=('first', 'se...(truncated 151 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 820, 837:

        pser1 = pd.Series([0, 1, 2, 3, 4], name=0)
        pser2 = pd.Series([100, 200, 300, 400, 500], name=0)
        kser1 = ks.from_pandas(pser1)
        kser2 = ks.from_...(truncated 69 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 721, 788:

                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
        kser = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 522, 532:


        pser = pd.Series([1, 2, 3], name='0',
                         index=pd.MultiIndex.from_tuples([('A', 'X'), ('A', 'Y...(truncated 128 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 102, 107:

        self.assertTrue(ks.from_pandas(a).toPandas().isnull().all())
        self.assertRaises(ValueError, lambda: ks.from_pandas(b))

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 282, 290:

        sample_lst = [1, 2, 3, 4, np.nan, 6]
        pser = pd.Series(sample_lst, name='x')
        kser = ks.Series(sample_lst, name='x')
        self.assert_eq(kser.nsm...(truncated 33 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.


💡 Insights

  • Co-change Alert: You modified series.py. Often frame.py (databricks/koalas) is modified at the same time.


@harupy harupy closed this Dec 2, 2019
@harupy harupy reopened this Dec 2, 2019
        if isinstance(key, Series) and isinstance(key.spark_type, BooleanType):
            self._kdf["__temp_col__"] = key
            sdf = self._kdf._sdf.filter(F.col("__temp_col__")).drop("__temp_col__")
            return _col(DataFrame(_InternalFrame(sdf=sdf, index_map=self._internal.index_map)))
Member

Hm, actually it should set the column index. The same goes for where; I see some related problems, so let me fix that separately.

@HyukjinKwon HyukjinKwon merged commit 75664fa into databricks:master Dec 3, 2019

Successfully merging this pull request may close these issues.

Series.__getitem__ doesn't work when Series is given
4 participants