Allow Series.__getitem__ to take boolean Series #1075

harupy · 2019-11-25T14:39:36Z

Resolves #1073

codecov-io · 2019-11-25T16:08:39Z

Codecov Report

Merging #1075 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1075      +/-   ##
==========================================
+ Coverage    95.2%   95.21%   +<.01%     
==========================================
  Files          34       34              
  Lines        6889     6893       +4     
==========================================
+ Hits         6559     6563       +4     
  Misses        330      330

Impacted Files	Coverage Δ
databricks/koalas/series.py	`96.44% <100%> (+0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8a7a640...f8e4272.

harupy · 2019-12-02T01:09:14Z

ping @HyukjinKwon @ueshin

HyukjinKwon · 2019-12-02T03:31:47Z

databricks/koalas/series.py

@@ -4383,6 +4383,12 @@ def __len__(self):
        return len(self.to_dataframe())

    def __getitem__(self, key):
+        if isinstance(key, Series):


@harupy, can we also check that the given Series' type is boolean, to prevent cases like this:

>>> pd.DataFrame({'a': [1,2,3]})[pd.Series(['a'])]
   a
0  1
1  2
2  3

will fix it
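
To make the concern in this thread concrete, here is a small pandas-only sketch (the variable names are illustrative and not taken from the PR). It shows how the type of a Series key changes what indexing means: a boolean Series acts as a row mask, while a Series of strings is treated as a list of column labels.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3]})

# A boolean Series is interpreted as a row mask.
mask = pd.Series([True, False, True])
masked = pdf[mask]

# A non-boolean Series is interpreted as a list of column labels,
# so no row filtering happens at all.
labels = pd.Series(["a"])
selected = pdf[labels]

print(masked)
print(selected)
```

This is why an `isinstance(key, Series)` check alone would not be enough to decide that the key is a mask.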

HyukjinKwon · 2019-12-02T03:35:25Z

databricks/koalas/series.py

@@ -4383,6 +4383,12 @@ def __len__(self):
        return len(self.to_dataframe())

    def __getitem__(self, key):
+        if isinstance(key, Series):
+            bcol = key._scol.cast(BooleanType())
+            sdf = self._internal.sdf.filter(bcol)


Also, ideally we should handle a Series that comes from a different DataFrame.

We can do that by reusing the existing logic that handles Series from different DataFrames:

    # To allow 'compute.ops_on_diff_frames', assign the given column to the current frame.
    self._kdf["__temp_col__"] = key
    sdf = self._internal.sdf.filter(F.col("__temp_col__"))
    internal = self._internal.copy(sdf=sdf)
    ...

^^ This is untested, so it should be double-checked.

Thanks for the comment. Does the code above work for both different and same dataframes?

Yeah, I think so, because I suggested a similar trick before and it seemed to work (e.g., 7070bc6#diff-5ba644c40e914732959b9b4867f30c53R2158)

We could add one test under OpsOnDiffFramesEnabledTest (e.g., kser[ks.Series([True])])
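
The temporary-column idea discussed in this thread can be sketched with plain pandas as a stand-in. This is not the Koalas implementation: `filter_by_bool_series` and the `"value"` column name are hypothetical, and the sketch assumes the key covers every index label of the Series being filtered.

```python
import pandas as pd

TEMP_COL = "__temp_col__"  # same placeholder name as in the suggestion above


def filter_by_bool_series(ser: pd.Series, key: pd.Series) -> pd.Series:
    """Filter `ser` by a boolean `key` via a temporary column (hypothetical helper)."""
    frame = ser.to_frame(name="value")
    # Assignment aligns `key` to the frame's index; this is the step that
    # lets the key come from a different frame.
    frame[TEMP_COL] = key
    # Keep the rows where the temporary column is True, then drop it again.
    filtered = frame[frame[TEMP_COL]].drop(columns=TEMP_COL)
    return filtered["value"].rename(ser.name)


ser = pd.Series([10, 20, 30], index=[0, 1, 2], name="x")
key = pd.Series([False, True, True], index=[2, 0, 1])  # same labels, different row order
result = filter_by_bool_series(ser, key)
print(result)
```

Because the key is aligned by index label rather than by position, the rows labelled 0 and 1 are kept here even though the key lists them last.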

Oh! wait, can we reuse self.to_dataframe().where(...)?

self.to_dataframe().where(...) didn't work. It keeps every row and just replaces the values where the condition is False with NaN.
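
The difference between `where` and boolean indexing can be seen in plain pandas, which Koalas mirrors here. The following is a minimal sketch with made-up data, not code from the PR.

```python
import pandas as pd

pser = pd.Series([1, 2, 3])
cond = pd.Series([True, False, True])

# where() keeps every row and replaces the False positions with NaN.
kept_shape = pser.where(cond)

# Boolean indexing drops the False positions instead.
filtered = pser[cond]

print(kept_shape)
print(filtered)
```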

softagram-bot · 2019-12-02T15:33:58Z

Softagram Impact Report for pull/1075 (head commit: `f8e4272`)

⚠️ Copy paste found

ℹ️ series.py: Copy paste fragment on line 1250 shared with ../frame.py:


    def to_latex(self, buf=None, columns=None, col_space=None, header=True, index=True,
                 na_rep='NaN',...(truncated 256 chars)

ℹ️ series.py: Copy paste fragment on line 1937 shared with ../frame.py:

                   level: Optional[Union[int, List[int]]] = None, ascending: bool = True,
                   inplace: bool = False, kind: str = None, na_positio...(truncated 56 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 3091, 3183:

        results = sdf.select([scol] + index_scols).take(1)
        if len(results) == 0:
            raise ValueError("attempt to get...(truncated 376 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 4073, 4232:

        sdf = self._internal.sdf \
            .select(cols) \
            .where(reduce(lambda x, y: x & y, rows))

        if len(self._inter...(truncated 255 chars)

ℹ️ series.py: Copy paste fragment inside the same file on lines 3350, 4245:

            internal = _InternalFrame(sdf=sdf, index_map=[(SPARK_INDEX_NAME_FORMAT(0), None)])
            return ...(truncated 148 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 508 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

    @propert...(truncated 20 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37, 65, 508 shared with ../test_dataframe.py, ../test_indexes.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 37, 65, 508 shared with ../test_indexing.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 439, 547 shared with ../test_indexing.py:

        pdf = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                           index=['cobra', 'viper', 'sidewinder'],
                           columns=['max_speed', 'sh...(truncated 41 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment on line 38, 509 shared with ../test_dataframe.py, ../test_indexes.py:

            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 458, 565:

                       repr(kdf1.where(kdf2 > 100).sort_index()))

        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, ...(truncated 174 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 475, 584:

                       repr(kdf1.mask(kdf2 < 100).sort_index()))

        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -4...(truncated 172 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 466, 574:

                       repr(kdf1.where(kdf2 < -250).sort_index()))

    def test_mask(self):
        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B...(truncated 195 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 566, 585:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [...(truncated 275 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 459, 476:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10,...(truncated 176 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 558, 577:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 223 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 459, 585:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10,...(truncated 121 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 476, 566:


        pdf1 = pd.DataFrame({'A': [-1, -2, -3, -4, -5], 'B': [-100, -200, -300, -400, -500]})
        pdf2 = pd.DataFrame({'A': [-10, ...(truncated 120 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 452, 469:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 128 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 220, 452, 469:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, ...(truncated 137 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 452, 558:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]...(truncated 74 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 220, 452, 469, 558, 577:

        pdf1 = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
        pdf2 = pd.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, ...(truncated 91 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 34, 505:


    @property
    def pdf1(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6,...(truncated 67 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 41, 512:


    @property
    def pdf2(self):
        return pd.DataFrame({
            'a': [9, 8, 7, 6, 5, 4, 3, 2, 1],
            'b': [0, 0, 0, 4, 5, 6, 1, 2,...(truncated 72 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 439, 547:

        pdf = pd.DataFrame(
            [[1, 2], [4, 5], [7, 8]],
            index=['cobra', 'viper', 'sidewinder'],
            columns=['max_speed', 'shield'])
     ...(truncated 66 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 44, 73, 515:

        return pd.DataFrame({
            'a': [9, 8, 7, 6, 5, 4, 3, 2, 1],
            'b': [0, 0, 0, 4, 5, 6, 1, 2, 3],

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 231, 239:

        kser1 = ks.from_pandas(pser1)
        kser2 = ks.from_pandas(pser2)

        self.assert_eq(pser1 | p...(truncated 103 chars)

ℹ️ test_ops_on_diff_frames.py: Copy paste fragment inside the same file on lines 138, 190:


        # Multi-index columns
        columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')])
        kdf1.columns = columns
        kdf2.co...(truncated 77 chars)

ℹ️ test_series.py: Copy paste fragment on line 553 shared with ../test_series_plot.py:

            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9, 10, 10])

ℹ️ test_series.py: Copy paste fragment on line 559 shared with ../test_frame_plot.py, ../test_series_plot.py:

        bytes_data = BytesIO()
        ax.figure.savefig(bytes_data, format='png')
        bytes_data.seek(0)
        b64_data = base64.b64encode(bytes_data.read())
       ...(truncated 45 chars)

ℹ️ test_series.py: Copy paste fragment on line 50 shared with ../test_series_conversion.py:

    def pser(self):
        return pd.Series([1, 2, 3, 4, 5, 6, 7], name='x')

    @property
    def kser(self):
        return ks.from_pandas(self.pser)

    def test_series(self):

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 312, 329:

                     pd.Series([True, False], name='x'),
                     pd.Series([0, 1], name='x'),
                     pd.Series([1, 2,...(truncated 330 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 744, 905:

        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2]...(truncated 280 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 719, 745, 906:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 117 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 827, 844:


        pser1 = pd.Series([-1, -2, -3, -4, -5], name=0)
        pser2 = pd.Series([-100, -200, -300, -400, -500], name=0)
        k...(truncated 123 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 180, 190:

        pdf = pd.DataFrame({
            'left':  [True, False, True, False, np.nan, np.nan, True, False, np.nan],
            'right': [True, False, False, True, True, False, n...(truncated 119 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 747, 788, 908:

                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
        kser = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
             ...(truncated 137 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 649, 668:


        index = pd.MultiIndex.from_arrays([
            ['a', 'a', 'b', 'b'], ['c', 'd', 'e', 'f']], names=('first', 'se...(truncated 151 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 820, 837:

        pser1 = pd.Series([0, 1, 2, 3, 4], name=0)
        pser2 = pd.Series([100, 200, 300, 400, 500], name=0)
        kser1 = ks.from_pandas(pser1)
        kser2 = ks.from_...(truncated 69 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 721, 788:

                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
        kser = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 522, 532:


        pser = pd.Series([1, 2, 3], name='0',
                         index=pd.MultiIndex.from_tuples([('A', 'X'), ('A', 'Y...(truncated 128 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 102, 107:

        self.assertTrue(ks.from_pandas(a).toPandas().isnull().all())
        self.assertRaises(ValueError, lambda: ks.from_pandas(b))

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 282, 290:

        sample_lst = [1, 2, 3, 4, np.nan, 6]
        pser = pd.Series(sample_lst, name='x')
        kser = ks.Series(sample_lst, name='x')
        self.assert_eq(kser.nsm...(truncated 33 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

(Open in Softagram Desktop for full details)

💡 Insights

Co-change Alert: You modified series.py. Often frame.py (databricks/koalas) is modified at the same time.

📄 Full report

Permalink: Full report for pull/1075


HyukjinKwon · 2019-12-03T03:57:53Z

databricks/koalas/series.py

+        if isinstance(key, Series) and isinstance(key.spark_type, BooleanType):
+            self._kdf["__temp_col__"] = key
+            sdf = self._kdf._sdf.filter(F.col("__temp_col__")).drop("__temp_col__")
+            return _col(DataFrame(_InternalFrame(sdf=sdf, index_map=self._internal.index_map)))


Hm, actually it should set the column index. The same goes for where. I see some related problems, so let me fix this separately.
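
The dispatch in the diff above (treat the key as a mask only when it is a Series of boolean type) can be illustrated without Spark. The following is a toy sketch: `MiniSeries` and `is_boolean` are hypothetical stand-ins, not Koalas classes or APIs.

```python
from typing import Any, List, Optional


class MiniSeries:
    """Toy stand-in for a Series; not the Koalas class."""

    def __init__(self, values: List[Any], index: Optional[List[Any]] = None):
        self.values = list(values)
        self.index = list(index) if index is not None else list(range(len(self.values)))

    @property
    def is_boolean(self) -> bool:
        # Stand-in for the `isinstance(key.spark_type, BooleanType)` check.
        return all(isinstance(v, bool) for v in self.values)

    def __getitem__(self, key):
        if isinstance(key, MiniSeries) and key.is_boolean:
            # Boolean Series key: keep the rows whose mask value is True,
            # matching rows by index label.
            lookup = dict(zip(key.index, key.values))
            kept = [(i, v) for i, v in zip(self.index, self.values) if lookup.get(i, False)]
            return MiniSeries([v for _, v in kept], [i for i, _ in kept])
        # Anything else falls through to a plain label lookup.
        return self.values[self.index.index(key)]


ser = MiniSeries([10, 20, 30])
picked = ser[MiniSeries([True, False, True])]
scalar = ser[1]
print(picked.values, picked.index, scalar)
```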

harupy added 2 commits November 25, 2019 23:37

Fix getitem

5dd4476

Fix indent

86b9b6d

harupy closed this Nov 25, 2019

harupy reopened this Nov 25, 2019

HyukjinKwon reviewed Dec 2, 2019

harupy added 2 commits December 2, 2019 21:15

Merge branch 'master' into fix-series-getitem

6d9b8dc

fix for different frames

f8e4272

harupy closed this Dec 2, 2019

harupy reopened this Dec 2, 2019

HyukjinKwon reviewed Dec 3, 2019

HyukjinKwon approved these changes Dec 3, 2019

HyukjinKwon merged commit 75664fa into databricks:master Dec 3, 2019
