-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Series.__getitem__ to take boolean Series #1075
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1075 +/- ##
==========================================
+ Coverage 95.2% 95.21% +<.01%
==========================================
Files 34 34
Lines 6889 6893 +4
==========================================
+ Hits 6559 6563 +4
Misses 330 330
Continue to review full report at Codecov.
|
ping @HyukjinKwon @ueshin |
databricks/koalas/series.py
Outdated
@@ -4383,6 +4383,12 @@ def __len__(self): | |||
return len(self.to_dataframe()) | |||
|
|||
def __getitem__(self, key): | |||
if isinstance(key, Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harupy, can we also check if the given Series's type is bool
to prevent such cases:
>>> pd.DataFrame({'a': [1,2,3]})[pd.Series(['a'])]
a
0 1
1 2
2 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix it
databricks/koalas/series.py
Outdated
@@ -4383,6 +4383,12 @@ def __len__(self): | |||
return len(self.to_dataframe()) | |||
|
|||
def __getitem__(self, key): | |||
if isinstance(key, Series): | |||
bcol = key._scol.cast(BooleanType()) | |||
sdf = self._internal.sdf.filter(bcol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, ideally we should also consider about different Series from different DataFrames.
We can do it via using existing logic that handles such series from different DataFrames:
# To allows 'compute.ops_on_diff_frames', it assigns the given column to the current Frame.
self._kdf["__temp_col__"] = key
sdf = self._internal.sdf.filter(F.col("__temp_col__"))
internal = self._internal.copy(sdf=sdf)
...
^^ It's not tested so should be double checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Does the code above work for both different and same dataframes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think so because I suggested the similar trick before and it seemed working (e.g., 7070bc6#diff-5ba644c40e914732959b9b4867f30c53R2158)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add one test under OpsOnDiffFramesEnabledTest
(e.g., kser[ks.Series([True])]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! wait, can we reuse self.to_dataframe().where(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.to_dataframe().where(...)
didn't work. It just fills False
with NaN
.
Softagram Impact Report for pull/1075 (head commit: f8e4272)
|
if isinstance(key, Series) and isinstance(key.spark_type, BooleanType): | ||
self._kdf["__temp_col__"] = key | ||
sdf = self._kdf._sdf.filter(F.col("__temp_col__")).drop("__temp_col__") | ||
return _col(DataFrame(_InternalFrame(sdf=sdf, index_map=self._internal.index_map))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, actually it should set column index. In where
too .. I see some related problems .. let me fix it separately.
Resolves #1073