-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Select the series correctly in SeriesGroupBy APIs #1224
Conversation
|
||
@property | ||
def _kdf(self) -> DataFrame: | ||
return self._kser._kdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is here.
@@ -2016,6 +2012,15 @@ class SeriesGroupBy(GroupBy): | |||
def __init__(self, kser: Series, by: List[Series], as_index: bool = True): | |||
self._kser = kser | |||
self._groupkeys = by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here.
13c0af0
to
721b171
Compare
Codecov Report
@@ Coverage Diff @@
## master #1224 +/- ##
=========================================
+ Coverage 95.19% 95.2% +<.01%
=========================================
Files 35 35
Lines 7263 7271 +8
=========================================
+ Hits 6914 6922 +8
Misses 349 349
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should refactor these classes since the base infrastructure is rather old. We should revisit and refine later.
self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys] | ||
self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F.col(s._internal.data_columns[0])
instead of F.col(name_like_string(s.name))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched but realised that some tests fail with that change. Seems like there's some cases when internal column names and series names are different:
>>> (ks.range(10).id + 1)._internal.data_columns
['(id + 1)']
>>> (ks.range(10).id + 1).name
'id'
We use:
@property
def _kdf(self) -> DataFrame:
series = [self._kser] + [s for s in self._groupkeys if not s._equals(self._kser)]
return DataFrame(self._kser._kdf._internal.with_new_columns(series))
which always aliases via using column_index
at with_new_columns
. So, it seems guaranteed to have the internal Spark column names same as s.name
. I think using name
here for this workaround is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. hm, maybe renaming the columns in with_new_columns
was not a good idea. I'll fix it later.
databricks/koalas/groupby.py
Outdated
return self._kser._kdf | ||
# TODO: if names from _kser and _groupkeys are name, grouping key is just ignored. | ||
# it can be a problem when both series have the same name but different operations. | ||
series = [self._kser] + [s for s in self._groupkeys if s.name != self._kser.name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using s._equals(self._kser)
which compares the ._scol._jc
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, seems it fails because the same names exist in the same DataFrame AnalysisException: Reference 'b' is ambiguous, could be: b, b.;
... but yeah I think it's better to fail fast for now rather than returning a wrong result.
self.assert_eq(kdf.groupby(['b'])['a'].apply(lambda x: x).sort_index(), | ||
pdf.groupby(['b'])['a'].apply(lambda x: x).sort_index()) | ||
self.assert_eq(kdf.groupby(['b'])['b'].apply(lambda x: x).sort_index(), | ||
pdf.groupby(['b'])['b'].apply(lambda x: x).sort_index()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we will not support this case for now, which I think it's fine (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys] | ||
self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. hm, maybe renaming the columns in with_new_columns
was not a good idea. I'll fix it later.
Thanks! merging. |
Is this issue fixed? |
Yup, this will be available for the next week's release. |
…different values (#1233) A small followup of databricks/koalas#1224 and databricks/koalas#1229 Now, we can cover this case
Can be tested with the codes below:
Basic idea is that, it creates a DataFrame from Series to reuse existing implementations, and resolves the columns by name.