Implement Series.factorize() #1972
Conversation
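For context, Series.factorize() encodes a Series as an array of integer codes plus an Index of the unique values; this PR brings pandas' API to Koalas. A minimal pandas illustration of the target semantics (not code from the PR):

```python
import pandas as pd

# factorize() returns (codes, uniques): codes are positions into uniques,
# assigned in order of first appearance.
codes, uniques = pd.Series(["b", "a", "c", "b"]).factorize()
print(codes)    # [0 1 2 0]
print(uniques)  # Index(['b', 'a', 'c'], dtype='object')
```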
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1972      +/-   ##
==========================================
+ Coverage   94.52%   94.58%   +0.06%
==========================================
  Files          50       50
  Lines       10952    11041      +89
==========================================
+ Hits        10352    10443      +91
+ Misses        600      598       -2
```

Continue to review full report at Codecov.
Force-pushed from d26b899 to a86af14.
databricks/koalas/series.py (outdated)

```python
raise ValueError(
    "Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
    "to restrict the total number of unique values of the current Series. "
    "Note that this operation is considerably expensive, so consider the cost "
    "before raising 'compute.max_rows'."
)
```
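For reference, the message points users at the Koalas options API; setting the cap would look like this (a usage illustration, not code from the PR):

```python
import databricks.koalas as ks

# compute.max_rows caps how many rows Koalas collects to the driver
# (default 1000); passing None removes the cap entirely.
ks.set_option("compute.max_rows", 10000)
```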
In this case, we should just collect all the data? cc @HyukjinKwon
What do you mean by "collect all the data"?
Just do toPandas() without limits.
Got it! Modified to toPandas() without limits for now.
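A hedged sketch of what "toPandas() without limits" amounts to here (hypothetical code, not the PR's actual implementation): deduplicate on the Spark side, then collect only the distinct values without applying the compute.max_rows cap:

```python
import databricks.koalas as ks

kser = ks.Series(["b", "a", "b", None, "c"])

# drop_duplicates() runs distributed on Spark, so only the distinct values
# reach the driver; to_pandas() is the "without limits" collect, with no
# compute.max_rows cap, which is why it can be expensive for
# high-cardinality Series.
uniques_list = kser.drop_duplicates().to_pandas().tolist()
```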
Otherwise, LGTM.
databricks/koalas/series.py (outdated)

```python
if na_sentinel is not None:
    # Drop NaN from the unique values.
    non_na_list = [x for x in uniques_list if not pd.isna(x)]
    if len(non_na_list) == 0:
        uniques = pd.Index(non_na_list)
    else:
        uniques = ks.Index(non_na_list)
else:
    uniques = ks.Index(uniques_list)
```
I think we can always return pd.Index as uniques...? cc @HyukjinKwon
Modified it to pd.Index for now.
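For reference, pandas itself hands back uniques as a plain in-memory Index, and by the time uniques exist here they have already been collected to the driver, so a plain pd.Index is the natural fit. A quick pandas check (with the default na_sentinel=-1, NaN stays out of uniques):

```python
import pandas as pd

# The uniques come back as an ordinary Index; NaN is encoded as the
# default sentinel -1 in codes rather than appearing in uniques.
codes, uniques = pd.Series(["b", None, "a", "b"]).factorize()
print(codes)    # [ 0 -1  1  0]
print(uniques)  # Index(['b', 'a'], dtype='object')
```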
LGTM, pending tests.
@xinrong-databricks Could you try the following as well?

```python
>>> kser = ks.Series([1, 2, np.nan, 4, 5])
>>> kser.loc[3] = np.nan
>>> kser.factorize(na_sentinel=None)
(0    0
 1    1
 2    4
 3    4
 4    2
 dtype: int32, Float64Index([1.0, 2.0, 5.0, nan, nan], dtype='float64'))
>>> kser.to_pandas().factorize(na_sentinel=None)
(array([0, 1, 3, 3, 2]), Float64Index([1.0, 2.0, 5.0, nan], dtype='float64'))
```
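For comparison, the pandas semantics this example targets: with na_sentinel=None (accepted by pandas versions contemporary with this PR; the parameter was later deprecated), NaN is treated as an ordinary value, so it appears exactly once in uniques and all NaNs share one code. The pandas side in isolation:

```python
import numpy as np
import pandas as pd

pser = pd.Series([1, 2, np.nan, 4, 5])
pser.loc[3] = np.nan  # values are now [1.0, 2.0, nan, nan, 5.0]

codes, uniques = pser.factorize(na_sentinel=None)
print(codes)    # [0 1 3 3 2] -- both NaNs map to the single code 3
print(uniques)  # Float64Index([1.0, 2.0, 5.0, nan], dtype='float64')
```

The Koalas output above instead repeats nan in uniques and assigns the NaNs code 4, which is the discrepancy being flagged.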
@ueshin Good catch! Let me look into this.

Thanks! Merging.

Thank you for reviewing and merging the PR! @ueshin :)
ref #1929