Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Series.factorize() #1972

Merged
merged 19 commits into from
Jan 13, 2021
Merged

Conversation

xinrong-meng
Copy link
Contributor

@xinrong-meng xinrong-meng commented Dec 16, 2020

ref #1929

        >>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
        >>> codes, uniques = kser.factorize()
        >>> codes
        0    1
        1   -1
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=None)
        >>> codes
        0    1
        1    3
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c', None], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=-2)
        >>> codes
        0    1
        1   -2
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')

@codecov-io
Copy link

codecov-io commented Dec 16, 2020

Codecov Report

Merging #1972 (2e538a7) into master (0e44bc7) will increase coverage by 0.06%.
The diff coverage is 95.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1972      +/-   ##
==========================================
+ Coverage   94.52%   94.58%   +0.06%     
==========================================
  Files          50       50              
  Lines       10952    11041      +89     
==========================================
+ Hits        10352    10443      +91     
+ Misses        600      598       -2     
Impacted Files Coverage Δ
databricks/koalas/missing/series.py 100.00% <ø> (ø)
databricks/koalas/series.py 96.75% <95.00%> (+<0.01%) ⬆️
databricks/koalas/plot/matplotlib.py 92.62% <0.00%> (-0.73%) ⬇️
databricks/koalas/generic.py 92.60% <0.00%> (-0.01%) ⬇️
databricks/koalas/missing/frame.py 100.00% <0.00%> (ø)
databricks/koalas/tests/plot/test_frame_plot.py 100.00% <0.00%> (ø)
...s/koalas/tests/plot/test_series_plot_matplotlib.py 100.00% <0.00%> (ø)
databricks/koalas/groupby.py 91.60% <0.00%> (+0.02%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e44bc7...2e538a7. Read the comment docs.

@xinrong-meng xinrong-meng requested a review from ueshin December 21, 2020 17:32
@xinrong-meng xinrong-meng marked this pull request as ready for review December 21, 2020 17:34
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
Comment on lines 1998 to 2003
raise ValueError(
"Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
"to restrict the total number of unique values of the current Series."
"Note that, before changing the 'compute.max_rows', "
"this operation is considerably expensive."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we should just collect all the data? cc @HyukjinKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean by collect all the data?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do toPandas() without limits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Modified to toPandas() without limits for now.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/tests/test_series.py Show resolved Hide resolved
Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
Comment on lines 2032 to 2040
if na_sentinel is not None:
# Drops the NaN from the uniques of the values
non_na_list = [x for x in uniques_list if not pd.isna(x)]
if len(non_na_list) == 0:
uniques = pd.Index(non_na_list)
else:
uniques = ks.Index(non_na_list)
else:
uniques = ks.Index(uniques_list)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can always return pd.Index as uniques ..? cc @HyukjinKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified it to pd.Index for now.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
Comment on lines 1998 to 2003
raise ValueError(
"Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
"to restrict the total number of unique values of the current Series."
"Note that, before changing the 'compute.max_rows', "
"this operation is considerably expensive."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do toPandas() without limits.

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests.

@ueshin
Copy link
Collaborator

ueshin commented Jan 11, 2021

@xinrong-databricks Could you try the following as well?

>>> kser = ks.Series([1, 2, np.nan, 4, 5])
>>> kser.loc[3] = np.nan
>>> kser.factorize(na_sentinel=None)
(0    0
1    1
2    4
3    4
4    2
dtype: int32, Float64Index([1.0, 2.0, 5.0, nan, nan], dtype='float64'))

>>> kser.to_pandas().factorize(na_sentinel=None)
(array([0, 1, 3, 3, 2]), Float64Index([1.0, 2.0, 5.0, nan], dtype='float64'))

@xinrong-meng
Copy link
Contributor Author

@ueshin Good catch! Let me look into this.

@ueshin
Copy link
Collaborator

ueshin commented Jan 13, 2021

Thanks! merging.

@ueshin ueshin merged commit ce2d260 into databricks:master Jan 13, 2021
@xinrong-meng
Copy link
Contributor Author

Thank you for reviewing and merging the PR! @ueshin :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants