
Fix value_counts() to work properly when dropna is True #1116

Merged: 11 commits into databricks:master on Dec 18, 2019

Conversation

@itholic (Contributor) commented Dec 11, 2019

This PR resolves the comment at #949 (comment).

>>> kdf
   a    b
0  1  NaN
1  2  1.0
2  3  NaN

>>> kdf.a
0    1
1    2
2    3
Name: a, dtype: int64

>>> kdf.a.value_counts()
2    1
3    1
1    1
Name: a, dtype: int64
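For reference, this is the behavior on the NaN-containing column b that the fix targets. A hedged sketch with assumed output, following pandas semantics: dropna=True (the default) excludes NaN, while dropna=False counts it.

>>> kdf.b.value_counts()
1.0    1
Name: b, dtype: int64

>>> kdf.b.value_counts(dropna=False)
NaN    2
1.0    1
Name: b, dtype: int64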

@codecov-io commented Dec 11, 2019

Codecov Report

Merging #1116 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1116      +/-   ##
==========================================
- Coverage   95.14%   95.11%   -0.04%     
==========================================
  Files          35       35              
  Lines        7003     7039      +36     
==========================================
+ Hits         6663     6695      +32     
- Misses        340      344       +4
Impacted Files Coverage Δ
databricks/koalas/base.py 96.17% <100%> (+0.11%) ⬆️
databricks/koalas/indexing.py 93.26% <0%> (-1.01%) ⬇️
databricks/koalas/indexes.py 96.24% <0%> (-0.12%) ⬇️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
databricks/koalas/missing/indexes.py 100% <0%> (ø) ⬆️
databricks/koalas/frame.py 96.82% <0%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63d42c3...c04be15. Read the comment docs.

pser.index.value_counts(ascending=True, dropna=False), almost=True)

# Series with a MultiIndex where some index entries are NaN.
# This test is only available for pandas >= 0.24.
Collaborator

Why are these only for pandas >= 0.24?

@itholic (Contributor, Author), Dec 13, 2019

Because pandas < 0.24 doesn't support None in a MultiIndex, as shown below:

>>> pd.__version__
'0.23.0'
>>> pidx = pd.MultiIndex.from_tuples([('x', 'a'), None, ('y', 'c')])
Traceback (most recent call last):
...
TypeError: object of type 'NoneType' has no len()

So I think this test should only be run on pandas >= 0.24.
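For comparison, a hedged sketch of the same construction on pandas >= 0.24 (version string and exact behavior assumed; the point is only that the None tuple no longer raises):

>>> pd.__version__
'0.24.2'
>>> pidx = pd.MultiIndex.from_tuples([('x', 'a'), None, ('y', 'c')])  # accepted, no TypeError
>>> len(pidx)
3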

Comment on lines 249 to 253
set_arrow_conf = False
if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
        default_session().conf.get("spark.sql.execution.arrow.enabled") == "true":
    default_session().conf.set("spark.sql.execution.arrow.enabled", "false")
    set_arrow_conf = True
Collaborator

Why do we need this?
If we need this, could you make sure to set it back to the original conf?

@itholic (Contributor, Author)

We need this to test on PySpark 2.3, for the same reason as #1116 (comment).

I fixed the test logic to make sure the original conf value is set back.

Thanks for the review :)
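A minimal sketch of that clean-up pattern, reusing the names from the quoted snippets (default_session, self._test_value_counts); this is not the exact code in the PR:

original_conf = default_session().conf.get("spark.sql.execution.arrow.enabled")
try:
    # Arrow execution breaks this path on PySpark < 2.4, so turn it off for the test.
    default_session().conf.set("spark.sql.execution.arrow.enabled", "false")
    self._test_value_counts()
finally:
    # Always restore whatever value the session had before the test ran.
    default_session().conf.set("spark.sql.execution.arrow.enabled", original_conf)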

@softagram-bot

Softagram Impact Report for pull/1116 (head commit: c04be15)

⚠️ Copy paste found

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 307, 329:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normaliz...(truncated 1204 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 329, 351:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                    ...(truncated 1114 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 329, 351, 375:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                    ...(truncated 1114 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 307, 351, 375:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normaliz...(truncated 1085 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 256, 308, 330, 352, 376:


        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normalize=True), almost=True)
        self.assert_eq(k...(truncated 1039 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 295, 317, 339:


        self.assert_eq(kser.index.value_counts(normalize=True),
                       pser.index.value_counts(no...(truncated 595 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 265, 295, 361, 385:


        self.assert_eq(kser.index.value_counts(normalize=True),
                       pser.index.value_counts(nor...(truncated 505 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 440, 457:

                     pd.Series([True, False], name='x'),
                     pd.Series([0, 1], name='x'),
                     pd.Series([1, 2,...(truncated 330 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 884, 1025:

        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2]...(truncated 280 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 885, 1026, 1062:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 256 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 856, 1059:


        # For MultiIndex
        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[...(truncated 167 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 859, 885, 1026:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 117 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

(Graph showing the changed files, dependency changes and the impact; open in Softagram Desktop for full details.)

💡 Insights

  • Co-change Alert: You modified test_series.py. Often series.py (databricks/koalas) is modified at the same time.

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

        RuntimeError,
        lambda: ks.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')]).value_counts())
    else:
        self._test_value_counts()
Collaborator

Could you disable arrow execution only when pyspark < 2.4 and for MultiIndex?

@itholic (Contributor, Author), Dec 18, 2019

Okay, I will 👍
Thanks!

Comment on lines +946 to +952
from databricks.koalas.indexes import MultiIndex
if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
        default_session().conf.get("spark.sql.execution.arrow.enabled") == "true" and \
        isinstance(self, MultiIndex):
    raise RuntimeError("if you're using pyspark < 2.4, set conf "
                       "'spark.sql.execution.arrow.enabled' to 'false' "
                       "for using this function with MultiIndex")
Collaborator

Btw, I think we should do this only in MultiIndex by overriding.

@itholic (Contributor, Author)

Ah, that seems better. Thanks for the advice!
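A hedged sketch of that suggestion, reusing the names from the quoted snippet above; the signature follows pandas' value_counts, and this is not the code that was actually merged:

class MultiIndex(Index):
    def value_counts(self, normalize=False, sort=True, ascending=False,
                     bins=None, dropna=True):
        # Fail fast on PySpark < 2.4 with Arrow enabled; only MultiIndex is affected,
        # so the check lives here instead of in the shared base implementation.
        if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
                default_session().conf.get("spark.sql.execution.arrow.enabled") == "true":
            raise RuntimeError("if you're using pyspark < 2.4, set conf "
                               "'spark.sql.execution.arrow.enabled' to 'false' "
                               "for using this function with MultiIndex")
        return super().value_counts(normalize=normalize, sort=sort, ascending=ascending,
                                    bins=bins, dropna=dropna)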

@HyukjinKwon (Member) left a comment

@HyukjinKwon (Member)

Let me make a follow-up.

@HyukjinKwon HyukjinKwon merged commit 5a950c0 into databricks:master Dec 18, 2019
@itholic (Contributor, Author) commented Dec 18, 2019

@HyukjinKwon ah, thanks! Can you cc me after it's finished?

HyukjinKwon added a commit that referenced this pull request Dec 18, 2019
@itholic itholic deleted the fix_i_value_counts branch December 20, 2019 04:14
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023