
Fix value_counts() to work properly when dropna is True #1116

Merged: 11 commits into databricks:master on Dec 18, 2019

Conversation

@itholic (Contributor) commented Dec 11, 2019

This PR resolves the comment at #949 (comment).

>>> kdf
   a    b
0  1  NaN
1  2  1.0
2  3  NaN

>>> kdf.a
0    1
1    2
2    3
Name: a, dtype: int64

>>> kdf.a.value_counts()
2    1
3    1
1    1
Name: a, dtype: int64
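For reference, this is the behavior on the NaN-containing column b that the fix targets. A hedged sketch with assumed output, following pandas semantics: dropna=True (the default) excludes NaN, while dropna=False counts it.

>>> kdf.b.value_counts()
1.0    1
Name: b, dtype: int64

>>> kdf.b.value_counts(dropna=False)
NaN    2
1.0    1
Name: b, dtype: int64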

@codecov-io commented Dec 11, 2019

Codecov Report

Merging #1116 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1116      +/-   ##
==========================================
- Coverage   95.14%   95.11%   -0.04%     
==========================================
  Files          35       35              
  Lines        7003     7039      +36     
==========================================
+ Hits         6663     6695      +32     
- Misses        340      344       +4
Impacted Files Coverage Δ
databricks/koalas/base.py 96.17% <100%> (+0.11%) ⬆️
databricks/koalas/indexing.py 93.26% <0%> (-1.01%) ⬇️
databricks/koalas/indexes.py 96.24% <0%> (-0.12%) ⬇️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
databricks/koalas/missing/indexes.py 100% <0%> (ø) ⬆️
databricks/koalas/frame.py 96.82% <0%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63d42c3...c04be15. Read the comment docs.

pser.index.value_counts(ascending=True, dropna=False), almost=True)

# Series with a MultiIndex where some index entries are NaN.
# This test is only available for pandas >= 0.24.
Collaborator

Why are these only for pandas >= 0.24?

@itholic (Contributor, Author), Dec 13, 2019

Because pandas < 0.24 doesn't support None in a MultiIndex, as shown below:

>>> pd.__version__
'0.23.0'
>>> pidx = pd.MultiIndex.from_tuples([('x', 'a'), None, ('y', 'c')])
Traceback (most recent call last):
...
TypeError: object of type 'NoneType' has no len()

So I think this test should only be run on pandas >= 0.24.
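For comparison, a hedged sketch of the same construction on pandas >= 0.24 (version string and exact behavior assumed; the point is only that the None tuple no longer raises):

>>> pd.__version__
'0.24.2'
>>> pidx = pd.MultiIndex.from_tuples([('x', 'a'), None, ('y', 'c')])  # accepted, no TypeError
>>> len(pidx)
3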

Comment on lines 249 to 253
set_arrow_conf = False
if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
        default_session().conf.get("spark.sql.execution.arrow.enabled") == "true":
    default_session().conf.set("spark.sql.execution.arrow.enabled", "false")
    set_arrow_conf = True
Collaborator

Why do we need this?
If we need this, could you make sure to set it back to the original conf?

@itholic (Contributor, Author)

We need this to test on PySpark 2.3, for the same reason as #1116 (comment).

I fixed the test logic to make sure the original conf value is set back.

Thanks for the review :)
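A minimal sketch of that clean-up pattern, reusing the names from the quoted snippets (default_session, self._test_value_counts); this is not the exact code in the PR:

original_conf = default_session().conf.get("spark.sql.execution.arrow.enabled")
try:
    # Arrow execution breaks this path on PySpark < 2.4, so turn it off for the test.
    default_session().conf.set("spark.sql.execution.arrow.enabled", "false")
    self._test_value_counts()
finally:
    # Always restore whatever value the session had before the test ran.
    default_session().conf.set("spark.sql.execution.arrow.enabled", original_conf)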

@softagram-bot

Softagram Impact Report for pull/1116 (head commit: c04be15)

⚠️ Copy paste found

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 307, 329:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normaliz...(truncated 1204 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 329, 351:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                    ...(truncated 1114 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 329, 351, 375:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                    ...(truncated 1114 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 307, 351, 375:

        kser = ks.from_pandas(pser)

        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normaliz...(truncated 1085 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 256, 308, 330, 352, 376:


        self.assert_eq(kser.value_counts(normalize=True),
                       pser.value_counts(normalize=True), almost=True)
        self.assert_eq(k...(truncated 1039 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 295, 317, 339:


        self.assert_eq(kser.index.value_counts(normalize=True),
                       pser.index.value_counts(no...(truncated 595 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 265, 295, 361, 385:


        self.assert_eq(kser.index.value_counts(normalize=True),
                       pser.index.value_counts(nor...(truncated 505 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 440, 457:

                     pd.Series([True, False], name='x'),
                     pd.Series([0, 1], name='x'),
                     pd.Series([1, 2,...(truncated 330 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 884, 1025:

        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2]...(truncated 280 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 885, 1026, 1062:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 256 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 856, 1059:


        # For MultiIndex
        midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                             [[...(truncated 167 chars)

ℹ️ test_series.py: Copy paste fragment inside the same file on lines 859, 885, 1026:

                              ['speed', 'weight', 'length']],
                             [[0, 0, 0, 1, 1, 1, 2, 2, 2],
                      ...(truncated 117 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

(Graph showing the changed files, dependency changes and the impact; open in Softagram Desktop for full details.)

💡 Insights

  • Co-change Alert: You modified test_series.py. Often series.py (databricks/koalas) is modified at the same time.

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

        RuntimeError,
        lambda: ks.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')]).value_counts())
    else:
        self._test_value_counts()
Collaborator

Could you disable arrow execution only when pyspark < 2.4 and for MultiIndex?

@itholic (Contributor, Author), Dec 18, 2019

Okay, I will 👍
Thanks!

Comment on lines +946 to +952
from databricks.koalas.indexes import MultiIndex
if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
        default_session().conf.get("spark.sql.execution.arrow.enabled") == "true" and \
        isinstance(self, MultiIndex):
    raise RuntimeError("if you're using pyspark < 2.4, set conf "
                       "'spark.sql.execution.arrow.enabled' to 'false' "
                       "for using this function with MultiIndex")
Collaborator

Btw, I think we should do this only in MultiIndex by overriding.

@itholic (Contributor, Author)

Ah, that seems better. Thanks for the advice!
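A hedged sketch of that suggestion, reusing the names from the quoted snippet above; the signature follows pandas' value_counts, and this is not the code that was actually merged:

class MultiIndex(Index):
    def value_counts(self, normalize=False, sort=True, ascending=False,
                     bins=None, dropna=True):
        # Fail fast on PySpark < 2.4 with Arrow enabled; only MultiIndex is affected,
        # so the check lives here instead of in the shared base implementation.
        if LooseVersion(pyspark.__version__) < LooseVersion("2.4") and \
                default_session().conf.get("spark.sql.execution.arrow.enabled") == "true":
            raise RuntimeError("if you're using pyspark < 2.4, set conf "
                               "'spark.sql.execution.arrow.enabled' to 'false' "
                               "for using this function with MultiIndex")
        return super().value_counts(normalize=normalize, sort=sort, ascending=ascending,
                                    bins=bins, dropna=dropna)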

@HyukjinKwon (Member) left a comment

@HyukjinKwon (Member)

Let me make a follow-up.

@HyukjinKwon HyukjinKwon merged commit 5a950c0 into databricks:master Dec 18, 2019
@itholic (Contributor, Author) commented Dec 18, 2019

@HyukjinKwon ah, thanks! Can you cc me after it's finished?

HyukjinKwon added a commit that referenced this pull request Dec 18, 2019
@itholic itholic deleted the fix_i_value_counts branch December 20, 2019 04:14
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023