Fix mod & rmod for matching with pandas. #1399

itholic · 2020-04-06T15:03:29Z

Resolves #1398

modulo calculation for negative numbers in Koalas is different from pandas', so fixed it.

For example.

When calculate the `10 % (-3)`.

1. the Koalas'(and the PySpark') way: (only wanna get the result as a positive number)

just switch the both numbers to positive and calculate.

10 % 3  # the result will be 1

2. the pandas' way:(the result can be a negative number)

first, switch the both numbers to positive and calculate just same as Koalas.

10 % 3  # the result will be 1

and then, plus the original negative number to the result

1 + (-3)  # the final result will be -2

You can refer the https://stackoverflow.com/questions/1082917/mod-of-negative-number-is-melting-my-brain for more detail about the difference of such a modulo.

itholic · 2020-04-06T15:11:39Z

I have no idea how can I handling the 'negative zero' like the below.

>>> pser
0    100.0
1      NaN
2   -300.0
3      NaN
4    500.0
5   -700.0
Name: Koalas, dtype: float64
>>> kser
0    100.0
1      NaN
2   -300.0
3      NaN
4    500.0
5   -700.0
Name: Koalas, dtype: float64

>>> pser.mod(150)
0    100.0
1      NaN
2      0.0
3      NaN
4     50.0
5     50.0
Name: Koalas, dtype: float64
>>> kser.mod(150)
0    100.0
1      NaN
2     -0.0  # << Here is the matter. how can we handle this negative zero?
3      NaN
4     50.0
5     50.0
Name: Koalas, dtype: float64

I tried to check the negative zero in the when condition and manage it, but It seems that PySpark doesn't distinguish 0.0 and -0.0.

Any idea?

codecov-io · 2020-04-06T15:29:15Z

Codecov Report

Merging #1399 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1399   +/-   ##
=======================================
  Coverage   95.24%   95.24%           
=======================================
  Files          34       34           
  Lines        7826     7830    +4     
=======================================
+ Hits         7454     7458    +4     
  Misses        372      372

Impacted Files	Coverage Δ
databricks/koalas/base.py	`97.55% <100.00%> (+0.04%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2579312...da68b8b. Read the comment docs.

HyukjinKwon · 2020-04-07T02:57:42Z

There have been ongoing issues about 0.0 vs -0.0 in Spark itself. I think you can avoid to address it here.

databricks/koalas/base.py

itholic · 2020-04-08T05:19:15Z

There have been ongoing issues about 0.0 vs -0.0 in Spark itself. I think you can avoid to address it here.

I made it works properly for every cases by fixing the existing calculation properly.

ueshin

LGTM.

ueshin · 2020-04-08T21:30:12Z

Thanks. merging.
Since I have a concern, I'll submit a follow-up PR soon.

This is a follow-up of #1399. When performing mod/rmod, if the operands are series from different dataframes, we needed three joins. ```py >>> kser = ks.Series([100, None, -300, None, 500, -700], name="Koalas") >>> (kser % ks.Series([150] * 6)).to_frame().explain() == Physical Plan == *(9) Project [CASE WHEN isnotnull(__index_level_0__#317L) THEN __index_level_0__#317L ELSE __index_level_0__#228L END AS __index_level_0__#378L, (Koalas#364 % cast(0#229L as double)) AS Koalas#425] +- SortMergeJoin [__index_level_0__#317L], [__index_level_0__#228L], FullOuter :- *(7) Sort [__index_level_0__#317L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(__index_level_0__#317L, 200) : +- *(6) Project [CASE WHEN isnotnull(__index_level_0__#254L) THEN __index_level_0__#254L ELSE __index_level_0__#228L END AS __index_level_0__#317L, (Koalas#303 + cast(0#229L as double)) AS Koalas#364] : +- SortMergeJoin [__index_level_0__#254L], [__index_level_0__#228L], FullOuter : :- *(4) Sort [__index_level_0__#254L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(__index_level_0__#254L, 200) : : +- *(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#228L END AS __index_level_0__#254L, (Koalas#1 % cast(0#229L as double)) AS Koalas#303] : : +- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#228L], FullOuter : : :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(__index_level_0__#0L, 200) : : : +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1] : : +- *(2) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(__index_level_0__#228L, 200) : : +- Scan ExistingRDD[__index_level_0__#228L,0#229L] : +- *(5) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0 : +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200) +- *(8) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0 +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200) ``` We can reduce the number to only one. ```py >>> (kser % ks.Series([150] * 6)).to_frame().explain() == Physical Plan == *(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#98L END AS __index_level_0__#118L, (((Koalas#1 % cast(0#99L as double)) + cast(0#99L as double)) % cast(0#99L as double)) AS Koalas#165] +- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#98L], FullOuter :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(__index_level_0__#0L, 200) : +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1] +- *(2) Sort [__index_level_0__#98L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(__index_level_0__#98L, 200) +- Scan ExistingRDD[__index_level_0__#98L,0#99L] ```

itholic added 2 commits April 6, 2020 23:46

Matching to pandas

ca7fa6c

Check if the result is zero or not

da68b8b

HyukjinKwon reviewed Apr 7, 2020

View reviewed changes

databricks/koalas/base.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Apr 7, 2020

View reviewed changes

databricks/koalas/base.py Outdated Show resolved Hide resolved

itholic mentioned this pull request Apr 7, 2020

Add divmod & rdivmod for Series #1397

Merged

Using proper logic

7599002

rebase to origin/master

72cb81d

itholic changed the title ~~Fix mod for matching with pandas.~~ Fix mod & rmod for matching with pandas. Apr 8, 2020

itholic changed the title ~~Fix mod & rmod for matching with pandas.~~ [WIP] Fix mod & rmod for matching with pandas. Apr 8, 2020

itholic added 4 commits April 8, 2020 17:05

fix

9b534b0

fix rmod

5c5427e

Relocate to where as it is

ecc97c0

fix modulo logic

27fd523

itholic changed the title ~~[WIP] Fix mod & rmod for matching with pandas.~~ Fix mod & rmod for matching with pandas. Apr 8, 2020

ueshin approved these changes Apr 8, 2020

View reviewed changes

ueshin merged commit 2554970 into databricks:master Apr 8, 2020

ueshin mentioned this pull request Apr 8, 2020

Reduce the number of joins for mod/rmod. #1409

Merged

itholic deleted the fix_mod branch April 9, 2020 00:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix mod & rmod for matching with pandas. #1399

Fix mod & rmod for matching with pandas. #1399

itholic commented Apr 6, 2020 •

edited

Loading

itholic commented Apr 6, 2020 •

edited

Loading

codecov-io commented Apr 6, 2020 •

edited

Loading

HyukjinKwon commented Apr 7, 2020

itholic commented Apr 8, 2020

ueshin left a comment

ueshin commented Apr 8, 2020

Fix mod & rmod for matching with pandas. #1399

Fix mod & rmod for matching with pandas. #1399

Conversation

itholic commented Apr 6, 2020 • edited Loading

For example.

When calculate the 10 % (-3).

1. the Koalas'(and the PySpark') way: (only wanna get the result as a positive number)

2. the pandas' way:(the result can be a negative number)

itholic commented Apr 6, 2020 • edited Loading

codecov-io commented Apr 6, 2020 • edited Loading

Codecov Report

HyukjinKwon commented Apr 7, 2020

itholic commented Apr 8, 2020

ueshin left a comment

Choose a reason for hiding this comment

ueshin commented Apr 8, 2020

itholic commented Apr 6, 2020 •

edited

Loading

When calculate the `10 % (-3)`.

itholic commented Apr 6, 2020 •

edited

Loading

codecov-io commented Apr 6, 2020 •

edited

Loading