Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mod & rmod for matching with pandas. #1399

Merged
merged 8 commits into from
Apr 8, 2020
Merged

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Apr 6, 2020

Resolves #1398

modulo calculation for negative numbers in Koalas is different from pandas', so fixed it.

For example.

When calculate the 10 % (-3).

1. the Koalas'(and the PySpark') way: (only wanna get the result as a positive number)

  • just switch the both numbers to positive and calculate.
10 % 3  # the result will be 1

2. the pandas' way:(the result can be a negative number)

  • first, switch the both numbers to positive and calculate just same as Koalas.
10 % 3  # the result will be 1
  • and then, plus the original negative number to the result
1 + (-3)  # the final result will be -2

You can refer the https://stackoverflow.com/questions/1082917/mod-of-negative-number-is-melting-my-brain for more detail about the difference of such a modulo.

@itholic
Copy link
Contributor Author

itholic commented Apr 6, 2020

I have no idea how can I handling the 'negative zero' like the below.

>>> pser
0    100.0
1      NaN
2   -300.0
3      NaN
4    500.0
5   -700.0
Name: Koalas, dtype: float64
>>> kser
0    100.0
1      NaN
2   -300.0
3      NaN
4    500.0
5   -700.0
Name: Koalas, dtype: float64

>>> pser.mod(150)
0    100.0
1      NaN
2      0.0
3      NaN
4     50.0
5     50.0
Name: Koalas, dtype: float64
>>> kser.mod(150)
0    100.0
1      NaN
2     -0.0  # << Here is the matter. how can we handle this negative zero?
3      NaN
4     50.0
5     50.0
Name: Koalas, dtype: float64

I tried to check the negative zero in the when condition and manage it, but It seems that PySpark doesn't distinguish 0.0 and -0.0.

Any idea?

@codecov-io
Copy link

codecov-io commented Apr 6, 2020

Codecov Report

Merging #1399 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1399   +/-   ##
=======================================
  Coverage   95.24%   95.24%           
=======================================
  Files          34       34           
  Lines        7826     7830    +4     
=======================================
+ Hits         7454     7458    +4     
  Misses        372      372           
Impacted Files Coverage Δ
databricks/koalas/base.py 97.55% <100.00%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2579312...da68b8b. Read the comment docs.

@HyukjinKwon
Copy link
Member

There have been ongoing issues about 0.0 vs -0.0 in Spark itself. I think you can avoid to address it here.

@itholic
Copy link
Contributor Author

itholic commented Apr 8, 2020

There have been ongoing issues about 0.0 vs -0.0 in Spark itself. I think you can avoid to address it here.

I made it works properly for every cases by fixing the existing calculation properly.

@itholic itholic changed the title Fix mod for matching with pandas. Fix mod & rmod for matching with pandas. Apr 8, 2020
@itholic itholic changed the title Fix mod & rmod for matching with pandas. [WIP] Fix mod & rmod for matching with pandas. Apr 8, 2020
@itholic itholic changed the title [WIP] Fix mod & rmod for matching with pandas. Fix mod & rmod for matching with pandas. Apr 8, 2020
Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ueshin
Copy link
Collaborator

ueshin commented Apr 8, 2020

Thanks. merging.
Since I have a concern, I'll submit a follow-up PR soon.

@ueshin ueshin merged commit 2554970 into databricks:master Apr 8, 2020
ueshin added a commit that referenced this pull request Apr 9, 2020
This is a follow-up of #1399.

When performing mod/rmod, if the operands are series from different dataframes, we needed three joins.

```py
>>> kser = ks.Series([100, None, -300, None, 500, -700], name="Koalas")
>>> (kser % ks.Series([150] * 6)).to_frame().explain()
== Physical Plan ==
*(9) Project [CASE WHEN isnotnull(__index_level_0__#317L) THEN __index_level_0__#317L ELSE __index_level_0__#228L END AS __index_level_0__#378L, (Koalas#364 % cast(0#229L as double)) AS Koalas#425]
+- SortMergeJoin [__index_level_0__#317L], [__index_level_0__#228L], FullOuter
   :- *(7) Sort [__index_level_0__#317L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(__index_level_0__#317L, 200)
   :     +- *(6) Project [CASE WHEN isnotnull(__index_level_0__#254L) THEN __index_level_0__#254L ELSE __index_level_0__#228L END AS __index_level_0__#317L, (Koalas#303 + cast(0#229L as double)) AS Koalas#364]
   :        +- SortMergeJoin [__index_level_0__#254L], [__index_level_0__#228L], FullOuter
   :           :- *(4) Sort [__index_level_0__#254L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(__index_level_0__#254L, 200)
   :           :     +- *(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#228L END AS __index_level_0__#254L, (Koalas#1 % cast(0#229L as double)) AS Koalas#303]
   :           :        +- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#228L], FullOuter
   :           :           :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0
   :           :           :  +- Exchange hashpartitioning(__index_level_0__#0L, 200)
   :           :           :     +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1]
   :           :           +- *(2) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
   :           :              +- Exchange hashpartitioning(__index_level_0__#228L, 200)
   :           :                 +- Scan ExistingRDD[__index_level_0__#228L,0#229L]
   :           +- *(5) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
   :              +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200)
   +- *(8) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
      +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200)
```

We can reduce the number to only one.

```py
>>> (kser % ks.Series([150] * 6)).to_frame().explain()
== Physical Plan ==
*(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#98L END AS __index_level_0__#118L, (((Koalas#1 % cast(0#99L as double)) + cast(0#99L as double)) % cast(0#99L as double)) AS Koalas#165]
+- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#98L], FullOuter
   :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(__index_level_0__#0L, 200)
   :     +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1]
   +- *(2) Sort [__index_level_0__#98L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(__index_level_0__#98L, 200)
         +- Scan ExistingRDD[__index_level_0__#98L,0#99L]
```
@itholic itholic deleted the fix_mod branch April 9, 2020 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Different behaviour for mod between pandas and Koalas
4 participants