Fix mod & rmod for matching with pandas. #1399
I have no idea how to handle the 'negative zero' shown below.
>>> pser
0 100.0
1 NaN
2 -300.0
3 NaN
4 500.0
5 -700.0
Name: Koalas, dtype: float64
>>> kser
0 100.0
1 NaN
2 -300.0
3 NaN
4 500.0
5 -700.0
Name: Koalas, dtype: float64
>>> pser.mod(150)
0 100.0
1 NaN
2 0.0
3 NaN
4 50.0
5 50.0
Name: Koalas, dtype: float64
>>> kser.mod(150)
0 100.0
1 NaN
2 -0.0 # << This is the problem: how can we handle this negative zero?
3 NaN
4 50.0
5 50.0
Name: Koalas, dtype: float64
I tried to check the negative zero in the Any idea?
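For reference, here is a minimal sketch in plain Python (not Koalas or Spark) of where a negative zero like this can come from. It assumes the backend computes a truncated remainder the way `math.fmod` does, so the remainder of -300.0 by 150.0 is zero but keeps the sign of the dividend. `is_negative_zero` is a hypothetical helper name used only for illustration.

```python
import math


def is_negative_zero(x: float) -> bool:
    """Return True only for -0.0 (hypothetical helper, not a Koalas API)."""
    # -0.0 == 0.0 is True, so the sign has to be inspected explicitly.
    return x == 0.0 and math.copysign(1.0, x) < 0


# Truncated remainder keeps the sign of the dividend, even for a zero result.
truncated = math.fmod(-300.0, 150.0)
# Python's built-in % is a floored modulo; a zero result takes the divisor's sign.
floored = -300.0 % 150.0

print(is_negative_zero(truncated))  # True
print(is_negative_zero(floored))    # False
```

Because `-0.0 == 0.0` evaluates to true, an ordinary equality comparison cannot tell the two zeros apart, which is why the sketch inspects the sign with `math.copysign`.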
Codecov Report
@@ Coverage Diff @@
## master #1399 +/- ##
=======================================
Coverage 95.24% 95.24%
=======================================
Files 34 34
Lines 7826 7830 +4
=======================================
+ Hits 7454 7458 +4
Misses 372 372
There have been ongoing issues about
I made it work properly in every case by fixing the existing calculation.
LGTM.
Thanks. Merging.
This is a follow-up of #1399.

When performing mod/rmod, if the operands are series from different dataframes, we needed three joins.

```py
>>> kser = ks.Series([100, None, -300, None, 500, -700], name="Koalas")
>>> (kser % ks.Series([150] * 6)).to_frame().explain()
== Physical Plan ==
*(9) Project [CASE WHEN isnotnull(__index_level_0__#317L) THEN __index_level_0__#317L ELSE __index_level_0__#228L END AS __index_level_0__#378L, (Koalas#364 % cast(0#229L as double)) AS Koalas#425]
+- SortMergeJoin [__index_level_0__#317L], [__index_level_0__#228L], FullOuter
   :- *(7) Sort [__index_level_0__#317L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(__index_level_0__#317L, 200)
   :     +- *(6) Project [CASE WHEN isnotnull(__index_level_0__#254L) THEN __index_level_0__#254L ELSE __index_level_0__#228L END AS __index_level_0__#317L, (Koalas#303 + cast(0#229L as double)) AS Koalas#364]
   :        +- SortMergeJoin [__index_level_0__#254L], [__index_level_0__#228L], FullOuter
   :           :- *(4) Sort [__index_level_0__#254L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(__index_level_0__#254L, 200)
   :           :     +- *(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#228L END AS __index_level_0__#254L, (Koalas#1 % cast(0#229L as double)) AS Koalas#303]
   :           :        +- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#228L], FullOuter
   :           :           :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0
   :           :           :  +- Exchange hashpartitioning(__index_level_0__#0L, 200)
   :           :           :     +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1]
   :           :           +- *(2) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
   :           :              +- Exchange hashpartitioning(__index_level_0__#228L, 200)
   :           :                 +- Scan ExistingRDD[__index_level_0__#228L,0#229L]
   :           +- *(5) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
   :              +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200)
   +- *(8) Sort [__index_level_0__#228L ASC NULLS FIRST], false, 0
      +- ReusedExchange [__index_level_0__#228L, 0#229L], Exchange hashpartitioning(__index_level_0__#228L, 200)
```

We can reduce the number to only one.

```py
>>> (kser % ks.Series([150] * 6)).to_frame().explain()
== Physical Plan ==
*(3) Project [CASE WHEN isnotnull(__index_level_0__#0L) THEN __index_level_0__#0L ELSE __index_level_0__#98L END AS __index_level_0__#118L, (((Koalas#1 % cast(0#99L as double)) + cast(0#99L as double)) % cast(0#99L as double)) AS Koalas#165]
+- SortMergeJoin [__index_level_0__#0L], [__index_level_0__#98L], FullOuter
   :- *(1) Sort [__index_level_0__#0L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(__index_level_0__#0L, 200)
   :     +- Scan ExistingRDD[__index_level_0__#0L,Koalas#1]
   +- *(2) Sort [__index_level_0__#98L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(__index_level_0__#98L, 200)
         +- Scan ExistingRDD[__index_level_0__#98L,0#99L]
```
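The combined expression in the single-join plan, `((a % b) + b) % b`, can be sketched in plain Python. This is only an illustration: `math.fmod` stands in for Spark's `%` under the assumption that both compute a truncated remainder, and `floored_mod` is a hypothetical name, not a function from Koalas.

```python
import math


def floored_mod(a: float, b: float) -> float:
    """Emulate a floored modulo using only truncated remainders.

    Hypothetical helper mirroring ((a % b) + b) % b, with math.fmod
    standing in for a truncated-remainder `%` operator.
    """
    return math.fmod(math.fmod(a, b) + b, b)


values = [100.0, -300.0, 500.0, -700.0]
for a in values:
    # Python's built-in % is already a floored modulo, so it is the reference.
    assert floored_mod(a, 150.0) == a % 150.0
    print(a, floored_mod(a, 150.0))
```

Adding the divisor before the second remainder also turns an intermediate `-0.0` into a positive value, so in this sketch the final result for -300.0 is `0.0` and not `-0.0`.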
Resolves #1398
The modulo calculation for negative numbers in Koalas differs from pandas', so this PR fixes it.
For example, when calculating
10 % (-3)
1. the Koalas (and PySpark) way: the result comes out as a positive number
2. the pandas way: the result can be a negative number
See https://stackoverflow.com/questions/1082917/mod-of-negative-number-is-melting-my-brain for more detail on the difference between these two kinds of modulo.
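As a small illustration of the two conventions for `10 % (-3)`, here is a sketch in plain Python. It assumes the Koalas/PySpark behaviour corresponds to a truncated remainder, which `math.fmod` computes, while Python's built-in `%` follows the same floored convention as pandas.

```python
import math

dividend, divisor = 10, -3

# Truncated remainder: the result takes the sign of the dividend.
truncated = math.fmod(dividend, divisor)

# Floored modulo: the result takes the sign of the divisor.
floored = dividend % divisor

print(truncated)  # 1.0
print(floored)    # -2
```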