Improve handling of NaN in ranks and rank correlations #659
Throw an error when NaN is encountered by rank functions, and make `corspearman` return `NaN`, instead of silently sorting them at the end. This is consistent with what `corkendall` and `cor` do.
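For readers unfamiliar with the behaviour being changed, here is a minimal sketch using only Base Julia and the `Statistics` standard library. `naive_ordinalrank` is a hypothetical helper written for illustration, not StatsBase's implementation. It shows how a rank derived from `sortperm` quietly places `NaN` last, whereas `cor` propagates `NaN`.

```julia
using Statistics  # standard library; provides `cor`

# Hypothetical helper for illustration only (not StatsBase code):
# ordinal ranks derived from the sort permutation.
function naive_ordinalrank(x::AbstractVector{<:Real})
    ranks = similar(x, Int)
    ranks[sortperm(x)] = 1:length(x)
    return ranks
end

x = [3.0, NaN, 1.0]

# `isless` orders NaN after every other float, so NaN silently gets the top rank.
naive_ordinalrank(x)       # [2, 3, 1]

# Pearson correlation propagates NaN instead of hiding it.
cor(x, [1.0, 2.0, 3.0])    # NaN
```

A Spearman correlation computed from such ranks would return a finite number for input containing `NaN`, which is the inconsistency with `cor` that this PR addresses.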
I'm not sure we need to throw an exception in the ranking code as a default option:
Suggestions:
In the But I'm pretty sure that my recent rewrite of Perhaps making a breaking change is a price worth paying for the consistency? Or perhaps I'm wrong to think I preserved the behaviour of
I'd tend to agree with @alyst about the ranking function. I think it is a bit excessive to throw on It would seem more natural to me to throw on the correlation functions with
OK. Actually this PR was partly prompted by https://github.com/JuliaLang/Statistics.jl/pull/72, where I really don't like when So I think that we shouldn't accept them silently by default. To avoid breaking code we could just print a warning for now, with an argument to silence it. Do you think that would be acceptable? That sounds consistent with what we do elsewhere. Regarding the implementation, indeed it would be better to support @PGS62 Yes, changing
I have checked what R does:
So the conclusions are:
For us I would say that the lesson is:
Then the question is what the default should be. I would error by default unless it degrades performance significantly. The reason is:
@bkamins Thanks for checking R's behaviour! I would still suggest not throwing from the ranking functions by default, unless there is a real NaN-related bug that can only be addressed properly in the ranking code and not in the calling context. It's just my (very subjective) observation that, while Julia equivalents are generally stricter than the tidyverse flavour of R, that strictness doesn't necessarily improve my code-writing efficiency (this is going OT; I only mention it to explain my motivation). So I'm a bit concerned about adding more throw-by-default cases.
Yes, this is a typical tension between "interactive work friendly" and "production code friendly" behavior. Julia in general follows the "production code friendly" approach, and R the opposite. However, maybe this is not the right approach in this case. I would not insist on throwing an error by default (especially since this is breaking). If we have a kwarg users will be prompted about the potential problem with
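To make the keyword-argument idea concrete, here is a minimal sketch in plain Julia. The function name `ordinalrank_checked`, the `nan` keyword and its three values are assumptions invented for illustration; they are not an API proposed in this PR or provided by StatsBase.

```julia
# Hypothetical sketch: name, keyword and accepted values are assumptions.
function ordinalrank_checked(x::AbstractVector{<:Real}; nan::Symbol = :warn)
    nan in (:error, :warn, :last) ||
        throw(ArgumentError("nan must be :error, :warn or :last"))
    if nan !== :last && any(isnan, x)
        # Strict mode: refuse to rank input containing NaN.
        nan === :error && throw(ArgumentError("NaN found in input"))
        # Default mode: keep the old result but make the problem visible.
        @warn "NaN found in input; ranking it last"
    end
    ranks = similar(x, Int)
    ranks[sortperm(x)] = 1:length(x)   # NaN sorts last under `isless`
    return ranks
end

ordinalrank_checked([3.0, NaN, 1.0]; nan = :last)   # silent, old behaviour
ordinalrank_checked([3.0, NaN, 1.0])                # same ranks, with a warning
```

With a design along these lines, existing code keeps working while interactive users are alerted, and production code can opt into `nan = :error`.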
I've reverted the controversial changes to concentrate on
OK - so I understand we produce
I think the handling of
In the case of corkendall, the fix would be:
but I have not looked at what would need to change in
Always using I guess we could use tricks to make this efficient, e.g. by checking whether other entries in the same column of the correlation matrix are
Fair point that the diagonal entries are not often useful. But an alternative solution is, once we detect a Something like this:
Having an inconsistency between
The philosophical perspective is that in See e.g. the difference in:
That's all correct, in On the other hand, if we (Julia) get
Confession: I was hoping that R would "agree with me" but it didn't!
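The last few comments concern the diagonal of a correlation matrix when a column contains `NaN`. As a rough sketch of the two conventions, assuming the choice is between `1` and `NaN` on the diagonal, the following self-contained function contrasts them. `pairwise_cor` and its `nan_diagonal` keyword are hypothetical names for illustration, not code from StatsBase or from this PR.

```julia
using Statistics  # standard library; provides `cor`

# Hypothetical sketch, not StatsBase's implementation.
function pairwise_cor(X::AbstractMatrix{<:Real}; nan_diagonal::Bool = false)
    n = size(X, 2)
    C = Matrix{Float64}(undef, n, n)
    # Scan each column once so the NaN check is not repeated per pair.
    hasnan = [any(isnan, col) for col in eachcol(X)]
    for j in 1:n, i in 1:j
        if i == j
            # Convention A: always 1. Convention B: NaN if the column has NaN.
            C[i, j] = (nan_diagonal && hasnan[j]) ? NaN : 1.0
        elseif hasnan[i] || hasnan[j]
            C[i, j] = C[j, i] = NaN
        else
            C[i, j] = C[j, i] = cor(view(X, :, i), view(X, :, j))
        end
    end
    return C
end

X = [1.0 2.0 NaN;
     2.0 1.0 1.0;
     3.0 4.0 2.0]

pairwise_cor(X)                       # entry [3, 3] is 1.0
pairwise_cor(X; nan_diagonal = true)  # entry [3, 3] is NaN
```

Precomputing `hasnan` is one way to keep the check cheap, in the spirit of the efficiency concern raised above.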
Fixes #657.