Upgradeable Statistics stdlib #46501
This copies the code for these three methods from the Statistics stdlib module. Only changes are:

- add compatibility note to docstrings
- remove references to `Statistics`
- change tests to use the `mean` keyword argument instead of `stdm`/`varm`
- do not test sparse matrices as they are not available in Base
base/statistics.jl
Outdated
    2.2285192400943226
    ```
    """
    mean(f, A::AbstractArray; dims=:) = _mean(f, A, dims)
If you want to be able to dispatch on a keyword, how strange would it be to build in a hook for licensed piracy... something like:
    - mean(f, A::AbstractArray; dims=:) = _mean(f, A, dims)
    + mean(f, A::AbstractArray; dims=:, weights=nothing) = _mean(f, A, dims, weights)
Yeah that's the kind of pattern I had in mind. But let's leave that question for later as moving Statistics out would already be a big achievement.
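For readers unfamiliar with the pattern being discussed, here is a minimal sketch of how a keyword argument can be turned into something dispatchable. All names (`mymean`, `_mymean`, `MyWeights`) are hypothetical stand-ins, and this is not the code in Base or Statistics; it only assumes that Julia does not dispatch on keyword arguments, so the public function forwards them positionally to an internal function that other packages could extend.

```julia
# Hypothetical names throughout; a sketch of the pattern, not the actual Base/Statistics code.

# Public entry point: keyword arguments do not take part in dispatch,
# so they are forwarded as positional arguments to an internal hook.
mymean(f, A::AbstractArray; dims=:, weights=nothing) = _mymean(f, A, dims, weights)

# Unweighted methods, as the defining module would provide them.
_mymean(f, A, ::Colon, ::Nothing) = sum(f, A) / length(A)
_mymean(f, A, dims::Integer, ::Nothing) = sum(f, A; dims=dims) ./ size(A, dims)

# A weights type that a separate package could own.
struct MyWeights{T<:Real}
    values::Vector{T}
end

# That package adds a method to the hook, selected by the type of `weights`.
function _mymean(f, A, ::Colon, w::MyWeights)
    length(A) == length(w.values) ||
        throw(DimensionMismatch("weights must match the data length"))
    return sum(f(a) * wi for (a, wi) in zip(A, w.values)) / sum(w.values)
end
```

With these definitions, `mymean(identity, x; weights=MyWeights(w))` reaches the weighted method, while calls without `weights` keep using the unweighted ones. The trade-off raised in the thread is that the hook is an internal function that outside packages would be extending.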
In my opinion, adding 438 lines of statistics code to Base kind of defeats the point of excising the Statistics stdlib, because we'll still have 438 lines of code that we can't make breaking changes to, etc. For the most common use cases, e.g. There's even less friction for adding packages now, since doing
Also, what makes
I mostly agree. We could also just remove Statistics for now and see how strongly people complain. As you say, with Pkg improvements, installing an external package now only requires typing That said, the previous attempt ended up being reverted (see #27374), so we kind of know that people do complain. The situation may be a bit different now given that even sparse matrix support has been moved out of the stdlib.
@nalimilan - thank you for your effort. Could you please summarize:
Thank you!
People starting to use
This pattern as a replacement of
mean, var and std are considered common functions, and not special for statistics. People have a very strong objection to needing to install a package for mean, especially. I personally tend to agree with that view. Sparse matrices are still far less common than mean. Also, the sparse package is not out of stdlib yet, and has been brought back into the system image until we have conditional dependencies.
I'll concede that But this PR adds a lot of other methods as well. It's harder for me to buy the argument that all of those methods would be considered common/non-statistical to most people. I'm a little concerned about the slippery slope here. If we add those methods, then why not add weights? It's not clear to me why some functionality gets to be included, and other functionality is excluded.
On a separate note, I'm not sure why we need to add
While waiting for a comment by @nalimilan about the long-term plan, I think that if we agree
I think I can probably be talked into adding just But with a long-term plan that makes it explicit that we won't be adding any additional methods (including positional args or keyword args) to
Also, I'd be happier with
We can bring the weights in as well. I would not be opposed to that, but I suspect that would require bringing in a whole lot of other machinery around weights that is better managed outside. NumPy actually goes much further in what it considers common functions and has a whole statistics module: https://numpy.org/doc/stable/reference/routines.statistics.html. Now NumPy is itself a Python package, so it is a little bit apples and oranges.
BTW, the one thing that I am not sure of is whether, with the package manager improvements in the last 4 years, people will have the same reaction to mean not being in Base as they did in 2018. I suspect there will be less opposition, but there will still be quite a lot of discontent. What if Base had, say,
What is
@bkamins As I see it, the goal is to have Statistics as a standalone package to which we will be able to move most or all of the StatsBase features (details to be discussed, we probably don't want to import deprecated API), and to deprecate StatsBase in the long term. For Julia 1.9 probably the only change will be that Statistics is no longer an stdlib module but a separate package, as importing things from StatsBase will take some work. Nothing should break for users of StatsBase and Statistics, except for having to install Statistics and add
I believe JuliaCon 2023 could be the right timeframe for the next LTS. Hopefully we have conditional dependencies in place by then, and can even move out SparseArrays and perhaps a few other things as well for the new LTS.
@DilumAluthge @ViralBShah The reason why I think weighted stats should not be included is that it would pull in a whole machinery of weight vector types which really cannot live in Base as they are too specialized. That's not a problem for A better comparison than NumPy is probably Python's statistics standard library:
It is very similar to our current Statistics stdlib, but it also provides mode, harmonic and geometric mean, and basic linear regression. So if we took Python as a reference we would favor the status quo. :-) But the situation isn't ideal in Python as NumPy provides separate definitions, so you have multiple functions to do the same thing. Conversely, Rust, Go and Swift don't provide any statistical functions in their standard libraries, not even
@ViralBShah - is changing a module in which some name is defined (
I use basic statistical functions in my work in signal processing all the time, but I am not a professional statistician. I am not concerned about having to import a package to use Recent discussions over in Discourse (see for example here) make it clear that the desire of the professional statisticians in the community is to adapt Statistics.jl to meet their needs. Among other things, there is a clear reluctance to continue to support functions that take anything other than tables. Such changes would make it difficult and inconvenient for me (and I suspect, many others) to use that package in my own projects. My concrete proposal is to excise statistics from stdlib if you must, but give a different name to the new package (maybe BasicStats or ClassicStats [or NonSeriousStats 😉 ]). Then the serious statisticians would be free to drive Statistics.jl forward, while the rest of us would continue to use the more basic functions.
Perhaps this is more of a @KristofferC or @StefanKarpinski question. I believe that since these are exported names, the module name is less important. And of course, the Statistics.jl package can still provide compatibility.
Speaking for the serious statisticians: we have no interest in developing the package in a direction where it stops being practical
Right, we can probably revert this now. So the PR becomes really minimal! :-) I've pushed a commit to do that.
    @@ -57,6 +57,9 @@ Standard library changes

    #### Dates

    #### Statistics

    * Statistics is now an upgradeable standard library.([#46501])
    - * Statistics is now an upgradeable standard library.([#46501])
    + * Statistics is now an upgradeable standard library ([#46501]).
So I think the only real thing left to do is steps 4 and 5 in #50697
Ah and I see in #45540 we did remove it from the spdx file.
Step 4: JuliaRegistries/General#89713
Right, though as I noted the situation was different at the time since the module was completely removed. Maybe we just forgot to revert that.
Just to make sure: with all these updates, are we still going to merge Statistics.jl and StatsBase.jl as the last step some time in the future?
I hope so. One complication will be that we probably won't be able to add dependencies to Statistics, which could block porting some features.
That is why I am asking. I think we should have some plan prepared for this. So that we do not end up with a situation where we cannot achieve what we wanted to do in the first place. In particular - where things that cannot go to Statistics.jl should be moved (as I assume leaving leftovers in StatsBase.jl is not a good idea). If no one has thought about it yet - please let me know and I can try to check.
At any rate, making Statistics upgradable is an improvement. But yeah, it would be good to list at JuliaStats/Statistics.jl#87 features that rely on dependencies and see what we can do about them. Thanks for offering your help!
If Statistics becomes an upgradeable bundled library (sounds great to me!) then can we leave Any complaints about discoverability can be solved nicely by adding an exception hint for the names of common stats functions or for all names that are exported from the environment/some core list of packages/whatever.
mean
from it
Yes, that's what the PR does now.
I will briefly be the voice of Alan here and say that he detests the removal of I think I hear about it every other meeting
Step 5 (JuliaLang/Pkg.jl#3587) has just been completed, so it looks like we're good?
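The exception-hint idea mentioned above could look roughly like the following sketch. It uses `Base.Experimental.register_error_hint`, which is an experimental API; the list of names and the message text are hypothetical, and whether hints are actually displayed for `UndefVarError` depends on the Julia version, so treat this as an illustration and not as what Base or Statistics ships.

```julia
# Hypothetical list of names and hint text; a sketch only.
const STATS_NAMES = (:mean, :median, :std, :var)

# Handler: receives the output stream and the exception being displayed.
function stats_hint(io, exc::UndefVarError)
    if exc.var in STATS_NAMES
        print(io, "\nHint: `", exc.var,
              "` is provided by the Statistics package; try `using Statistics`.")
    end
    return nothing
end

# Register only if this Julia version offers the experimental hint registry.
# In a package, this call would normally go in the module's `__init__`.
if isdefined(Base.Experimental, :register_error_hint)
    Base.Experimental.register_error_hint(stats_hint, UndefVarError)
end
```

The handler prints nothing for names outside the list, so unrelated `UndefVarError`s are unaffected.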
This finishes the removal of the Statistics stdlib started at #45594, to make it a separate package. We will then be able to merge Statistics and StatsBase to avoid splitting basic stats functions across multiple modules.
Since there are concerns that users will complain about the need to install a package just to compute the mean and standard deviation, the PR also copies the code for `mean`, `std` and `var` and exports these functions from Base. This is yet one more step in a saga started by #27152. I'm not too happy about this since the choice of functions that live in Base is kind of arbitrary. In particular, it's not great that weighted methods defined in StatsBase (to be merged with Statistics) are documented in a different place from the unweighted ones, and we had plans to make weights a keyword argument, which makes dispatching on weights types defined in StatsBase tricky if functions are defined in Base. But well... at least the situation after the PR is less confusing than having both Statistics and StatsBase.

Code is just copied from Statistics; the only changes are:

- add compatibility note to docstrings
- remove references to `Statistics`
- change tests to use the `mean` keyword argument instead of `stdm`/`varm`
- do not test sparse matrices as they are not available in Base
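As an illustration of the test change described in the PR, the sketch below compares the positional `stdm`/`varm` functions with the `mean` keyword of `std`/`var`. It assumes the Statistics package (or stdlib) is available in the environment; the data vector is made up for the example.

```julia
using Statistics  # assumes Statistics is installed or bundled

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
m = mean(x)

# Variants that take a precomputed mean as a positional argument.
s_positional = stdm(x, m)
v_positional = varm(x, m)

# Equivalent calls that pass the precomputed mean as a keyword argument.
s_keyword = std(x; mean=m)
v_keyword = var(x; mean=m)
```

Both forms skip recomputing the mean; the keyword form avoids the separate `stdm`/`varm` function names.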