Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the API consistent with Statistics.jl/StatsBase.jl #9

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

sethaxen
Copy link

@sethaxen sethaxen commented May 9, 2023

This PR makes a number of breaking changes to make the API of circ_foo more consistent with its corresponding function foo (if defined) in Statistics/StatsBase.

Main changes

Supporting StatsBase.AbstractWeights

Weighted scalar statistics are handled in StatsBase using StatsBase.AbstractWeights types, which can represent not only frequency weights but also other kinds of weights. Unlike the current API, weights in StatsBase are always vectors, so when they are used, the dims keyword is not supported, and instead an optional positional argument dim::Int is provided, specifying the single dimension that will be reduced to a singleton. i.e., the signatures are

foo(x::Any; dims=Colon(), kwargs...)  # reduce over specified dimensions of `x`, defaulting to flattening
foo(x::AbstractArray, w::AbstractWeights, dim::Int=1, kwargs...)  # weighted reduction over `dim`

This PR adopts the same signatures for circ_foo, which unfortunately does require some code duplication. Future refactors could reduce this code duplication.

Avoiding recomputing circ_mean and circ_r

When a function requires r or mean, it now accepts one or both of these as a keyword argument, allowing for some speed-ups. See how circ_stats does this for example. It also adds circ_mean_and_r, analogous to mean_and_var and mean_and_std in StatsBase.

Return a single statistic

Now circ_mean returns just the mean, while circ_std and circ_var accept a kind keyword to specify the kind of statistic to return. It would be nice if we could do something similar for circ_skewness or circ_kurtosis as well. circ_moment could be similarly simplified to return the complex moment, but as this isn't consistent with the other moment functions, this doesn't seem ideal.

What's left

  • implement API changes for scalar statistics
  • implement API changes for statistical tests
  • implement any additional API changes to circ_skewness, circ_kurtosis, and circ_moment
  • update tests to use AbstractWeights weight types.

Discussion

Before this PR, this package supported multidimensional arrays of weights. Is there a clear use case for this? JuliaStats/StatsBase.jl#776 discusses adding something similar to StatsBase, but if this happens, it wouldn't happen for some time.

cc @huangziwei, @Meteore

@sethaxen
Copy link
Author

sethaxen commented May 9, 2023

@huangziwei this is marked as a draft because it is not yet complete, but before I continue, it would be helpful to have a chat (synchronous or asynchronous on here) about the discussion points highlighted above.

@huangziwei
Copy link

@sethaxen cool! I will have a look later and get back to you here soon!

@babaq
Copy link
Collaborator

babaq commented May 11, 2023

@sethaxen This PR is a good start to polish and push further this package, however, it wouldn't necessarily be a breaking version.

Supporting AbstractWeights would better be just adding methods, without touching old methods. To avoid confusing usage, deprecation can be introduced in several versions to ease the transition.

The other thing i had planed is to improve the documentation and CI testing, right now, doc only explain the function signature, without examples, plots. The formula used only mentioned of their reference in comments. it would be good to add the precise math Latex in doc and full references. This can promote the packages among several other similar ones and provide learning materials for users too. It should also be straightforward to add GitHub flows with code coverage.

@sethaxen
Copy link
Author

@sethaxen This PR is a good start to polish and push further this package, however, it wouldn't necessarily be a breaking version.

Supporting AbstractWeights would better be just adding methods, without touching old methods. To avoid confusing usage, deprecation can be introduced in several versions to ease the transition.

I agree that methods taking AbstractWeights positional arguments could be added in a non-breaking way. However, changing the return types of circ_mean, circ_var, and circ_std to return a single statistic would be breaking. If necessary, I can separate the AbstractWeights methods into their own non-breaking PR.

The other thing i had planed is to improve the documentation and CI testing, right now, doc only explain the function signature, without examples, plots. The formula used only mentioned of their reference in comments. it would be good to add the precise math Latex in doc and full references. This can promote the packages among several other similar ones and provide learning materials for users too. It should also be straightforward to add GitHub flows with code coverage.

Yes this would be nice. The implementations themselves also need some updates to avoid unnecessary allocations, type-promotions, and type-instability, but these updates are orthogonal to this PR, which strictly focuses on API updates.

@huangziwei
Copy link

gosh, time flies so fast and it's been a year, but I still haven't got the time to go through this (and I still can't properly review it as I still don't do Julia at all...). I'd suggest @sethaxen you take over the Julia version (no need to involve me, for now) and we only communicate on the keeping API design level for consistency, if necessary?

@sethaxen
Copy link
Author

@huangziwei, sorry for the delay in replying. I opened this PR as part of my work at the @mlcolab and intended it as part of a larger effort we had discussed last year to improve consistency with both the Julia stats ecosystem and with other circstats packages (where applicable). However, priorities have since changed. I will soon be moving on from my position at the colab and no longer plan to continue the effort; thus it would not be ideal for me to make changes to the package if no-one else is able to maintain those changes.

If there's future interest in your group (or from someone else) in refreshing this package, then I think this PR is still a good starting point for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants