Make `ordered` a type parameter #184
Actually we originally had I guess we could make it a type parameter if that's really useful, but we would have to drop the
I think this illustrates that it's not really possible to know all properties of a variable without having access to its data. You can read JuliaStats/DataArrays.jl#73 for context about the design of CategoricalArrays. One can always use an enum instead if the pool is known statically, but that implies recompiling functions for each pool, which isn't very practical.
Is that really a problem though? StatsModels does that in JuliaStats/StatsModels.jl#71, and that seems to be OK.
Just to add to what @nalimilan has already said, in working on StatsModels we found that we can't even rely on the metadata in the categorical pool, because users often want to fit models based on a subset of their data, which means that the pool will contain too many levels and we'll generate model matrices that are rank-deficient. One of the goals of Terms2.0 is to clearly separate the metadata/type information necessary to create stand-alone representations of data transformations (e.g., number and values of unique levels for categorical data) from the data source itself. I think asking the data store to contain all that information for you isn't going to work: it's brittle and it's always possible that new demands about what metadata is necessary will crop up. Better to explicitly recognize that there will always be a data cleaning/transformation/schema extraction step. Just my $0.02!
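To make the subset problem concrete, here is a minimal sketch, assuming a recent CategoricalArrays.jl release; the data and variable names are made up for illustration. It shows that the levels reported for a subset come from the original pool rather than from the rows actually kept:

```julia
using CategoricalArrays

# Hypothetical column with three levels.
x = categorical(["low", "mid", "high", "low"])

# Keep only the rows that contain "low".
sub = x[[1, 4]]

# The subset still reports every level of the original pool,
# so a model matrix built from `levels(sub)` would get unused columns.
pool_levels = levels(sub)

# Dropping unused levels on a copy recovers the levels actually observed.
observed = droplevels!(copy(sub))
observed_levels = levels(observed)
```

This is why an explicit schema-extraction step on the data actually being modeled is more reliable than trusting the pool metadata.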
While I stand by my objections, I can see that a change at this point is probably unrealistic. The annoyance for my use case is probably going to be small anyhow. Thanks for the comments!
OK. Let us know about any problems though, that's always interesting.
I believe the difference between nominal and ordinal is persistent regardless of changing the pool, subset, and whatnot. In this case, I do not see any argument against having it as part of the metadata. It is useful, cheap, and persistent.
I highlighted one argument above: you would need to create a new array if you want to make it ordered, e.g. after reading a CSV file.
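For reference, a small sketch of the in-place workflow that would be lost, assuming a recent CategoricalArrays.jl release. The `grade` column is hypothetical and stands in for data just read from a CSV file:

```julia
using CategoricalArrays

# Stand-in for a column freshly read from a CSV file (hypothetical data).
grade = categorical(["B", "A", "C", "A"])
type_before = typeof(grade)
was_ordered = isordered(grade)      # arrays are unordered by default

# Set the level order explicitly, then flip the flag on the same array.
levels!(grade, ["C", "B", "A"])
ordered!(grade, true)

now_ordered = isordered(grade)
type_after = typeof(grade)          # same Julia type as before
b_below_a = grade[1] < grade[2]     # "B" < "A" under the chosen order
```

Because orderedness is a property of the array rather than of its type, no new array has to be allocated and no column has to be replaced in the enclosing table.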
I guess from a metadata perspective it is already embedded at,
Would it be possible to make the
Yes, that would be very easy. It's just a bit inconvenient to have to replace an array instead of just changing its properties. There's also the deeper problem that when concatenating two ordered arrays, you don't know without knowing their levels whether their levels are equal or at least have compatible orders. So if you make orderedness part of the type, you get a type instability: you may have to return either an ordered or an unordered array -- unless you also put levels in the type, which is a no-go given the amount of recompilation it would trigger.
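The type instability can be sketched in plain Julia. `ToyCat` and `toycat` below are hypothetical illustrations of the proposed design, not part of CategoricalArrays.jl, and falling back to an unordered result is just one possible policy for incompatible levels:

```julia
# Hypothetical toy type: orderedness is encoded as a type parameter.
struct ToyCat{Ordered}
    levels::Vector{String}
    refs::Vector{Int}
end

# Re-express the refs of `x` as indices into the level vector `lv`.
remap(x::ToyCat, lv) = Int[findfirst(==(x.levels[r]), lv) for r in x.refs]

# Concatenation can only stay ordered if both inputs agree on the level
# order, and that is a runtime property of the data, not of the types.
function toycat(a::ToyCat{true}, b::ToyCat{true})
    if a.levels == b.levels
        return ToyCat{true}(a.levels, vcat(a.refs, b.refs))
    else
        lv = union(a.levels, b.levels)
        return ToyCat{false}(lv, vcat(remap(a, lv), remap(b, lv)))
    end
end

a = ToyCat{true}(["low", "high"], [1, 2])
b = ToyCat{true}(["low", "high"], [2, 2])
c = ToyCat{true}(["high", "low"], [1, 2])

same_order = toycat(a, b)   # levels match, result stays ordered
mixed_order = toycat(a, c)  # levels conflict, result becomes unordered
```

Both calls have the same argument types, yet they return values of different concrete types, so the compiler cannot infer a single return type for `toycat`.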
Perhaps it is too late, but one can make the case that `ordered` should be a type parameter of `CategoricalString` and `CategoricalValue`. In many contexts, one would like to be able to distinguish the "scientific type" of objects ("OrderedFactor" vs "UnorderedFactor") from the Julia types of the objects one is using to represent them. At the moment this distinction is only possible at the level of the objects themselves.

In MLJ we define formal scientific types for the purposes of matching models to "tasks". One would like to define a (partial-order preserving) function `scitype` from Julia types to scientific types (to express our conventions about how the scientific types should be represented) but this does not work out for the above reason. (It also does not work out because the number of classes is not a type parameter of categorical values/strings either. From our point of view, the levels of a pool ought to be immutable, but I imagine there are other use cases that require mutability?)

The practical downfall for us is that we cannot determine the scientific types of the columns of a Tables.jl table from `schema(table).eltypes`. Instead, we have to dig inside the table to get actual elements to test.
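A short sketch of the limitation, assuming a recent CategoricalArrays.jl release. `toy_scitype` is a hypothetical helper written for this illustration and is not MLJ's actual `scitype`:

```julia
using CategoricalArrays

nominal = categorical(["a", "b", "a"])
ordinal = categorical(["a", "b", "a"], ordered=true)

# Both columns share one element type, so type information alone
# (such as what a table schema reports) cannot tell them apart.
same_eltype = eltype(nominal) == eltype(ordinal)

# Hypothetical helper: the distinction is only visible on the data itself.
toy_scitype(col) = isordered(col) ? :OrderedFactor : :UnorderedFactor

nominal_kind = toy_scitype(nominal)
ordinal_kind = toy_scitype(ordinal)
```

Since the helper has to receive the column (or one of its elements) rather than its type, it cannot be expressed as a mapping from Julia types to scientific types.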