Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ordered a type parameter #184

Closed
ablaom opened this issue Feb 22, 2019 · 9 comments
Closed

Make ordered a type parameter #184

ablaom opened this issue Feb 22, 2019 · 9 comments

Comments

@ablaom
Copy link

ablaom commented Feb 22, 2019

Perhaps it is too late, but one can make the case that ordered should be a type-parameter of CategoricalString and CategoricalValue. In many contexts, one should like to be able to distinguish the "scientific type" of objects ("OrderedFactor" vs "UnorderdFactor") from the Julia types of the objects one is using to represent them. At the moment this distinction is only possible at the level of the objects themselves.

In MLJ we define formal scientific types for the purposes of matching models to "tasks". One would like to define a (partial-order preserving) function scitype from Julia types to scientific types (to express our conventions about how the scientific types should be represented) but this does not work out for the above reason. (It also does not work out because the number of classes is also not a type parameter of cat values/strings. From our point-of-view, the levels of a pool ought to be immutable, but I imagine there are other use cases that require mutability?)

The practical downfall for us is that we can determined scientific types of the columns of a Tables.jl table from schema(table).eltypes. Instead, we have to dig inside of the table to get actual elements to test.

@nalimilan
Copy link
Member

Actually we originally had NominalVector/NominalValue and OrdinalVector/OrdinalValue, which were merged into a single type because it wasn't very convenient (see #15 and d32441a). In particular if you created a NominalVector using e.g. CSV.read, you could not mark it later as being ordered without creating a new array (and marking it as ordinal from the beginning would be dangerous since the order of levels will likely be incorrect until you set it manually). Also from a technical perspective there's no advantage in compiling all functions twice as the code paths are identical.

I guess we could make it a type parameter if that's really useful, but we would have to drop the ordered! function...

(It also does not work out because the number of classes is also not a type parameter of cat values/strings. From our point-of-view, the levels of a pool ought to be immutable, but I imagine there are other use cases that require mutability?)

I think this illustrates that it's not really possible to know all properties of a variable without having access to its data. You can read JuliaStats/DataArrays.jl#73 for context about the design of CategoricalArrays. One can always use an enum instead if the pool is known statically, but that implies recompiling functions for each pool, which isn't very practical.

The practical downfall for us is that we can determined scientific types of the columns of a Tables.jl table from schema(table).eltypes. Instead, we have to dig inside of the table to get actual elements to test.

Is that really a problem though? StatsModels does that in JuliaStats/StatsModels.jl#71, that seems to be OK.

@kleinschmidt
Copy link
Contributor

Just to add to what @nalimilan has already said, in working on StatsModels we found that we can't even rely on the metadata in the categorical pool, because users often want to fit models based on a subset of their data, which means that the pool will contain too many levels and we'll generate model matrices that are rank-deficient. One of the goals of Terms2.0 is to clearly separate the metadata/type information necessary to create stand-alone representations of data transformations (e.g., number and values of unique levels for categorical data) from the data source itself. I think asking the data store to contain all that information for you isn't going to work: it's brittle and it's always possible that new demands about what metadata is necessary will crop up. Better to explicitly recognize that there will always be a data cleaning/transformation/schema extraction step. Just my $0.02!

@ablaom
Copy link
Author

ablaom commented Feb 24, 2019

While I stand by my objections, I can see that a change at this point is probably unrealistic. The annoyance for my use case is probably going to be small anyhow.

Thanks for the comments!

@ablaom ablaom closed this as completed Feb 24, 2019
@nalimilan
Copy link
Member

OK. Let us know about any problems though, that's always interesting.

@Nosferican
Copy link
Contributor

I believe the difference between nominal and ordinal is persistent regardless of changing the pool, subset, and whatnot. In this case, I do not see any argument against having it part of the metadata. It is useful, cheap, and persistent.

@nalimilan
Copy link
Member

I highlighted one argument above: you would need to create a new array if you want to make it ordered, e.g. after reading a CSV file.

@Nosferican
Copy link
Contributor

I guess from a metadata perspective it is already embedded at, obj.pool.ordered, no?

@jtrakk
Copy link

jtrakk commented Oct 10, 2020

you would need to create a new array if you want to make it ordered

Would it be possible to make the OrdinalArray or CyclicArray object simply wrap the CategoricalArray, so it could be an O(1) operation instead of O(n)?

@nalimilan
Copy link
Member

Yes that would be very easy. It's just a bit inconvenient to have to replace an array instead of just changing its properties.

There's also the deeper problem that when concatenating two ordered arrays, you don't know without know their levels whether their levels are equal or at least have compatible orders. So if you make orderedness part of the type, you get a type instability: you may have to return either an ordered or an unordered array -- unless you also put levels in the type, which is a no-go given the amount of recompilation it would trigger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants