Make `ordered` a type parameter #184

ablaom · 2019-02-22T01:49:17Z

Perhaps it is too late, but one can make the case that `ordered` should be a type parameter of `CategoricalString` and `CategoricalValue`. In many contexts, one would like to be able to distinguish the "scientific type" of objects ("OrderedFactor" vs "UnorderedFactor") from the Julia types of the objects one is using to represent them. At the moment this distinction is only possible at the level of the objects themselves.

In MLJ we define formal scientific types for the purposes of matching models to "tasks". One would like to define a (partial-order preserving) function scitype from Julia types to scientific types (to express our conventions about how the scientific types should be represented) but this does not work out for the above reason. (It also does not work out because the number of classes is also not a type parameter of cat values/strings. From our point-of-view, the levels of a pool ought to be immutable, but I imagine there are other use cases that require mutability?)

The practical downfall for us is that we cannot determine the scientific types of the columns of a Tables.jl table from schema(table).eltypes. Instead, we have to dig inside of the table to get actual elements to test.
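To make the distinction concrete, here is a minimal sketch in plain Julia. The types `FieldCat` and `ParamCat` are hypothetical toy stand-ins, not the CategoricalArrays types, and this `scitype` is an illustrative function rather than the MLJ implementation. It shows why a type-level `scitype` only works when orderedness is part of the type.

```julia
# Toy stand-ins for illustration only; these are NOT CategoricalArrays types.
struct OrderedFactor end
struct UnorderedFactor end

# Design 1: orderedness is runtime data stored in a field.
struct FieldCat
    level::String
    ordered::Bool
end

# Design 2: orderedness is a type parameter.
struct ParamCat{Ordered}
    level::String
end

# With Design 2 the scientific type follows from the Julia type alone.
scitype(::Type{ParamCat{true}}) = OrderedFactor
scitype(::Type{ParamCat{false}}) = UnorderedFactor

# With Design 1 an actual value has to be inspected.
scitype(x::FieldCat) = x.ordered ? OrderedFactor : UnorderedFactor

col_param = [ParamCat{true}(s) for s in ["low", "high"]]
col_field = [FieldCat("low", true), FieldCat("high", true)]

scitype(eltype(col_param))   # decided from the element type
scitype(first(col_field))    # requires pulling out an element
```

In the second design the element type of the column is just `FieldCat`, so a schema that reports only element types carries no information about orderedness.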

nalimilan · 2019-02-22T09:06:21Z

Actually we originally had NominalVector/NominalValue and OrdinalVector/OrdinalValue, which were merged into a single type because it wasn't very convenient (see #15 and d32441a). In particular if you created a NominalVector using e.g. CSV.read, you could not mark it later as being ordered without creating a new array (and marking it as ordinal from the beginning would be dangerous since the order of levels will likely be incorrect until you set it manually). Also from a technical perspective there's no advantage in compiling all functions twice as the code paths are identical.
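The convenience argument can be sketched with two toy containers in plain Julia. `FlagArray`, `ParamArray`, `set_ordered!` and `set_ordered` are hypothetical names made up for this illustration and do not reproduce the CategoricalArrays internals. With a runtime flag the same object is updated in place; with a type parameter a new object of a different type has to be built.

```julia
# Toy containers for illustration only; not CategoricalArrays code.
mutable struct FlagArray
    codes::Vector{Int}
    levels::Vector{String}
    ordered::Bool
end

struct ParamArray{Ordered}
    codes::Vector{Int}
    levels::Vector{String}
end

# Runtime flag: mark the existing object as ordered, in place.
set_ordered!(a::FlagArray, flag::Bool) = (a.ordered = flag; a)

# Type parameter: the result has a different type, so a new object is needed.
# Existing references (for example a table column) still hold the old one.
set_ordered(a::ParamArray, flag::Bool) = ParamArray{flag}(a.codes, a.levels)

a = FlagArray([1, 2, 1], ["low", "high"], false)
b = set_ordered!(a, true)      # b is the very same object as a

p = ParamArray{false}([1, 2, 1], ["low", "high"])
q = set_ordered(p, true)       # q is a new object; p is unchanged
```

Under these assumptions, code that loads data as unordered and fixes the level order afterwards has to rebind every reference to the array in the second design.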

I guess we could make it a type parameter if that's really useful, but we would have to drop the ordered! function...

> (It also does not work out because the number of classes is also not a type parameter of cat values/strings. From our point-of-view, the levels of a pool ought to be immutable, but I imagine there are other use cases that require mutability?)

I think this illustrates that it's not really possible to know all properties of a variable without having access to its data. You can read JuliaStats/DataArrays.jl#73 for context about the design of CategoricalArrays. One can always use an enum instead if the pool is known statically, but that implies recompiling functions for each pool, which isn't very practical.

> The practical downfall for us is that we cannot determine the scientific types of the columns of a Tables.jl table from schema(table).eltypes. Instead, we have to dig inside of the table to get actual elements to test.

Is that really a problem though? StatsModels does that in JuliaStats/StatsModels.jl#71, and that seems to be OK.

kleinschmidt · 2019-02-22T16:43:55Z

Just to add to what @nalimilan has already said, in working on StatsModels we found that we can't even rely on the metadata in the categorical pool, because users often want to fit models based on a subset of their data, which means that the pool will contain too many levels and we'll generate model matrices that are rank-deficient. One of the goals of Terms2.0 is to clearly separate the metadata/type information necessary to create stand-alone representations of data transformations (e.g., number and values of unique levels for categorical data) from the data source itself. I think asking the data store to contain all that information for you isn't going to work: it's brittle and it's always possible that new demands about what metadata is necessary will crop up. Better to explicitly recognize that there will always be a data cleaning/transformation/schema extraction step. Just my $0.02!

ablaom · 2019-02-24T21:12:54Z

While I stand by my objections, I can see that a change at this point is probably unrealistic. The annoyance for my use case is probably going to be small anyhow.

Thanks for the comments!

nalimilan · 2019-02-25T13:23:10Z

OK. Let us know about any problems though, that's always interesting.

Nosferican · 2019-02-26T19:42:57Z

I believe the difference between nominal and ordinal is persistent regardless of changing the pool, subset, and whatnot. In this case, I do not see any argument against having it part of the metadata. It is useful, cheap, and persistent.

nalimilan · 2019-02-27T16:47:59Z

I highlighted one argument above: you would need to create a new array if you want to make it ordered, e.g. after reading a CSV file.

Nosferican · 2019-02-27T18:52:18Z

I guess from a metadata perspective it is already embedded at `obj.pool.ordered`, no?

jtrakk · 2020-10-10T21:11:29Z

> you would need to create a new array if you want to make it ordered

Would it be possible to make the OrdinalArray or CyclicArray object simply wrap the CategoricalArray, so it could be an O(1) operation instead of O(n)?

nalimilan · 2020-10-12T21:01:42Z

Yes that would be very easy. It's just a bit inconvenient to have to replace an array instead of just changing its properties.

There's also the deeper problem that when concatenating two ordered arrays, you don't know without knowing their levels whether those levels are equal or at least have compatible orders. So if you make orderedness part of the type, you get a type instability: you may have to return either an ordered or an unordered array, unless you also put levels in the type, which is a no-go given the amount of recompilation it would trigger.
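The type instability can be illustrated with a toy type in plain Julia. `OrdVec` and `concat` are hypothetical names for this sketch, and the compatibility rule is deliberately simplified to "the level vectors are identical", which is an assumption and not the rule CategoricalArrays applies. The point is only that the type of the result depends on runtime data.

```julia
# Toy type for illustration only; not CategoricalArrays code.
struct OrdVec{Ordered}
    values::Vector{String}
    levels::Vector{String}
end

# Simplifying assumption: two ordered vectors are compatible only when
# their level vectors are identical. Otherwise the result is unordered.
function concat(a::OrdVec{true}, b::OrdVec{true})
    vals = vcat(a.values, b.values)
    if a.levels == b.levels
        return OrdVec{true}(vals, a.levels)
    else
        return OrdVec{false}(vals, union(a.levels, b.levels))
    end
end

x = OrdVec{true}(["low", "high"], ["low", "high"])
y = OrdVec{true}(["high"], ["low", "high"])
z = OrdVec{true}(["high", "low"], ["high", "low"])

same_order = concat(x, y)    # levels agree, result stays ordered
mixed_order = concat(x, z)   # levels disagree, result is unordered
```

Both calls take arguments of the same types, yet one returns an `OrdVec{true}` and the other an `OrdVec{false}`, so the return type cannot be pinned down from the argument types alone.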

ablaom closed this as completed Feb 24, 2019

ablaom mentioned this issue Jun 10, 2019

scitype JuliaAI/MLJ.jl#155

Closed

This was referenced Sep 18, 2019

automatic types suggestion, closes #2 JuliaAI/ScientificTypes.jl#4

Merged

scitype(X) is slow for large tables JuliaAI/ScientificTypes.jl#12

Closed

ablaom mentioned this issue Aug 6, 2020

Add support for "cyclic" categorical variables, such as month of year. #287

Open
