patsy questions/wishlist #93

Open

thequackdaddy opened this issue Sep 16, 2016 · 4 comments
@thequackdaddy (Contributor) commented Sep 16, 2016

Hello,

I fit some (relatively) large GLMs in statsmodels and have been experimenting with using patsy instead of a home-rolled approach. My home-rolled method isn't very good (I tend to underestimate challenges...). I've gotten better hardware, so some models that previously didn't work with patsy because of memory constraints now do. I've run across a few things that might make it easier for me to use patsy more, and I'm happy to work on PRs for them if there's interest.

  1. Categorical NA logic: Currently, it appears that when a categorical is fed through patsy, every individual value is checked against a rather detailed list of rules for handling NaNs/missing values/empties. I ran cProfile on this and it was quite slow. I think the bottleneck is here:
    https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L341
    I know that NaN/NA/None/empty is a mess in general, but it's a fact of life in my line of work (insurance modeling). I'm wondering if we could scope out exactly which scenarios we need to handle and use something (pandas, maybe?) to do this more elegantly. I'm not sure of the scope, as there are far more players here than me.
  2. Reading from on-disk data stores: Memory is something of a problem for me. A typical model that I run might eat up 10+ GB of RAM. It works, but is obviously not ideal. As far as (relatively) mature tools go, I've found bcolz's ctables to be pretty good (and fast). HDFStores/dask would be nice too. I'm not sure if xarray support for categorical data #91 relates to this (I don't know whether xarray works well as an on-disk storage/data tool).
  3. Partial predictions: This is partly a statsmodels idea and partly a patsy idea... I'd like the ability to do a so-called partial predict. Essentially I have a model like y ~ a + b + a:c, and I want to come up with predictions for y assuming that just a changes or just b changes. I think the process would look something like this (assuming we're changing only a): create a new design matrix with every unique value of a as a separate row, holding b and c constant at their most frequent (or some other innocuous) value, then feed that dataset through the statsmodels predict routine. (A rough sketch follows this list.) This is very helpful for GLMs with the log link, which are the bulk of what I work with.
  4. Categorical grouping: Suppose I have categories A, B, C, D, and E. The categories aren't really sortable in any logical way, but some could be grouped. Has there been any thought on how (or whether) to allow this?
  5. Weights: For methods like standardize, it may make sense to weight the observations. (Really only applicable if you have skewed data where certain values are more prevalent on highly weighted records.)
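
For item 3, a very rough sketch of what I mean (the column names and the "most frequent value" choice are just for illustration, not a proposed API):

import pandas as pd

def partial_prediction_frame(df, vary, hold):
    # one row per unique value of the column being varied...
    grid = pd.DataFrame({vary: sorted(df[vary].dropna().unique())})
    # ...with every other column pinned at its most frequent value
    for col in hold:
        grid[col] = df[col].mode().iloc[0]
    return grid

# grid = partial_prediction_frame(df, vary="a", hold=["b", "c"])
# preds = results.predict(grid)   # statsmodels re-applies the formula to new data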
@thequackdaddy thequackdaddy changed the title patsy question/wishlist patsy questions/wishlist Sep 17, 2016
@thequackdaddy (Contributor, Author) commented:

For item 1 here, #97 greatly improves categorical coding time... from tens of seconds on million-record datasets to only a few seconds.

NA/NaN/NaT/None and the like are a real nightmare...

@njsmith (Member) commented Oct 24, 2016

Reading from on-disk data stores: Memory is something of a problem for me. A typical model that I run might eat up 10+ GB of RAM

Are you aware that patsy has full support for building design matrices in bounded memory, by doing multiple passes over the data? You call incr_dbuilder or similar to scan the data to figure out which columns are numerical vs. categorical, the levels of categoricals, the mean and stddev if you're using standardize, etc. -- basically everything that requires looking at the whole data set -- and it gives you back a DesignInfo, and then you can pass that DesignInfo to dmatrix along with a horizontal slice of your data, and it gives you back a horizontal slice of the design matrix. It's maybe not the most convenient thing to use, but AFAIK it's everything you'd need to e.g. write dask.patsy.
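
Roughly like this (file name, chunk size, and formula are illustrative; the important part is that the data_iter_maker argument is a callable that returns a fresh iterator over chunks each time patsy needs a pass):

import pandas as pd
from patsy import incr_dbuilder, dmatrix

def data_chunks():
    # must return a *fresh* iterator each call, since patsy may need several passes
    return pd.read_csv("big_file.csv", chunksize=100000)

# pass(es) over the data to learn categorical levels, means/stddevs, etc.
design_info = incr_dbuilder("standardize(x) + C(state)", data_chunks)

# then build the design matrix one horizontal slice at a time
for chunk in data_chunks():
    X_chunk = dmatrix(design_info, chunk)
    # ...feed X_chunk to an incremental fitting routine...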

Partial predictions

Does DesignInfo.subset do what you want? (You still have to pick what values you want to predict at etc., I'm not sure how much patsy can really help there.)

Categorical grouping

I'm not sure what this means. Is this a standard technique? Any references to what the resulting design matrices would look like?

Weights: For methods like standardize, it may make sense to weight the observations

True! You can always define your own stateful transform but adding features like this to standardize seems handy.
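
E.g., something along these lines (just a skeleton of the documented memorize_chunk/memorize_finish/transform protocol, with the weighted bookkeeping left out):

import numpy as np
import patsy

class WeightedStandardize(object):
    # skeleton of a user-defined stateful transform
    def __init__(self):
        self.mean = None
        self.std = None
    def memorize_chunk(self, x, w):
        ...  # accumulate weighted sums from this chunk
    def memorize_finish(self):
        ...  # turn the accumulated sums into self.mean and self.std
    def transform(self, x, w):
        return (np.asarray(x, dtype=float) - self.mean) / self.std

weighted_standardize = patsy.stateful_transform(WeightedStandardize)
# usable in a formula as, e.g., "weighted_standardize(x, wt)"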

@thequackdaddy (Contributor, Author) commented:

Wow thanks for your answers.

Larger datasets

incr_dbuilder - I had no idea this existed. That is really close to what I need... I'll experiment with it.

Partial predictions

I hadn't known about that... and it may be useful. Does subset return a design matrix with the same number of columns? I think it would have to return the same number of columns, or statsmodels won't be able to predict. Essentially, for columns other than the ones selected, I'd want the resulting design matrix to have some constant (possibly 0) inserted.

What would also be helpful: if I do something like y ~ bs(Age, df=4) + C(Sex) + C(MaritalStatus) and I want to see how y changes over all the unique input values of Age, I could type partial_predict('Age') and it would realize that bs(Age, df=4) uses Age, and create a design matrix with all the unique values of Age for the 4 bs columns, but 0's for the rest. This could be fed into statsmodels to get a prediction for just Age.

Does that make sense?

Categorical grouping

So think of vehicle types. There are 4-door sedans, 2-door sedans, small SUVs, medium SUVs, large SUVs, convertibles, 1/4-ton regular cab pickups, 1/4-ton extended cab pickups, 1/2-ton extended cab pickups, 1/2-ton crew cab pickups, etc. (there are actually about 200 in the list we use).

What I'd like to do is conveniently group these things (and potentially have an "all else" bucket). Maybe I decide that I just want 2 groups - sedans vs. SUV/pickup... maybe I want 3 groups - sedans vs. SUV vs. pickup... or maybe 3 groups - sedans/SUV vs. 1/4-ton pickups vs. 1/2-ton pickups.

There's a huge number of potential groupings... and deciding which to use is honestly more gut than science. But that's the idea.

Weights

I presume with standardize you could add a new argument for a weight column? The fact that patsy can iterate over the data sort of confuses me. That code is a little over my head (and by a little, I mean a lot), so it's unclear to me how to do this. If it becomes a bigger issue, I'll give it more thought.

@njsmith (Member) commented Oct 25, 2016

Re: subset: it's a bit lower-level than what you're thinking of. What it does is let you pick out a set of terms in your model -- which correspond to a set of columns in the design matrix -- and then compute just those columns. And the crucial thing is that when you do this, you only have to supply values for the variables that are actually used in those columns. So, for this:

create a design matrix with all the unique values of Age for the 4 bs columns, but 0's for the rest

then you could do it by using subset to get a DesignInfo that corresponds to just the bs(Age, 4) term, and then make a design matrix for it by passing in the different values of Age, and then add zero columns back to pad out the resulting matrix to the original number of columns. This way we don't need to add some complicated interface inside patsy itself to say "I know it says C(Sex) for these columns but please ignore that and put zeros there instead". But there is some assembly required :-).
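
In code, very roughly (the term name string and the padding loop are illustrative; assume full_design_info is the DesignInfo from the original fit and ages holds the values you want to predict at):

import numpy as np
from patsy import build_design_matrices

# just the bs(Age, df=4) columns, evaluated at the Ages of interest
sub_info = full_design_info.subset(["bs(Age, df=4)"])
(X_sub,) = build_design_matrices([sub_info], {"Age": ages})

# pad back out to the full width, with zeros everywhere else
X_full = np.zeros((X_sub.shape[0], len(full_design_info.column_names)))
for name, col in zip(sub_info.column_names, np.asarray(X_sub).T):
    X_full[:, full_design_info.column_names.index(name)] = col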

Also, regarding:

realize that bs(Age, df=4) uses Age

this isn't something that patsy currently keeps track of -- but potentially it could I guess. It's a little tricky because from patsy's point of view, bs(Age, df=4) is just some arbitrary chunk of Python code, that happens to refer to Python variables like bs and Age. But EvalFactor actually already has to pull out what variables are used (see ast_names in eval.py), so we do have the pieces we'd need to find all the factors that mention Age, there's just no public API for it right now. Hmm.
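
(Something like this, though ast_names is an internal helper and not a stable public API:)

from patsy.eval import ast_names   # internal, not a public API

# yields the Python names referenced by a factor's code string
sorted(ast_names("bs(Age, df=4)"))   # ['Age', 'bs']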

Re: categorical grouping: Oh, I see, you just want a way to recode your categorical so that it merges some existing categories together? Sure, that makes sense (though patsy doesn't have anything for this right now). In the general case this is probably something that makes more sense to do in pandas, using all its tools, before entering patsy, since patsy's core competence isn't arbitrary data manipulation, but rather providing a terse mini-language for experimenting with models. (E.g., you probably shouldn't be listing 200 different categories inside your patsy formula string :-).) But if there's a nice notation for simple kinds of grouping that would fit into patsy, then that would make sense.

Remember also that you're free to define your own helper functions and use them in patsy without them having to be built into patsy, e.g., you can do stuff like:

def collapse_to_sedan_vs_pickup(categorical_series):
    ...  # map each detailed vehicle type onto a coarser group
    return recoded_series

dmatrix("bs(Age, 4) + C(collapse_to_sedan_vs_pickup(VehicleType), Poly) + x", data=df)

I presume with standardize you could make a new argument which would be a weight column? The fact that patsy can iterate over the data sort of confuses me.

Right, internally the way stateful transforms are implemented is confusing, because of the need to work with incr_dbuilders and friends :-/. But conceptually they're pretty straightforward: from the user's point of view, standardize(x) is just (x - np.mean(x)) / np.std(x). So it's easy to imagine adding a weights argument to that.

The complications you're seeing are that if we don't have all the data loaded in memory at once, then we can't just call np.mean(x) and np.std(x). Also, we want to compute the mean/std once on the data used for fitting, and then keep using the same values for predictions. So there's one piece of code that knows how to compute mean and std incrementally, using this algorithm, and we run that on the fitting data, and then there's a second piece of code that takes the computed mean and std and actually standardizes the data.

Adding weight support would require finding and implementing an algorithm for incrementally computing the weighted std -- but if you scroll down in that wikipedia link, then the next section is called "weighted incremental algorithm", so maybe this is not so hard ;-).
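
A rough sketch of that weighted incremental update (following the section it names, not patsy's actual code):

class WeightedRunningStats(object):
    # weighted incremental mean/variance (West-style update)
    def __init__(self):
        self.w_sum = 0.0   # total weight seen so far
        self.mean = 0.0
        self.m2 = 0.0      # weighted sum of squared deviations

    def update(self, x, w):
        self.w_sum += w
        delta = x - self.mean
        self.mean += (w / self.w_sum) * delta
        self.m2 += w * delta * (x - self.mean)

    def std(self):
        return (self.m2 / self.w_sum) ** 0.5   # "population"-style weighted std

The fitting pass would feed each chunk through update(), and the stored mean/std would then be reused unchanged when building design matrices for prediction.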
