ENH: Generate var_names from the data and partial predict #98

thequackdaddy · 2016-12-29T22:07:58Z

Hello,

I have a proposal that really came about because of the way I've been interacting with patsy.

My datasets are kind of long and kind of wide. I have lots of fields that I use for exploring stuff, but naturally many of them just don't work out.

I've been using bcolz because it stores the data in a columnar fashion, making horizontal slices really easy. Before, I'd been creating a list of variables that I wanted, defining all the transforms that I needed in patsy, and then feeding that through. I can't load the entire dataset into memory because it's too wide and too long, and I might only be looking at 20-30 columns for any one model.

So I propose having patsy attempt to figure out which columns it needs from the data using this new var_names method which is available on DesignInfo, EvalFactor, and Term. In a nutshell, it gets a list of all the variables used, checks if that variable is defined in the EvalEnvironment, and if not, assumes it must be data.
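A rough sketch of the idea, not the code in this PR: collect every name that a factor's Python code refers to, then drop the names the evaluation namespace already defines. The helper `data_names` and the plain-dict `namespace` are assumptions made for illustration; the PR itself works through patsy's EvalFactor and EvalEnvironment.

```python
import ast
import builtins
import math


def data_names(factor_code, namespace):
    """Return names used in factor_code that namespace (and builtins) do not define.

    Hypothetical helper: anything not resolvable in the namespace is
    assumed to be a column of the data.
    """
    tree = ast.parse(factor_code, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return {
        name
        for name in used
        if name not in namespace and not hasattr(builtins, name)
    }


# `math` and `scale` live in the namespace, so only the columns remain.
env = {"math": math, "scale": lambda v: v}
print(sorted(data_names("math.log(income) + scale(age)", env)))
```

One limitation of this approach: a data column that shares its name with something in the namespace would be treated as non-data.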

I've called this var_names for now, but arguably maybe non_eval_var_names might be more accurate? Open to suggestions here.

One nice thing is that when using incr_dbuilder, it can automatically slice on the columns which makes the construction much faster (for me at least).

Here's a gist demo'ing this.

https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d

Let me know what you think.

codecov-io · 2016-12-29T22:15:48Z

Codecov Report

Merging #98 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   98.96%   98.99%   +0.03%     
==========================================
  Files          30       30              
  Lines        5585     5760     +175     
  Branches      775      803      +28     
==========================================
+ Hits         5527     5702     +175     
  Misses         35       35              
  Partials       23       23

Impacted Files          Coverage Δ
patsy/user_util.py      100% <100%> (ø)
patsy/test_build.py     98.1% <100%> (+0.1%)
patsy/desc.py           98.42% <100%> (+0.07%)
patsy/design_info.py    99.68% <100%> (+0.06%)
patsy/build.py          99.62% <100%> (ø)
patsy/eval.py           99.16% <100%> (+0.04%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

thequackdaddy · 2017-03-04T23:06:35Z

I went ahead and built the partial function that I had alluded to in #93. This makes it much easier to create design matrices for statsmodels that show you the marginal differences when you only change the levels of 1 (or more) factors.

Here's a basic example:

In [1]: from patsy import dmatrix
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
   ...:                      'integer': [1, 3, 7, 2, 1],
   ...:                      'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
   ...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
   ...:  data)
   ...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
   ...:
Out[1]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
   ...:                         'integer': [1, 2, 3, 4]},
   ...:                        product=True)
Out[2]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.69314718,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.09861229,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.38629436,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.69314718,  0.69314718,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.09861229,  1.09861229,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.38629436,  1.38629436,
         0.        ,  0.        ,  0.        ,  0.        ]])

thequackdaddy · 2017-03-04T23:07:09Z

@njsmith Also, it appears that travis isn't kicking off for this all of a sudden. Any ideas why this would be?

I'm fairly certain this will pass. Here is the branch in my travis.

njsmith · 2017-03-05T20:15:08Z

It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because

An even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

class LazyData(dict):
    def __missing__(self, key):
        try:
            return bcolz.load(key, file)
        except BcolzKeyNotFound:
            raise KeyError(key)

Would this work for you?
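For reference, a self-contained sketch of the lazy dict-like idea that runs without bcolz. The `loader` callable and the in-memory `STORE` are hypothetical stand-ins for whatever actually reads a column from disk; the only contract assumed is that the loader raises KeyError for an unknown column.

```python
class LazyColumns(dict):
    """Dict that fetches a column the first time it is requested."""

    def __init__(self, loader):
        super().__init__()
        self._loader = loader

    def __missing__(self, key):
        # The loader is expected to raise KeyError for unknown columns.
        value = self._loader(key)
        self[key] = value  # cache so each column is loaded only once
        return value


# Hypothetical column store standing in for an on-disk table.
STORE = {"age": [31, 45], "income": [10.0, 20.0], "notes": ["x", "y"]}
loaded = []


def load_column(name):
    column = STORE[name]  # raises KeyError if the column does not exist
    loaded.append(name)
    return column


data = LazyColumns(load_column)
data["age"]
data["age"]  # served from the cache, no second load
print(loaded)
```

Only the columns that are actually indexed get loaded, so the caller never needs a list of variable names up front.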

Is the partial part somehow tied to the var_names part? They look like separate changes to me, so should be in separate PRs?

This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

thequackdaddy · 2017-03-05T21:28:15Z

> It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because

Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is the var_names method on the EvalFactor class, which uses the ast_names function to look at all the objects needed to evaluate the factor. This is in turn used by the Term class (which is in turn used by the DesignInfo class). ModelDesc has a list of terms (lhs_termlist and rhs_termlist), so adding this would be easy.
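A minimal sketch of the "query everything, let the caller filter" variant, under stated assumptions: terms are modelled here as plain tuples of factor code strings (an empty tuple standing for the intercept), which is a simplification and not patsy's ModelDesc or Term classes. The helpers `names_in` and `all_names` are hypothetical.

```python
import ast


def names_in(code):
    """All bare names referenced by one factor's Python code."""
    tree = ast.parse(code, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def all_names(termlists):
    """Union of names over every factor of every term in every term list."""
    found = set()
    for termlist in termlists:
        for term in termlist:
            for factor_code in term:
                found |= names_in(factor_code)
    return found


# Stand-ins for lhs_termlist and rhs_termlist.
lhs = [("y",)]
rhs = [(), ("a",), ("a", "np.log(x)")]

names = all_names([lhs, rhs])        # includes the module name "np"
available = {"y", "a", "x", "unused"}
needed = names & available           # the caller drops what is not data
print(sorted(needed))
```

Here the function reports every name, module and function names included, and the intersection with the available columns is left to the caller.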

I presume you're implying that I shouldn't be worrying about the EvalEnvironment variable and should just return every dependent object, function and module alike? I was trying to return only "data"-ish things. Simply removing them from the output set manually seems easy enough...

> An even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

This is really clever, thanks! I'll try it. However, I don't think it solves the partial issue below.

> Is the partial part somehow tied to the var_names part? They look like separate changes to me, so should be in separate PRs?

Yes. partial looks at each Term's var_names and decides whether the Term needs the variable or not. If yes, it pulls that Term using subset to create the design matrix only for that subset of columns using the variables specified. Otherwise, it returns columns full of zeros. The end result is a design matrix of the same width and column alignment as the model's DesignMatrix, but only with as many rows as needed to evaluate the partial predictions and the rest of the columns as zeros.
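The mechanism described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: each term is a hypothetical dict giving the variables it needs, its column count, and a row-building function, and the fill rule (a term is evaluated only when it needs at least one variable and all of them were supplied) is inferred from the example outputs earlier in this thread.

```python
import math
from itertools import product as cartesian

# Hypothetical design: intercept, cat, log(x), cat:log(x), z (7 columns).
TERMS = [
    {"vars": set(), "width": 1, "build": lambda r: [1.0]},
    {"vars": {"cat"}, "width": 2,
     "build": lambda r: [float(r["cat"] == "b"), float(r["cat"] == "c")]},
    {"vars": {"x"}, "width": 1, "build": lambda r: [math.log(r["x"])]},
    {"vars": {"cat", "x"}, "width": 2,
     "build": lambda r: [float(r["cat"] == "b") * math.log(r["x"]),
                         float(r["cat"] == "c") * math.log(r["x"])]},
    {"vars": {"z"}, "width": 1, "build": lambda r: [float(r["z"])]},
]


def partial_matrix(terms, columns, product=False):
    """Full-width rows where only terms covered by `columns` are evaluated."""
    names = list(columns)
    values = [columns[name] for name in names]
    # product=True crosses the supplied levels; otherwise rows are zipped.
    combos = cartesian(*values) if product else zip(*values)
    supplied = set(names)
    matrix = []
    for combo in combos:
        row = dict(zip(names, combo))
        out = []
        for term in terms:
            if term["vars"] and term["vars"] <= supplied:
                out.extend(term["build"](row))
            else:
                out.extend([0.0] * term["width"])  # term not requested
        matrix.append(out)
    return matrix


for line in partial_matrix(TERMS, {"cat": ["a", "b", "c"]}):
    print(line)
```

Because every term keeps its original column positions, the result can be multiplied by the fitted coefficients directly to get the contribution of just the supplied variables.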

> This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

Sounds good. Writing tests is not something I've excelled at. This is somewhat tested and I think there is coverage for most of the new lines, though I likely missed a few. I added some asserts for the new functionality to some of the existing tests.

thequackdaddy force-pushed the varnames branch 6 times, most recently from b0dc258 to 460a6f9 Compare March 4, 2017 22:58

thequackdaddy changed the title ~~ENH: Generate var_names from the data~~ ENH: Generate var_names from the data and partial predict Mar 4, 2017

thequackdaddy force-pushed the varnames branch from 19ad339 to e63da78 Compare February 27, 2018 15:47

josef-pkt mentioned this pull request Oct 31, 2018

ENH/Design: statsmodels equivalent of design_info statsmodels/statsmodels#5342

Open

thequackdaddy added 4 commits November 3, 2018 12:51

ENH: Support for var_names which are missing from environment

470c997

DOC: Fixes

2e94be5

Added partial function

5f662a9

Added logic to handle modules and user-defined functions

807cc93

thequackdaddy force-pushed the varnames branch from e63da78 to 807cc93 Compare November 3, 2018 17:52

Use sum instead of np.sum on a generator

ac612d0

thequackdaddy force-pushed the varnames branch 2 times, most recently from a79c5c8 to 050c220 Compare November 3, 2018 23:47

Improve test coverage

544effd

thequackdaddy force-pushed the varnames branch from 050c220 to 544effd Compare November 4, 2018 02:16

matthewwardrop force-pushed the master branch 2 times, most recently from b07ba3f to 48fd2e4 Compare September 5, 2021 04:56
