
W features to ortho IV #295

Closed
federiconuta opened this issue Oct 25, 2020 · 10 comments
federiconuta commented Oct 25, 2020

Hi all,

I am writing with reference to issue #292 to kindly ask whether you are planning to extend the code with control variables (W). Our team is currently working on an interesting and novel application to the pharmaceutical market that requires the estimation of heterogeneous effects. Since @kbattocchi mentioned that the improvement should be straightforward, we hope it can be done in a reasonable time. Could you please let us know your timeline for this? Our project would strongly benefit from the availability of your code.

Thank you again for the great work you are doing and for helping us.


federiconuta commented Oct 26, 2020

I don't know if I am correct, but is it about modifying the following:

        """
        Estimate the counterfactual model from data, i.e. estimates function :math:`\\theta(\\cdot)`.

        Parameters
        ----------
        Y: (n, d_y) matrix or vector of length n
            Outcomes for each sample
        T: (n, d_t) matrix or vector of length n
            Treatments for each sample
        Z: (n, d_z) matrix
            Instruments for each sample
        X: optional(n, d_x) matrix or None (Default=None)
            Features for each sample
        sample_weight: optional(n,) vector or None (Default=None)
            Weights for each sample
        sample_var: optional(n,) vector or None (Default=None)
            Sample variance for each sample
        inference: string,:class:`.Inference` instance, or None
            Method for performing inference.  This estimator supports 'bootstrap'
            (or an instance of :class:`.BootstrapInference`).

        Returns
        -------
        self: _BaseDMLIV
        """
        # Replacing fit from _OrthoLearner, to enforce W=None and improve the docstring
        return super().fit(Y, T, X=X, Z=Z,
                           sample_weight=sample_weight, sample_var=sample_var, inference=inference)

with W=W also? Or maybe X and W should be concatenated together...
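For context, a minimal sketch of what forwarding W might look like. The base class here is a stand-in that just records its arguments, not the actual `_OrthoLearner` implementation:

```python
class _OrthoLearnerStub:
    """Stand-in for _OrthoLearner: records the arguments it receives."""

    def fit(self, Y, T, *, X=None, W=None, Z=None, **kwargs):
        self.fit_args_ = dict(Y=Y, T=T, X=X, W=W, Z=Z, **kwargs)
        return self


class _BaseDMLIV(_OrthoLearnerStub):
    def fit(self, Y, T, *, Z, X=None, W=None,
            sample_weight=None, sample_var=None, inference=None):
        # Instead of enforcing W=None, forward W to the base learner.
        return super().fit(Y, T, X=X, W=W, Z=Z,
                           sample_weight=sample_weight,
                           sample_var=sample_var,
                           inference=inference)


est = _BaseDMLIV().fit([1.0], [0.0], Z=[[1.0]], X=[[2.0]], W=[[3.0]])
```

This only illustrates the plumbing; the real change also has to thread W through the nuisance-model stages.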

kbattocchi (Collaborator) commented

Thanks for the suggestion. As it turns out we've been discussing several improvements to the Ortho IV module to make it more consistent with the DML module, and this will be one of the things that we'll address.

Just to give you a heads-up, though, it is likely that we will make a few breaking changes to how some of the subclasses are named, to make the class names more consistent with DML as well.

federiconuta commented

@kbattocchi thank you for the answer. Could you please share your timeline for these modifications? Sorry to push you on this, but, as I mentioned, it would be a great step forward if we could apply your algorithm with W in ortho_iv (DMLIV).

Thanks again!

federiconuta commented

@kbattocchi Hi,

I am trying to modify the code myself, though of course I am not sure I am doing the right thing.
Since this might be a simple task for you, perhaps we can find the solution together quickly.

I have modified the dml_iv code as follows, but I need to allow for multiple treatments:

    def fit_new(self, y, T, X, W, Z, store_final=False):
        """
        Parameters
        ----------
        y : outcome
        T : treatment (single dimensional)
        X : features/controls
        Z : instrument (single dimensional)
        store_final (bool) : whether to store the estimated nuisance values for
            fitting a different final stage model without the need of refitting
            the nuisance values. Increases memory usage. 
        """
        #if len(T.shape) > 1 and T.shape[1] > 1:
        #raise AssertionError(
        #        "Can only accept single dimensional treatment")
        if len(y.shape) > 1 and y.shape[1] > 1:
            raise AssertionError("Can only accept single dimensional outcome")
        if len(Z.shape) == 1:
            Z = Z.reshape(-1, 1)
        if (Z.shape[1] > 1) and self.binary_instrument:
            raise AssertionError(
                "Binary instrument flag is True, but instrument is multi-dimensional")

        #T = T.flatten()
        y = y.flatten()

        n_samples = y.shape[0]
        proj_t = np.zeros(n_samples)
        pred_t = np.zeros(n_samples)
        res_y = np.zeros(n_samples)
        XW = hstack([X, W]) if W is not None else X

        if self.n_splits == 1:
            splits = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]
        # TODO. Deal with multi-class instrument/treatment
        elif self.binary_instrument or self.binary_treatment:
            group = 2*T*self.binary_treatment + Z.flatten()*self.binary_instrument
            splits = StratifiedKFold(
                n_splits=self.n_splits, shuffle=True).split(XW, group)
        else:
            splits = KFold(n_splits=self.n_splits, shuffle=True).split(XW)

        for idx, (train, test) in enumerate(splits):
            # Estimate h(Z, X) = E[T | Z, X] in cross-fitting manner
            proj_t[test] = self.model_T_XZ[idx].fit(hstack([XW[train], Z[train]]),
                                                    T[train]).predict(hstack([XW[test],
                                                                              Z[test]]))
            # Estimate residual Y_res = Y - q(X) = Y - E[Y | X] in cross-fitting manner
            res_y[test] = y[test] - \
                self.model_Y_X[idx].fit(XW[train], y[train]).predict(XW[test])
            # Estimate p(X) = E[T | X] in cross-fitting manner
            pred_t[test] = self.model_T_X[idx].fit(
                XW[train], T[train]).predict(XW[test])

        # Estimate theta by minimizing square loss (Y_res - theta(X) * (h(Z, X) - p(X)))^2
        self.model_effect.fit(res_y, (proj_t-pred_t).reshape(-1, 1), XW)

        if store_final:
            self.stored_final_data = True
            self.X = X
            self.res_t = (proj_t-pred_t).reshape(-1, 1)
            self.res_y = res_y

        return self

    def effect(self, X):
        """
        Parameters
        ----------
        X : features
        """
        return self.model_effect.predict(X)

I think that it has to do with modifying:

proj_t = np.zeros(n_samples)
pred_t = np.zeros(n_samples)

but I am not sure what this implies (if anything) for the rest of the code. What do you think?
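One plausible change, sketched below, is an assumption about what multi-treatment support would require, not something verified against the prototype: the nuisance containers and the treatment residual would become 2-D arrays.

```python
import numpy as np

# With T of shape (n, d_t), the cross-fitted nuisance predictions and the
# treatment residual become 2-D, so the final reshape(-1, 1) is no longer
# needed before fitting the effect model.
n_samples, d_t = 100, 2
proj_t = np.zeros((n_samples, d_t))   # cross-fitted E[T | Z, X]
pred_t = np.zeros((n_samples, d_t))   # cross-fitted E[T | X]
res_t = proj_t - pred_t               # (n, d_t) residual for the final stage
```

The final-stage model (`model_effect` here) would also have to accept a multivariate treatment residual for this to work end to end.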

vsyrgkanis (Collaborator) commented

Are you using the prototype?

If yes, then you can easily emulate the X and W behavior. Use all the variables as X, and pass as the final model a model wrapped by the SubsetWrapper from the prototype utilities, which will fit a model only on a subset of X. This way you get the same behavior as if you had specified X and W separately, with X being the subset of variables you specify in the wrapper.

I believe one of the Jupyter notebooks in the prototype has an example of how to use the SubsetWrapper.
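To make the idea concrete, here is a minimal re-implementation of what such a wrapper could look like. The name SubsetWrapper comes from the prototype, but this sklearn-style signature and implementation are illustrative assumptions, not the prototype's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


class SubsetWrapper:
    """Fit the wrapped model using only the given column indices of X."""

    def __init__(self, model, feature_inds):
        self.model = model
        self.feature_inds = np.asarray(feature_inds)

    def fit(self, X, y, **kwargs):
        self.model.fit(np.asarray(X)[:, self.feature_inds], y, **kwargs)
        return self

    def predict(self, X):
        return self.model.predict(np.asarray(X)[:, self.feature_inds])


# Columns 1 and 3 play the role of "X"; the other columns behave like "W":
# they are passed in but ignored by the final model.
rng = np.random.RandomState(0)
X_all = rng.normal(size=(200, 5))
y = 2 * X_all[:, 1] + 3 * X_all[:, 3]
wrapped = SubsetWrapper(LinearRegression(), [1, 3]).fit(X_all, y)
```

The prototype's wrapper presumably matches the final-stage model interface of DMLIV rather than the plain sklearn `fit(X, y)` shown here.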


federiconuta commented

Are you using the prototype?

@vasilismsr @vsyrgkanis yes, I am. Specifically, I cloned EconML/prototypes. However, another problem is that my T variable (the treatment) has 2 columns, so I have multiple treatments. Specifically, my y variable is (n_obs, 1), T is (n_obs, 2), Z is (n_obs, n_z) (though for the moment I believe I could use a one-dimensional instrument), X is (n_obs, d_x), and W is (n_obs, d_w). Therefore I was thinking of doing something along these lines:

# Folds adopted
fold0 = np.empty((0,))
for i in np.arange(n_products//2):
    fold0 = np.concatenate((fold0, np.arange(i*n_months, i*n_months + n_months//2)))
for i in np.arange(n_products//2, n_products):
    fold0 = np.concatenate((fold0, np.arange(i*n_months + n_months//2, (i+1)*n_months)))
fold1 = np.setdiff1d(np.arange(n_products*n_months), fold0).astype(int)
fold0 = fold0.astype(int)

fold00 = np.empty((0,))
for i in np.arange(n_products//4):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2), i*(n_months//2) + n_months//4)))
for i in np.arange(n_products//4, n_products//2):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2) + (n_months//4), (i+1)*(n_months//2))))
for i in np.arange(n_products//2, 3*n_products//4):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2), i*(n_months//2) + n_months//4)))
for i in np.arange(3*n_products//4, n_products):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2) + (n_months//4), (i+1)*(n_months//2))))
fold11 = np.setdiff1d(np.arange((n_products)*(n_months//2)), fold00).astype(int)
fold00 = fold00.astype(int)


model = lambda: LassoCV(cv=[(fold00, fold11), (fold11, fold00)]) 
model_t = lambda: MultiTaskLassoCV(cv=[(fold00, fold11), (fold11, fold00)]) 
model_clf = lambda: RegWrapper(LassoCV(cv=5, n_jobs=-1)) # to review
model_Y_X = lambda: model() # model for E[Y | X]
model_T_X = lambda: model_t() # model for E[T | X]. 
model_Z_X = lambda: model_clf() # model for E[Z | X]. 
model_T_XZ = lambda: model()
dmliv_featurizer = lambda: PolynomialFeatures(degree=1, include_bias=True)
dmliv_model_effect = lambda: LinearRegression(fit_intercept=False)

heteff = DMLIV(model_Y_X(), model_T_X(), model_T_XZ(), 
             dmliv_model_effect(), dmliv_featurizer(),
             n_splits=N_SPLITS, # number of splits to use for cross-fitting
             #binary_instrument=True, # a flag whether to stratify cross-fitting by instrument
             #binary_treatment=True # a flag whether to stratify cross-fitting by treatment
            )

and then call the effects as:

heteff.fit_new(Y, T, X, W, Z, inference = 'bootstrap')
X_group = np.eye(n_products-1)
heteff.effect(X_group)

Now, as far as I have understood from the Jupyter notebook, treating W as a subset of X should be handled with something like:

subset_names = set(['col_1', 'col_23']) # example: these are the columns I am interested in
# list of indices of features X to use in the final model
feature_inds = np.argwhere([(x in subset_names) for x in X_data.columns.values]).flatten()

heff_dmliv_model_effect = lambda: SubsetWrapper(StatsModelLinearRegression(),
                                          feature_inds # list of indices of features X to use in the final model
                                         )
heff.refit_final(heff_dmliv_model_effect())

From which I can call directly the effects as:

X_group = np.eye(n_products-1)
driv_cate = dr_cate.effect(X[:, feature_inds])

right?
What about the possibility of using a multi-dimensional T?

Thank you a lot :)

vasilismsr (Contributor) commented

The prototype, I believe, only accepts a univariate T (the same will hold for the mainstream implementation of DRIV).

You could try running it one treatment at a time, controlling for the other treatment as a control variable.
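That workaround could be sketched as follows. Both `est_factory` and the `fit(y, t, X, Z)` signature are hypothetical placeholders for illustration, not the prototype's actual API:

```python
import numpy as np


def fit_one_by_one(est_factory, y, T, X, Z):
    """For each treatment column, fit a univariate-treatment estimator,
    folding the remaining treatment column(s) into the controls/features."""
    T = np.asarray(T)
    fitted = []
    for j in range(T.shape[1]):
        other = np.delete(T, j, axis=1)            # treatments not being estimated
        X_aug = np.hstack([np.asarray(X), other])  # control for them via X
        est = est_factory()
        est.fit(y, T[:, j], X_aug, Z)
        fitted.append(est)
    return fitted
```

Each returned estimator then gives the effect of one treatment, holding the others fixed as controls.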

federiconuta commented

@vasilismsr thanks a lot!

kbattocchi (Collaborator) commented

This was addressed by #460, included in our v0.12 release.
