
W features to ortho IV #295

Closed
federiconuta opened this issue Oct 25, 2020 · 10 comments
federiconuta commented Oct 25, 2020

Hi all,

I am writing with reference to issue #292 to kindly ask whether you are planning to extend the code with control variables (W). Our team is currently working on an interesting and novel application to the pharmaceutical market that requires the estimation of heterogeneous effects. Since @kbattocchi mentioned that the improvement should be straightforward, we hope it can be done in a reasonable time. Could you please let us know your timeline for this? Our project would strongly benefit from the availability of your code.

Thank you again for the great work you are doing and for helping us.


federiconuta commented Oct 26, 2020

I don't know if I am correct, but is it about modifying the following:

        """
        Estimate the counterfactual model from data, i.e. estimates function :math:`\\theta(\\cdot)`.

        Parameters
        ----------
        Y: (n, d_y) matrix or vector of length n
            Outcomes for each sample
        T: (n, d_t) matrix or vector of length n
            Treatments for each sample
        Z: (n, d_z) matrix
            Instruments for each sample
        X: optional(n, d_x) matrix or None (Default=None)
            Features for each sample
        sample_weight: optional(n,) vector or None (Default=None)
            Weights for each sample
        sample_var: optional(n,) vector or None (Default=None)
            Sample variance for each sample
        inference: string,:class:`.Inference` instance, or None
            Method for performing inference.  This estimator supports 'bootstrap'
            (or an instance of :class:`.BootstrapInference`).

        Returns
        -------
        self: _BaseDMLIV
        """
        # Replacing fit from _OrthoLearner, to enforce W=None and improve the docstring
        return super().fit(Y, T, X=X, Z=Z,
                           sample_weight=sample_weight, sample_var=sample_var, inference=inference)

with W=W also? Or maybe X and W should be concatenated together...
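For context, a minimal sketch of what forwarding W might look like. The base class here is a stand-in that just records its arguments, not the actual `_OrthoLearner` implementation:

```python
class _OrthoLearnerStub:
    """Stand-in for _OrthoLearner: records the arguments it receives."""

    def fit(self, Y, T, *, X=None, W=None, Z=None, **kwargs):
        self.fit_args_ = dict(Y=Y, T=T, X=X, W=W, Z=Z, **kwargs)
        return self


class _BaseDMLIV(_OrthoLearnerStub):
    def fit(self, Y, T, *, Z, X=None, W=None,
            sample_weight=None, sample_var=None, inference=None):
        # Instead of enforcing W=None, forward W to the base learner.
        return super().fit(Y, T, X=X, W=W, Z=Z,
                           sample_weight=sample_weight,
                           sample_var=sample_var,
                           inference=inference)


est = _BaseDMLIV().fit([1.0], [0.0], Z=[[1.0]], X=[[2.0]], W=[[3.0]])
```

This only illustrates the plumbing; the real change also has to thread W through the nuisance-model stages.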

kbattocchi (Collaborator) commented

Thanks for the suggestion. As it turns out we've been discussing several improvements to the Ortho IV module to make it more consistent with the DML module, and this will be one of the things that we'll address.

Just to give you a heads-up, though, it is likely that we will make a few breaking changes to how some of the subclasses are named, to make the class names more consistent with DML as well.

federiconuta commented

@kbattocchi thank you for the answer. Could you please share your timeline for these modifications? Sorry to push you on this, but, as I mentioned, it would be a great step forward if we could apply your algorithm with W in ortho_iv (DMLIV).

Thanks again!

federiconuta commented

@kbattocchi Hi,

I am trying to modify the code myself, though of course I am not sure I am doing the right thing.
Since this might be a simple task for you, perhaps we can find the solution together quickly.

I have modified the dml_iv code as follows, but I need to allow for multiple treatments:

    def fit_new(self, y, T, X, W, Z, store_final=False):
        """
        Parameters
        ----------
        y : outcome
        T : treatment (single dimensional)
        X : features/controls
        Z : instrument (single dimensional)
        store_final (bool) : whether to store the estimated nuisance values for
            fitting a different final stage model without the need of refitting
            the nuisance values. Increases memory usage. 
        """
        #if len(T.shape) > 1 and T.shape[1] > 1:
        #raise AssertionError(
        #        "Can only accept single dimensional treatment")
        if len(y.shape) > 1 and y.shape[1] > 1:
            raise AssertionError("Can only accept single dimensional outcome")
        if len(Z.shape) == 1:
            Z = Z.reshape(-1, 1)
        if (Z.shape[1] > 1) and self.binary_instrument:
            raise AssertionError(
                "Binary instrument flag is True, but instrument is multi-dimensional")

        #T = T.flatten()
        y = y.flatten()

        n_samples = y.shape[0]
        proj_t = np.zeros(n_samples)
        pred_t = np.zeros(n_samples)
        res_y = np.zeros(n_samples)
        XW = hstack([X, W]) if W is not None else X

        if self.n_splits == 1:
            splits = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]
        # TODO. Deal with multi-class instrument/treatment
        elif self.binary_instrument or self.binary_treatment:
            group = 2*T*self.binary_treatment + Z.flatten()*self.binary_instrument
            splits = StratifiedKFold(
                n_splits=self.n_splits, shuffle=True).split(XW, group)
        else:
            splits = KFold(n_splits=self.n_splits, shuffle=True).split(XW)

        for idx, (train, test) in enumerate(splits):
            # Estimate h(Z, X) = E[T | Z, X] in cross-fitting manner
            proj_t[test] = self.model_T_XZ[idx].fit(hstack([XW[train], Z[train]]),
                                                    T[train]).predict(hstack([XW[test],
                                                                              Z[test]]))
            # Estimate residual Y_res = Y - q(X) = Y - E[Y | X] in cross-fitting manner
            res_y[test] = y[test] - \
                self.model_Y_X[idx].fit(XW[train], y[train]).predict(XW[test])
            # Estimate p(X) = E[T | X] in cross-fitting manner
            pred_t[test] = self.model_T_X[idx].fit(
                XW[train], T[train]).predict(XW[test])

        # Estimate theta by minimizing square loss (Y_res - theta(X) * (h(Z, X) - p(X)))^2
        self.model_effect.fit(res_y, (proj_t-pred_t).reshape(-1, 1), XW)

        if store_final:
            self.stored_final_data = True
            self.X = X
            self.res_t = (proj_t-pred_t).reshape(-1, 1)
            self.res_y = res_y

        return self

    def effect(self, X):
        """
        Parameters
        ----------
        X : features
        """
        return self.model_effect.predict(X)

I think that it has to do with modifying:

proj_t = np.zeros(n_samples)
pred_t = np.zeros(n_samples)

but I am not sure what this implies (if anything) for the rest of the code. What do you think?
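One plausible change, sketched below, is an assumption about what multi-treatment support would require, not something verified against the prototype: the nuisance containers and the treatment residual would become 2-D arrays.

```python
import numpy as np

# With T of shape (n, d_t), the cross-fitted nuisance predictions and the
# treatment residual become 2-D, so the final reshape(-1, 1) is no longer
# needed before fitting the effect model.
n_samples, d_t = 100, 2
proj_t = np.zeros((n_samples, d_t))   # cross-fitted E[T | Z, X]
pred_t = np.zeros((n_samples, d_t))   # cross-fitted E[T | X]
res_t = proj_t - pred_t               # (n, d_t) residual for the final stage
```

The final-stage model (`model_effect` here) would also have to accept a multivariate treatment residual for this to work end to end.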

vsyrgkanis (Collaborator) commented

Are you using the prototype?

If yes, then you can easily emulate the X and W behavior. Use all the variables as X, and pass as the final model a model wrapped by the SubsetWrapper from the prototype utilities, which will fit a model only on a subset of X. This way you get the same behavior as if you had specified X and W separately, with X being the subset of variables you specify in the wrapper.

I believe one of the Jupyter notebooks in the prototype has an example of how to use the SubsetWrapper.
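To make the idea concrete, here is a minimal re-implementation of what such a wrapper could look like. The name SubsetWrapper comes from the prototype, but this sklearn-style signature and implementation are illustrative assumptions, not the prototype's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


class SubsetWrapper:
    """Fit the wrapped model using only the given column indices of X."""

    def __init__(self, model, feature_inds):
        self.model = model
        self.feature_inds = np.asarray(feature_inds)

    def fit(self, X, y, **kwargs):
        self.model.fit(np.asarray(X)[:, self.feature_inds], y, **kwargs)
        return self

    def predict(self, X):
        return self.model.predict(np.asarray(X)[:, self.feature_inds])


# Columns 1 and 3 play the role of "X"; the other columns behave like "W":
# they are passed in but ignored by the final model.
rng = np.random.RandomState(0)
X_all = rng.normal(size=(200, 5))
y = 2 * X_all[:, 1] + 3 * X_all[:, 3]
wrapped = SubsetWrapper(LinearRegression(), [1, 3]).fit(X_all, y)
```

The prototype's wrapper presumably matches the final-stage model interface of DMLIV rather than the plain sklearn `fit(X, y)` shown here.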


federiconuta commented

Are you using the prototype?

@vasilismsr @vsyrgkanis yes, I am. Specifically, I cloned EconML/prototypes. However, another problem is that my T variable (the treatment) has 2 columns, so I have multiple treatments. Specifically, my y variable is (n_obs, 1), T is (n_obs, 2), Z is (n_obs, n_z) (though for the moment I believe I could use a one-dimensional instrument), X is (n_obs, d_x), and W is (n_obs, d_w). Therefore I was thinking of doing something along these lines:

# Folds adopted
fold0 = np.empty((0,))
for i in np.arange(n_products//2):
    fold0 = np.concatenate((fold0, np.arange(i*n_months, i*n_months + n_months//2)))
for i in np.arange(n_products//2, n_products):
    fold0 = np.concatenate((fold0, np.arange(i*n_months + n_months//2, (i+1)*n_months)))
fold1 = np.setdiff1d(np.arange(n_products*n_months), fold0).astype(int)
fold0 = fold0.astype(int)

fold00 = np.empty((0,))
for i in np.arange(n_products//4):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2), i*(n_months//2) + n_months//4)))
for i in np.arange(n_products//4, n_products//2):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2) + (n_months//4), (i+1)*(n_months//2))))
for i in np.arange(n_products//2, 3*n_products//4):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2), i*(n_months//2) + n_months//4)))
for i in np.arange(3*n_products//4, n_products):
    fold00 = np.concatenate((fold00, np.arange(i*(n_months//2) + (n_months//4), (i+1)*(n_months//2))))
fold11 = np.setdiff1d(np.arange((n_products)*(n_months//2)), fold00).astype(int)
fold00 = fold00.astype(int)


model = lambda: LassoCV(cv=[(fold00, fold11), (fold11, fold00)]) 
model_t = lambda: MultiTaskLassoCV(cv=[(fold00, fold11), (fold11, fold00)]) 
model_clf = lambda: RegWrapper(LassoCV(cv=5, n_jobs=-1)) # to review
model_Y_X = lambda: model() # model for E[Y | X]
model_T_X = lambda: model_t() # model for E[T | X]. 
model_Z_X = lambda: model_clf() # model for E[Z | X]. 
model_T_XZ = lambda: model()
dmliv_featurizer = lambda: PolynomialFeatures(degree=1, include_bias=True)
dmliv_model_effect = lambda: LinearRegression(fit_intercept=False)

heteff = DMLIV(model_Y_X(), model_T_X(), model_T_XZ(), 
             dmliv_model_effect(), dmliv_featurizer(),
             n_splits=N_SPLITS, # number of splits to use for cross-fitting
             #binary_instrument=True, # a flag whether to stratify cross-fitting by instrument
             #binary_treatment=True # a flag whether to stratify cross-fitting by treatment
            )

and then call the effects as:

heteff.fit_new(Y, T, X, W, Z, inference = 'bootstrap')
X_group = np.eye(n_products-1)
heteff.effect(X_group)

Now, as far as I have understood from the Jupyter notebook, treating W as a subset of X should be handled with something like:

subset_names = set(['col_1', 'col_23']) # example: these are the columns I am interested in
# list of indices of features X to use in the final model
feature_inds = np.argwhere([(x in subset_names) for x in X_data.columns.values]).flatten()

heff_dmliv_model_effect = lambda: SubsetWrapper(StatsModelLinearRegression(),
                                          feature_inds # list of indices of features X to use in the final model
                                         )
heff.refit_final(heff_dmliv_model_effect())

From which I can call directly the effects as:

X_group = np.eye(n_products-1)
driv_cate = dr_cate.effect(X[:, feature_inds])

right?
What about the possibility of using a multi-dimensional T?

Thank you a lot :)

vasilismsr (Contributor) commented

The prototype, I believe, only accepts a univariate T (the same will hold for the mainstream implementation of DRIV).

You could try running it one treatment at a time, controlling for the other treatment as a control variable.
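That workaround could be sketched as follows. Both `est_factory` and the `fit(y, t, X, Z)` signature are hypothetical placeholders for illustration, not the prototype's actual API:

```python
import numpy as np


def fit_one_by_one(est_factory, y, T, X, Z):
    """For each treatment column, fit a univariate-treatment estimator,
    folding the remaining treatment column(s) into the controls/features."""
    T = np.asarray(T)
    fitted = []
    for j in range(T.shape[1]):
        other = np.delete(T, j, axis=1)            # treatments not being estimated
        X_aug = np.hstack([np.asarray(X), other])  # control for them via X
        est = est_factory()
        est.fit(y, T[:, j], X_aug, Z)
        fitted.append(est)
    return fitted
```

Each returned estimator then gives the effect of one treatment, holding the others fixed as controls.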

federiconuta commented

@vasilismsr thanks a lot!

kbattocchi (Collaborator) commented

This was addressed by #460, included in our v0.12 release.
