Indexing coming from output of the stratified shuffle split error when fitting CBPE #79

SoyGema · 2022-05-23T19:07:41Z

Hello there! I'm having an issue fitting CBPE with reference data. The traceback seems to point at the calibration step; more precisely, as far as I can tell, at the indexing done with the output of the stratified shuffle split. It might be related to my reference partition.

nannyml version: 0.4.1
Python version: 3.8
Operating System: Mac

Description

Fit CBPE Estimator on reference partition data

5946 rows
31 columns
Includes predicted value, predicted_proba, real value as customer column
Dates from 2020-12-01 to 2020-12-22 (dataset sorted by date)

What I Did

cbpe_auto = nml.CBPE(model_metadata=metadata, metrics=['roc_auc', 'f1'])
cbpe_auto.fit(reference_data=reference)

KeyError                                  Traceback (most recent call last)
Input In [10], in <cell line: 3>()
      1 cbpe_auto = nml.CBPE(model_metadata=metadata,metrics=['roc_auc', 'f1'] )
----> 3 cbpe_auto.fit(reference_data=reference)
      4 est_perf_auto = cbpe_auto.estimate(pd.concat([reference, analysis], ignore_index=True))
      5 fig = est_perf_auto.plot(kind='performance')

File ~/Documents/virtualenv/lib/python3.8/site-packages/nannyml/performance_estimation/confidence_based/_cbpe_binary_classification.py:145, in _BinaryClassificationCBPE.fit(self, reference_data)
    142 self.minimum_chunk_size = _minimum_chunk_size(reference_data)
    144 # Fit calibrator if calibration is needed
--> 145 self.needs_calibration = needs_calibration(
    146     y_true=reference_data[NML_METADATA_TARGET_COLUMN_NAME],
    147     y_pred_proba=reference_data[NML_METADATA_PREDICTED_PROBABILITY_COLUMN_NAME],
    148     calibrator=self.calibrator,
    149 )
    151 if self.needs_calibration:
    152     self.calibrator.fit(
    153         reference_data[NML_METADATA_PREDICTED_PROBABILITY_COLUMN_NAME],
    154         reference_data[NML_METADATA_TARGET_COLUMN_NAME],
    155     )

File ~/Documents/virtualenv/lib/python3.8/site-packages/nannyml/calibration.py:289, in needs_calibration(y_true, y_pred_proba, calibrator, bin_count, split_count)
    287     y_pred_proba_test, y_true_test = y_pred_proba.iloc[test, :], y_true[test]
    288 else:
--> 289     y_pred_proba_train, y_true_train = y_pred_proba[train], y_true[train]
    290     y_pred_proba_test, y_true_test = y_pred_proba[test], y_true[test]
    292 calibrator.fit(y_pred_proba_train, y_true_train)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/series.py:984, in Series.__getitem__(self, key)
    981     key = np.asarray(key, dtype=bool)
    982     return self._get_values(key)
--> 984 return self._get_with(key)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/series.py:1019, in Series._get_with(self, key)
   1015 if key_type == "integer":
   1016     # We need to decide whether to treat this as a positional indexer
   1017     #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1018     if not self.index._should_fallback_to_positional:
-> 1019         return self.loc[key]
   1020     else:
   1021         return self.iloc[key]

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.__getitem__(self, key)
    964 axis = self.axis or 0
    966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1191, in _LocIndexer._getitem_axis(self, key, axis)
   1188     if hasattr(key, "ndim") and key.ndim > 1:
   1189         raise ValueError("Cannot index with multidimensional key")
-> 1191     return self._getitem_iterable(key, axis=axis)
   1193 # nested tuple slicing
   1194 if is_nested_tuple(key, labels):

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1132, in _LocIndexer._getitem_iterable(self, key, axis)
   1129 self._validate_key(key, axis)
   1131 # A collection of keys
-> 1132 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1133 return self.obj._reindex_with_indexers(
   1134     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1135 )

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1327, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1324 ax = self.obj._get_axis(axis)
   1325 axis_name = self.obj._get_axis_name(axis)
-> 1327 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1329 return keyarr, indexer

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexes/base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
   5779 else:
   5780     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
   5784 keyarr = self.take(indexer)
   5785 if isinstance(key, Index):
   5786     # GH 42790 - Preserve name from an Index

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexes/base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
   5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 5845 raise KeyError(f"{not_found} not in index")

KeyError: '[1979, 3272, 3695, 4953, 5721, 250, 764, 4733, 3654, 4011, 399, 3728, 3953, 1555, 1737, 3210, 1660, 2394, 5412, 2083, 3145, 2147, 1194, 2166, 2573, 1483, 2835, 4503, 133, 2982, 3703, 4963, 1014, 2208, 5166, 1765, 624] not in index'

Please let me know if I should try something on my reference partition.
Any idea what I should do?
Thanks in advance for the help! :)


nnansters · 2022-05-24T13:53:08Z

Hi @SoyGema ,

I found what's been happening here.

Reproducing the problem

The main issue is what the StratifiedShuffleSplit.split method returns: arrays of positional indices for the train and test folds. These indices are based on the size of the data passed to it, i.e. the returned indices are within the range [0, len(y_true) - 1].

These positions do not necessarily correspond to the index labels used within the DataFrame, which are not guaranteed to fall within [0, len(y_true) - 1]. This can happen when using a slice of the reference data set. I'll illustrate using the built-in binary classification data set.

# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])

# This will reproduce the issue
estimator.fit(reference.loc[40000:, :])

The sliced reference DataFrame only contains index labels in the range [40000, 49999], but the stratified shuffle split returns positions in the range [0, 9999], which are then looked up as labels, resulting in the error you see.
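The mismatch can also be reproduced without NannyML. The following is only a minimal sketch on made-up toy data (the series names and values are assumptions chosen for illustration, not the built-in data set): it runs scikit-learn's StratifiedShuffleSplit on a Series whose index labels start at 40000 and shows that the returned positions cannot be used as labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for a sliced reference set: 10 rows, index labels 40000..40009.
y_true = pd.Series([0, 1] * 5, index=range(40000, 40010))

# Only y is needed to build the folds, so a placeholder array is passed as X.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train, test = next(splitter.split(np.zeros(len(y_true)), y_true))

# The splitter returns positions in [0, len(y_true) - 1], not index labels.
positions_in_range = bool(train.max() < len(y_true) and test.max() < len(y_true))

# Plain [] indexing on an integer-indexed Series is label-based,
# so looking up these positions as labels raises a KeyError.
try:
    y_true[train]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(positions_in_range, label_lookup_failed)
```

None of the positions 0..9 exists as a label in 40000..40009, so the lookup fails in the same way as in the traceback above.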

Remediating the problem

You can work around this yourself by resetting the DataFrame indices before passing it to NannyML.

# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])

# This will remediate the issue
estimator.fit(reference.loc[40000:, :].reset_index(drop=True))
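To find out whether a given reference partition is affected, one option is to check whether its index is still the default 0..n-1 range. The sketch below uses a hypothetical helper, has_default_index, which is not part of NannyML, on a small made-up DataFrame.

```python
import pandas as pd


def has_default_index(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True when the index labels are exactly 0..len(df)-1, in order."""
    return df.index.equals(pd.RangeIndex(len(df)))


full = pd.DataFrame({"y_true": [0, 1, 0, 1, 0, 1]})
sliced = full.loc[3:, :]                # labels 3, 4, 5 survive the slice
fixed = sliced.reset_index(drop=True)   # labels become 0, 1, 2

print(has_default_index(full), has_default_index(sliced), has_default_index(fixed))
```

Slicing, filtering, or sorting a DataFrame all keep the original labels, so any of them can leave the index in a state where this check returns False.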

Fixing the problem

We'll perform the index reset before running the stratified shuffle split in the CBPE.fit method.
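As a rough illustration of that idea only, and not the actual NannyML change, a helper that resets the index before applying the fold positions could look like this. The function name select_fold and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def select_fold(y_true: pd.Series, y_pred_proba: pd.Series, positions: np.ndarray):
    """Hypothetical helper: select fold rows after resetting both indices to 0..n-1."""
    # After the reset, index labels and positions coincide, so [] indexing is safe.
    y_true = y_true.reset_index(drop=True)
    y_pred_proba = y_pred_proba.reset_index(drop=True)
    return y_pred_proba[positions], y_true[positions]


# Toy data with index labels 40000..40004, as in a sliced reference set.
labels = range(40000, 40005)
y_true = pd.Series([0, 1, 0, 1, 1], index=labels)
y_pred_proba = pd.Series([0.1, 0.8, 0.3, 0.9, 0.7], index=labels)

# Positions such as those a splitter would return for a train fold.
train = np.array([4, 0, 2])
proba_train, true_train = select_fold(y_true, y_pred_proba, train)
print(true_train.tolist(), proba_train.tolist())
```

Selecting with .iloc instead of resetting the index would be an alternative that leaves the original labels untouched.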


nnansters · 2022-05-24T15:08:49Z

@SoyGema, can you confirm that the workaround or the fix solves your issue?

SoyGema · 2022-05-24T15:17:57Z

It worked.
Thanks @nnansters

nnansters self-assigned this May 23, 2022

nnansters added the triage (Needs to be assessed) label May 23, 2022

nnansters added a commit that referenced this issue May 24, 2022: Deal with index misalignment when using stratified shuffle split for calibration (#79) (a449c49)

nnansters added the bug (Something isn't working) and question (Further information is requested) labels and removed the triage (Needs to be assessed) label May 24, 2022

nnansters closed this as completed May 25, 2022
