Indexing coming from output of the stratified shuffle split error when fitting CBPE #79

Closed
SoyGema opened this issue May 23, 2022 · 3 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@SoyGema
Contributor

SoyGema commented May 23, 2022

Hello there! I'm having an issue fitting CBPE with reference data. The traceback points to the calibration step and, as far as I can understand, more precisely to indexing with the output of the stratified shuffle split. It might be related to my reference partition.

  • nannyml version: 0.4.1
  • Python version: 3.8
  • Operating System: Mac

Description

Fit CBPE Estimator on reference partition data

  • 5946 rows

  • 31 columns

  • Includes predicted value, predicted_proba, real value as customer column

  • Dates from 2020-12-01 to 2020-12-22 (dataset sorted by date)

What I Did

cbpe_auto = nml.CBPE(model_metadata=metadata,metrics=['roc_auc', 'f1'] )
cbpe_auto.fit(reference_data=reference)
KeyError                                  Traceback (most recent call last)
Input In [10], in <cell line: 3>()
      1 cbpe_auto = nml.CBPE(model_metadata=metadata,metrics=['roc_auc', 'f1'] )
----> 3 cbpe_auto.fit(reference_data=reference)
      4 est_perf_auto = cbpe_auto.estimate(pd.concat([reference, analysis], ignore_index=True))
      5 fig = est_perf_auto.plot(kind='performance')

File ~/Documents/virtualenv/lib/python3.8/site-packages/nannyml/performance_estimation/confidence_based/_cbpe_binary_classification.py:145, in _BinaryClassificationCBPE.fit(self, reference_data)
    142 self.minimum_chunk_size = _minimum_chunk_size(reference_data)
    144 # Fit calibrator if calibration is needed
--> 145 self.needs_calibration = needs_calibration(
    146     y_true=reference_data[NML_METADATA_TARGET_COLUMN_NAME],
    147     y_pred_proba=reference_data[NML_METADATA_PREDICTED_PROBABILITY_COLUMN_NAME],
    148     calibrator=self.calibrator,
    149 )
    151 if self.needs_calibration:
    152     self.calibrator.fit(
    153         reference_data[NML_METADATA_PREDICTED_PROBABILITY_COLUMN_NAME],
    154         reference_data[NML_METADATA_TARGET_COLUMN_NAME],
    155     )

File ~/Documents/virtualenv/lib/python3.8/site-packages/nannyml/calibration.py:289, in needs_calibration(y_true, y_pred_proba, calibrator, bin_count, split_count)
    287     y_pred_proba_test, y_true_test = y_pred_proba.iloc[test, :], y_true[test]
    288 else:
--> 289     y_pred_proba_train, y_true_train = y_pred_proba[train], y_true[train]
    290     y_pred_proba_test, y_true_test = y_pred_proba[test], y_true[test]
    292 calibrator.fit(y_pred_proba_train, y_true_train)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/series.py:984, in Series.__getitem__(self, key)
    981     key = np.asarray(key, dtype=bool)
    982     return self._get_values(key)
--> 984 return self._get_with(key)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/series.py:1019, in Series._get_with(self, key)
   1015 if key_type == "integer":
   1016     # We need to decide whether to treat this as a positional indexer
   1017     #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1018     if not self.index._should_fallback_to_positional:
-> 1019         return self.loc[key]
   1020     else:
   1021         return self.iloc[key]

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.__getitem__(self, key)
    964 axis = self.axis or 0
    966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1191, in _LocIndexer._getitem_axis(self, key, axis)
   1188     if hasattr(key, "ndim") and key.ndim > 1:
   1189         raise ValueError("Cannot index with multidimensional key")
-> 1191     return self._getitem_iterable(key, axis=axis)
   1193 # nested tuple slicing
   1194 if is_nested_tuple(key, labels):

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1132, in _LocIndexer._getitem_iterable(self, key, axis)
   1129 self._validate_key(key, axis)
   1131 # A collection of keys
-> 1132 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1133 return self.obj._reindex_with_indexers(
   1134     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1135 )

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexing.py:1327, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1324 ax = self.obj._get_axis(axis)
   1325 axis_name = self.obj._get_axis_name(axis)
-> 1327 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1329 return keyarr, indexer

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexes/base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
   5779 else:
   5780     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
   5784 keyarr = self.take(indexer)
   5785 if isinstance(key, Index):
   5786     # GH 42790 - Preserve name from an Index

File ~/Documents/virtualenv/lib/python3.8/site-packages/pandas/core/indexes/base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
   5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 5845 raise KeyError(f"{not_found} not in index")

KeyError: '[1979, 3272, 3695, 4953, 5721, 250, 764, 4733, 3654, 4011, 399, 3728, 3953, 1555, 1737, 3210, 1660, 2394, 5412, 2083, 3145, 2147, 1194, 2166, 2573, 1483, 2835, 4503, 133, 2982, 3703, 4963, 1014, 2208, 5166, 1765, 624] not in index'

Please let me know if I should try something on my reference partition.
Any idea what I should do?
Thanks in advance for the help! :)

@nnansters self-assigned this May 23, 2022
@nnansters added the triage (Needs to be assessed) label May 23, 2022
@nnansters
Contributor

Hi @SoyGema ,

I found what's been happening here.

Reproducing the problem

The main issue is what the StratifiedShuffleSplit.split method returns: arrays of indices for the train and test folds. These indices are positions based on the size of the data passed to it, i.e. they lie within the range [0, len(y_true) - 1].

These positions do not necessarily correspond to the index labels of the DataFrame, which are not guaranteed to lie within that range. Because the index is integer-valued, pandas treats y_true[train] as a label-based lookup (the self.loc[key] branch in the traceback), so any position that is not also a label raises the KeyError. This can happen when using a slice of the reference data set. I'll illustrate using the built-in binary classification data set.

# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])

# This will reproduce the issue
estimator.fit(reference.loc[40000:, :])

The sliced reference DataFrame only contains index labels in the range [40000, 49999], but the stratified shuffle split returns positions in the range [0, 9999], resulting in the error you see.
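The mismatch can also be reproduced without NannyML. The following is a minimal sketch, assuming synthetic data in place of the built-in data set: the 100 rows labelled 1000 to 1099, the alternating targets and the split parameters are illustrative choices, not values taken from NannyML. It only relies on pandas and scikit-learn's StratifiedShuffleSplit.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for a sliced reference set: 100 rows labelled 1000..1099.
index = pd.RangeIndex(1000, 1100)
y_true = pd.Series(np.tile([0, 1], 50), index=index)
y_pred_proba = pd.Series(np.linspace(0.0, 1.0, 100), index=index)

# Illustrative split settings (assumed, not NannyML's defaults).
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train, test = next(splitter.split(y_pred_proba, y_true))

# The split returns positions in [0, 99], not the index labels 1000..1099.
positions_in_range = bool(
    min(train.min(), test.min()) >= 0 and max(train.max(), test.max()) <= 99
)

# Plain [] indexing with an integer array is label-based on an integer index,
# so the positions are looked up as labels and cannot be found.
try:
    y_true[train]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(positions_in_range, raised_key_error)
```

In this sketch every position is missing from the labels; in the original report only some were missing, which suggests an index with gaps rather than a fully shifted one, but the failure mode is the same.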

Remediating the problem

You can work around this yourself by resetting the DataFrame index before passing the data to NannyML.

# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])

# This will remediate the issue
estimator.fit(reference.loc[40000:, :].reset_index(drop=True))
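To check whether a given reference set needs this workaround, something like the sketch below may help. The helper name has_positional_index is hypothetical (it is not part of NannyML or pandas), and the tiny DataFrame is made up for illustration.

```python
import pandas as pd


def has_positional_index(df: pd.DataFrame) -> bool:
    """Return True when the index labels are exactly 0..len(df)-1, in order."""
    return df.index.equals(pd.RangeIndex(len(df)))


# Made-up data: ten rows with the default index 0..9.
full = pd.DataFrame({"y_true": [0, 1] * 5})

# Slicing by label keeps the original labels 6..9.
sliced = full.loc[6:, :]

# Resetting the index relabels the rows 0..3, dropping the old labels.
fixed = sliced.reset_index(drop=True)

print(has_positional_index(full))    # labels match positions
print(has_positional_index(sliced))  # labels start at 6
print(has_positional_index(fixed))   # labels match positions again
```

Passing drop=True matters here: without it the old index would be kept as an extra column in the data.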

Fixing the problem

We'll perform the index reset before running the stratified shuffle split in the CBPE.fit method.
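As a rough sketch of that idea (not the actual NannyML implementation), the reset could happen right before the split. The helper stratified_folds, its parameters and the split settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def stratified_folds(y_true: pd.Series, y_pred_proba: pd.Series, split_count: int = 3):
    """Hypothetical helper: yield ((proba_train, true_train), (proba_test, true_test))."""
    # Reset both indices so labels coincide with the positions the splitter returns.
    y_true = y_true.reset_index(drop=True)
    y_pred_proba = y_pred_proba.reset_index(drop=True)

    # Illustrative split settings (assumed, not NannyML's defaults).
    splitter = StratifiedShuffleSplit(n_splits=split_count, test_size=0.1, random_state=42)
    for train, test in splitter.split(y_pred_proba, y_true):
        yield (y_pred_proba[train], y_true[train]), (y_pred_proba[test], y_true[test])


# Synthetic stand-in for a sliced reference set: labels 40000..40099.
index = pd.RangeIndex(40000, 40100)
y_true = pd.Series(np.tile([0, 1], 50), index=index)
y_pred_proba = pd.Series(np.linspace(0.0, 1.0, 100), index=index)

folds = list(stratified_folds(y_true, y_pred_proba))
print(len(folds))
```

An alternative with the same effect would be to leave the index alone and select rows with .iloc, which is always positional.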

@nnansters added the bug (Something isn't working) and question (Further information is requested) labels and removed the triage (Needs to be assessed) label May 24, 2022
@nnansters
Contributor

@SoyGema, can you confirm that the workaround or the fix solves your issue?

@SoyGema
Contributor Author

SoyGema commented May 24, 2022

It worked.
Thanks @nnansters
