Hello there! I'm having an issue fitting CBPE with reference data. The backtrace seems to be related to the calibration step; more precisely, as far as I can tell, to indexing with the output of the stratified shuffle split. It might be related to my reference partition.
nannyml version: 0.4.1
Python version: 3.8
Operating System: Mac
Description
Fit CBPE Estimator on reference partition data
5946 rows
31 columns
Includes predicted value, predicted_proba, real value as customer column
Dates from 2020-12-01 to 2020-12-22 (dataset sorted by date)
The main issue is the return value of the StratifiedShuffleSplit.split method. It returns arrays of indices for the train and test folds. These indices are positions based on the size of the data passed to it, i.e. the returned indices are within the range [0, len(y_true) - 1].
These positions do not necessarily correspond to the index labels of the DataFrame, which are not guaranteed to fall within [0, len(y_true) - 1]. This can occur when using a slice of the reference data set. I'll illustrate using the built-in binary classification data set.
# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])
# This will reproduce the issue
estimator.fit(reference.loc[40000:, :])
The reference DataFrame only contains indices in range [40000, 49999], but the stratified shuffle split results will try to use indexes in the range [0, 9999], resulting in the error you see.
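The mismatch can be reproduced without NannyML. The following is a minimal sketch on a toy DataFrame (the column names and sizes are made up for illustration, and it assumes a recent pandas and scikit-learn); it only shows that the split returns positions, so a label-based lookup on a sliced frame fails while a positional one works.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the reference data: 100 rows with a default index 0..99.
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "y_pred_proba": rng.random(100),
    "y_true": np.tile([0, 1], 50),
})

# Slicing by label keeps the original labels: the index is now 60..99.
sliced = reference.loc[60:, :]

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_pos, test_pos = next(splitter.split(sliced, sliced["y_true"]))

# The split returns positions in [0, len(sliced) - 1], not index labels.
print(train_pos.min(), train_pos.max(), sliced.index.min(), sliced.index.max())

# Positional lookup works on the sliced frame.
train_ok = sliced.iloc[train_pos]

# Label-based lookup fails: none of the labels 0..39 exist in the sliced index.
try:
    sliced.loc[train_pos]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print("label-based lookup failed:", label_lookup_failed)
```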
Remediating the problem
You can work around this yourself by resetting the DataFrame index before passing it to NannyML.
# After doing the regular data setup (not gonna go into that here)
estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size, metrics=['roc_auc', 'f1'])
# This will remediate the issue
estimator.fit(reference.loc[40000:, :].reset_index(drop=True))
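For anyone unsure what reset_index(drop=True) does to the data, here is a small pandas-only sketch on a made-up four-row frame. It shows why drop=True matters: without it, the old labels come back as an extra column, which is probably not something you want to pass along with your features.

```python
import pandas as pd

# Toy frame whose labels start at 40000, like a slice of a larger data set.
sliced = pd.DataFrame({"y_true": [0, 1, 0, 1]}, index=[40000, 40001, 40002, 40003])

# drop=True discards the old labels and installs a fresh 0..n-1 index.
reset = sliced.reset_index(drop=True)

# Without drop=True the old labels are kept as an extra column named "index".
reset_keep = sliced.reset_index()

# reset_index returns a new DataFrame; the original is left untouched.
print(list(reset.index))         # [0, 1, 2, 3]
print(list(reset_keep.columns))  # ['index', 'y_true']
print(list(sliced.index))        # [40000, 40001, 40002, 40003]
```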
Fixing the problem
We'll perform the index reset before running the stratified shuffle split in the CBPE.fit method.
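Roughly, the idea looks like the sketch below. This is not NannyML's actual code: split_for_calibration is a hypothetical helper, and its parameters and defaults are assumptions made for illustration. It resets the index first, so the positions returned by the split are also valid labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def split_for_calibration(y_pred_proba: pd.Series, y_true: pd.Series,
                          test_size: float = 0.2, random_state: int = 42):
    """Hypothetical helper (not part of NannyML): stratified train/test split
    that is safe for inputs with an arbitrary index."""
    # Drop the caller's labels so that labels and positions coincide.
    y_pred_proba = y_pred_proba.reset_index(drop=True)
    y_true = y_true.reset_index(drop=True)

    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=random_state)
    train_idx, test_idx = next(splitter.split(y_pred_proba.to_frame(), y_true))

    # After the reset, label-based lookups with these indices are safe.
    return (y_pred_proba.loc[train_idx], y_true.loc[train_idx],
            y_pred_proba.loc[test_idx], y_true.loc[test_idx])


# Usage on toy data whose labels start at 40000.
proba = pd.Series(np.linspace(0, 1, 50), index=range(40000, 40050))
labels = pd.Series([0, 1] * 25, index=range(40000, 40050))
p_train, y_train, p_test, y_test = split_for_calibration(proba, labels)
print(len(p_train), len(p_test))
```

Using .iloc on the original data instead of resetting the index would be another way to get the same result; the reset is shown here only because it matches the fix described above.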
What I Did
Please let me know if I should try something on my reference partition data.
Any idea what I should do?
Thanks in advance for the help! :)