eval does not share encoding transformers #250

bvanbreugel · 2024-01-26T11:12:29Z

Description

In metrics/eval.py, each dataset (e.g. X_gt, X_syn) is encoded separately, so a separate sklearn.preprocessing.LabelEncoder is fitted for each one. This gives unexpected behaviour whenever the unique values of a column differ between X_gt and X_syn: the same integer code then denotes different categories in the two datasets.

How to Reproduce

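import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Each call fits a fresh encoder, so the codes are not comparable across datasets
df_real = LabelEncoder().fit_transform(pd.DataFrame(["0", "1", "2"])[0])
>>> [0, 1, 2]
df_syn = LabelEncoder().fit_transform(pd.DataFrame(["1", "2", "2"])[0])
>>> [0, 1, 1]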

Expected Behavior

In the example above, the processed df_syn should be [1, 2, 2], so that each code refers to the same category as in df_real.

Fix

It seems we can just get the fitted encoders when calling X_gt.encode() and pass them to all other encode calls.
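
A minimal sketch of the idea, using plain scikit-learn rather than the library's own encode() method (whose signature is not shown here): fit one LabelEncoder on the real column and reuse it for the synthetic column. The variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative columns matching the reproduction above
real = pd.Series(["0", "1", "2"])
syn = pd.Series(["1", "2", "2"])

# Fit the encoder once on the real data, then reuse it everywhere
encoder = LabelEncoder().fit(real)
real_encoded = encoder.transform(real)
syn_encoded = encoder.transform(syn)

print(real_encoded.tolist())  # [0, 1, 2]
print(syn_encoded.tolist())   # [1, 2, 2]
```

Note that LabelEncoder.transform raises a ValueError for labels it did not see during fit, so a shared encoder would also need a policy for categories that appear only in the synthetic data.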

bvanbreugel mentioned this issue Jan 26, 2024

reuse encoders for ordinal variables [type:bug] #251

Closed

4 tasks

bvanbreugel mentioned this issue Feb 22, 2024

Metrics: ensure ordinal encoder of classes is the same in real and synthetic datasets [type:bug] #257

Merged

4 tasks

robsdavis closed this as completed in #257 Mar 11, 2024
