Problem Description
The Detection metrics use machine learning to determine whether real data can be distinguished from synthetic data. For this to work, the metrics should only use columns that are statistically modeled.
Expected behavior
When running any of the detection metrics, the following columns should be ignored:
- Primary keys
- Foreign keys (Edit: Foreign keys do not need to be considered because Detection metrics are only implemented at the single-table level.)
- Any other kinds of IDs
- PII or sensitive data
- Text data (or data created by RegEx)
None of these columns provide any useful information for detection.
The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), and boolean.
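A rough sketch of the filtering described above, not the actual implementation. It assumes a single-table metadata dict with a `primary_key` entry and a `columns` mapping where each column has an `sdtype` and an optional `pii` flag. The names `MODELED_SDTYPES` and `select_modeled_columns` are hypothetical and only used for illustration.

```python
# Hypothetical helper for illustration; not part of the library's API.
# Assumed metadata layout:
#   {"primary_key": <name>, "columns": {<name>: {"sdtype": ..., "pii": ...}}}

# The data types listed above as statistically modeled.
MODELED_SDTYPES = {"numerical", "datetime", "categorical", "boolean"}


def select_modeled_columns(metadata):
    """Return the names of the columns a detection metric should keep."""
    primary_key = metadata.get("primary_key")
    keep = []
    for name, info in metadata.get("columns", {}).items():
        if name == primary_key:
            # Primary keys are dropped regardless of their declared type.
            continue
        if info.get("pii", False):
            # PII or sensitive columns are dropped even if categorical.
            continue
        if info.get("sdtype") in MODELED_SDTYPES:
            # IDs, text, and RegEx-generated columns fall through here.
            keep.append(name)
    return keep


metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
        "signup_date": {"sdtype": "datetime"},
        "plan": {"sdtype": "categorical"},
        "is_active": {"sdtype": "boolean"},
        "email": {"sdtype": "email", "pii": True},
        "notes": {"sdtype": "text"},
    },
}

print(select_modeled_columns(metadata))
# ['age', 'signup_date', 'plan', 'is_active']
```

The same list could then be used to subset both the real and synthetic tables (for example `real_data[columns]` and `synthetic_data[columns]`) before they are passed to the classifier.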
Additional context
We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.