Detection metrics should only use statistically modeled columns (filter out the rest) #286

@npatki

Problem Description

The Detection metrics use machine learning to determine whether real data can be distinguished from synthetic data. For this to work, we should only be using columns that are statistically modeled.

Expected behavior

When running any of the detection metrics, the following columns should be ignored:

  • Primary keys
  • Foreign keys (Edit: Foreign keys do not need to be considered because Detection metrics are only implemented at the single-table level.)
  • Any other kinds of IDs
  • PII or sensitive data
  • Text data (or data created by RegEx)

None of these columns provide any useful information for detection.

The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), and boolean.
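The filtering described above could be sketched roughly as follows. This is only an illustration, not the library's implementation: the helper name `select_modeled_columns`, the metadata layout (a dict with `primary_key` and a `columns` mapping that carries an `sdtype` and an optional `pii` flag per column), and the example table are all assumptions made for this sketch.

```python
import pandas as pd

# Assumption: these type names mirror the "statistically modeled" list above.
MODELED_SDTYPES = {"numerical", "datetime", "categorical", "boolean"}


def select_modeled_columns(data, metadata):
    """Return a copy of `data` restricted to statistically modeled columns.

    Hypothetical helper. `metadata` is assumed to look like:
    {"primary_key": "id", "columns": {"col": {"sdtype": "...", "pii": bool}}}
    """
    primary_key = metadata.get("primary_key")
    keep = [
        name
        for name, info in metadata.get("columns", {}).items()
        if name in data.columns                      # skip columns missing from the table
        and name != primary_key                      # drop the primary key
        and info.get("sdtype") in MODELED_SDTYPES    # drop ids, text, regex-generated data, etc.
        and not info.get("pii", False)               # drop anything flagged as PII
    ]
    return data[keep].copy()


# Made-up example table and metadata, for illustration only.
real = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
    "age": [31, 45, 27],
    "plan": ["free", "pro", "free"],
    "active": [True, False, True],
    "notes": ["hello", "n/a", "call back"],
})
metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id"},
        "email": {"sdtype": "categorical", "pii": True},
        "age": {"sdtype": "numerical"},
        "plan": {"sdtype": "categorical"},
        "active": {"sdtype": "boolean"},
        "notes": {"sdtype": "text"},
    },
}

filtered = select_modeled_columns(real, metadata)
print(list(filtered.columns))
```

The same filter would be applied to both the real and the synthetic table before they are passed to a detection metric, so that the classifier only sees columns that were actually modeled.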

Additional context

We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.
