
Diagnostic Report not 100% (Data Structure Score <100%) #2375

Open
markuoh opened this issue Feb 19, 2025 · 6 comments
Labels
bug Something isn't working under discussion Issue is currently being discussed

Comments

@markuoh

markuoh commented Feb 19, 2025

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.18.0
  • Python version: 3.9
  • Operating System: MacOS Sequoia 15.3

Error Description

I'm writing my thesis on synthetic data generation for fraud detection systems. I'm using Kaggle's credit card fraud detection dataset (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and SDV to evaluate the quality of the generated synthetic data. However, when I execute "run_diagnostic" I get 100% on Data Validity but only 93.75% on Data Structure. I don't know whether the dtypes of the values inside the cells are relevant (e.g. a float/integer mismatch), but the two datasets have the same column names, which (per the official SDV docs) is the only factor I could find that affects the Data Structure score.

Steps to reproduce

I put together a modular workflow in Python consisting of:

  0. Analyze the original dataset
  1. Pre-process the original dataset
  2. Analyze the pre-processed dataset
  3. Train a CTGAN and sample data
  4. Post-process the synthetic dataset
  5. Analyze the post-processed dataset
  6. Evaluate metrics

I did not put the code in a Git repo, so I don't know the most convenient way to share the .py files I'm using for each phase.
Just let me know how I can help.

@markuoh markuoh added bug Something isn't working new Automatic label applied to new issues labels Feb 19, 2025
@srinify
Contributor

srinify commented Feb 19, 2025

Hi there @markuoh, do you mind also sharing the details from your DiagnosticReport object, specifically for 'Data Structure'?

diagnostic_report.get_details(property_name='Data Structure')

If you'd like to share the code and it's spread out over multiple files, you could create GitHub gists and link to them: https://gist.github.com/

I will try to replicate without your code though.

@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Feb 19, 2025
@srinify srinify changed the title [run_diagnostics] Data Structure Score <100% Diagnostic Report not 100% (Data Structure Score <100%) Feb 19, 2025
@markuoh
Author

markuoh commented Feb 19, 2025

        Metric      Score
0  TableStructure  0.9375

That's the only output I get when running the get_details method.
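For what it's worth, a score of 0.9375 is consistent with a single dtype mismatch among the dataset's 31 columns, if the metric is computed as a Jaccard similarity over (column name, dtype) pairs. That computation is an assumption about the SDMetrics internals, not something confirmed in this thread, but the arithmetic lines up:

```python
# Assumption (not confirmed here): the Data Structure score is a Jaccard
# similarity over (column name, dtype) pairs between real and synthetic data.
n_columns = 31   # Kaggle creditcardfraud: Time, V1..V28, Amount, Class
mismatched = 1   # one column whose dtype differs (e.g. int64 vs float64)

shared = n_columns - mismatched   # pairs present in both tables
union = n_columns + mismatched    # each mismatched column contributes two distinct pairs
score = shared / union
print(score)  # 0.9375
```

Under that assumption, 30 shared pairs out of 32 total gives exactly the 93.75% the report shows.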

@npatki
Contributor

npatki commented Feb 20, 2025

Hi there @markuoh and @srinify, chiming in here.

As @markuoh mentions, the TableStructure metric looks for matches in both the column names and the dtypes, to ensure that the output synthetic data has exactly the same structure as the input real data. So if the column names are the same, that would indeed imply that the dtypes are causing the problem.
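This check can be approximated with pandas alone (a minimal sketch; the exact SDMetrics internals may differ, and the function name here is made up for illustration):

```python
import pandas as pd

def table_structure_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Approximate a TableStructure-style score as the Jaccard similarity
    of (column name, dtype) pairs between the two tables."""
    real_pairs = {(col, str(dtype)) for col, dtype in real.dtypes.items()}
    synth_pairs = {(col, str(dtype)) for col, dtype in synthetic.dtypes.items()}
    return len(real_pairs & synth_pairs) / len(real_pairs | synth_pairs)

# Toy example: same column names, but 'Class' drifts from int to float.
real = pd.DataFrame({"Amount": [1.0, 2.5], "Class": [0, 1]})
synthetic = pd.DataFrame({"Amount": [1.2, 3.1], "Class": [0.0, 1.0]})

print(table_structure_score(real, synthetic))  # 1 shared pair out of 3 -> ~0.333
```

Identical column names with one dtype mismatch is enough to pull the score below 1.0, which matches the symptom described above.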

From looking at your code, I think you are applying your own pre- and post-processing functions in the pipeline. You are running the diagnostic report to compare the original data (before pre-processing) with the synthetic data (after post-processing). So we don't know whether the root cause is in your pre/post-processing scripts or in the synthesizer itself.

I would recommend trying to run the diagnostic report directly on the data that you pass into the synthesizer and the synthetic data that you get out of the synthesizer. -- i.e. run the diagnostic using the metadata, data, and synthetic data you have in this file only.

This would help you pinpoint if CTGANSynthesizer is causing the issue or if it is some other section of the pre/post-processing script.

@markuoh
Author

markuoh commented Feb 20, 2025

Hi @npatki, thank you for the reply!

I would recommend trying to run the diagnostic report directly on the data that you pass into the synthesizer and the synthetic data that you get out of the synthesizer. -- i.e. run the diagnostic using the metadata, data, and synthetic data you have in this file only.

Did you consider that, at that stage, I have already done some pre-processing on the input dataset? The pre-process file is this.
Is it okay to do that even if the input is pre-processed? If so, I'll test the diagnostic report right after the end of the fit() method.

@npatki
Contributor

npatki commented Feb 20, 2025

Hi @markuoh yup, that is being considered.

The CTGANSynthesizer guarantees that whatever is input into the synthesizer (the input to fit) will match the structure of whatever is output (the output of sample). The diagnostic score should be 1.0 here. You are passing the pre-processed data into fit, so that is what should be used for the diagnostic report, e.g.:

ctgan = CTGANSynthesizer(metadata, epochs=250, batch_size=500, verbose=True)
ctgan.fit(data) # your pre-processed data is passed into fit
synthetic_data = ctgan.sample_from_conditions(
    ...)

# this should be run on whatever is passed into the synthesizer and outputted from it
run_diagnostic(data, synthetic_data, metadata)

Anything you do outside of this (pre-processing before, post-processing after) is not really within the scope of SDV or the diagnostic report.
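If the score is still below 1.0 when comparing the synthesizer's direct input and output, a quick pandas-only check (not part of SDV; the function and variable names here are placeholders) can show exactly which column's dtype diverges:

```python
import pandas as pd

def dtype_mismatches(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """List columns whose dtype differs between the two tables.

    Columns missing from one side also show up, as NaN never equals a dtype string.
    """
    real_dtypes = real.dtypes.astype(str)
    synth_dtypes = synthetic.dtypes.astype(str)
    diff = pd.concat([real_dtypes, synth_dtypes], axis=1, keys=["real", "synthetic"])
    return diff[diff["real"] != diff["synthetic"]]

# Toy data standing in for the real and synthetic tables.
real = pd.DataFrame({"Time": [0, 1], "Amount": [1.0, 2.5], "Class": [0, 1]})
synthetic = pd.DataFrame({"Time": [0, 1], "Amount": [1.2, 3.1], "Class": [0.0, 1.0]})
print(dtype_mismatches(real, synthetic))  # only the 'Class' row is listed
```

With the offending column identified, it becomes easier to tell whether the synthesizer or the surrounding pre/post-processing changed its type.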
