
Diagnostic Report not 100% (Data Structure Score <100%) #2375

Open
markuoh opened this issue Feb 19, 2025 · 6 comments
Labels
bug Something isn't working under discussion Issue is currently being discussed

Comments

@markuoh

markuoh commented Feb 19, 2025

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.18.0
  • Python version: 3.9
  • Operating System: MacOS Sequoia 15.3

Error Description

I'm writing my thesis on synthetic data generation for fraud detection systems. I'm using Kaggle's credit card fraud detection dataset (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and SDV to evaluate the quality of the generated synthetic data. However, when I execute "run_diagnostic" I get 100% on Data Validity but only 93.75% on Data Structure. I don't know whether the dtypes of the values inside the cells are relevant (e.g. a float/integer mismatch), but the two datasets have the same column names, which (per the official SDV docs) is the only factor I could find that affects the Data Structure score.

Steps to reproduce

I put together a modular workflow in Python consisting of:

  0. Analyze the original dataset
  1. Pre-process the original dataset
  2. Analyze the pre-processed dataset
  3. Train a CTGAN and sample data
  4. Post-process the synthetic dataset
  5. Analyze the post-processed dataset
  6. Evaluate metrics

I did not put the code in a Git repo, so I don't know the most convenient way to share the .py files I'm using for each phase.
Just let me know how I can help.

@markuoh markuoh added bug Something isn't working new Automatic label applied to new issues labels Feb 19, 2025
@srinify
Contributor

srinify commented Feb 19, 2025

Hi there @markuoh, do you mind also sharing the details from your DiagnosticReport object, specifically for 'Data Structure'?

diagnostic_report.get_details(property_name='Data Structure')

If you'd like to share the code and it's spread out over multiple files, you could create GitHub gists and link to them: https://gist.github.com/

I will try to replicate without your code though.

@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Feb 19, 2025
@srinify srinify changed the title [run_diagnostics] Data Structure Score <100% Diagnostic Report not 100% (Data Structure Score <100%) Feb 19, 2025
@markuoh
Author

markuoh commented Feb 19, 2025

        Metric      Score
0  TableStructure  0.9375

That's the only output I get when running the get_details method.
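For what it's worth, a score of 0.9375 is consistent with a single dtype mismatch among the dataset's 31 columns, if the metric is computed as a Jaccard similarity over (column name, dtype) pairs. That computation is an assumption about the SDMetrics internals, not something confirmed in this thread, but the arithmetic lines up:

```python
# Assumption (not confirmed here): the Data Structure score is a Jaccard
# similarity over (column name, dtype) pairs between real and synthetic data.
n_columns = 31   # Kaggle creditcardfraud: Time, V1..V28, Amount, Class
mismatched = 1   # one column whose dtype differs (e.g. int64 vs float64)

shared = n_columns - mismatched   # pairs present in both tables
union = n_columns + mismatched    # each mismatched column contributes two distinct pairs
score = shared / union
print(score)  # 0.9375
```

Under that assumption, 30 shared pairs out of 32 total gives exactly the 93.75% the report shows.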

@npatki
Contributor

npatki commented Feb 20, 2025

Hi there @markuoh and @srinify, chiming in here.

As @markuoh mentions, the TableStructure metric looks for matches in both the column names and the dtypes, to ensure that the output synthetic data has exactly the same structure as the input real data. So if the column names are the same, that would indeed imply that the dtypes are causing the problem.
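This check can be approximated with pandas alone (a minimal sketch; the exact SDMetrics internals may differ, and the function name here is made up for illustration):

```python
import pandas as pd

def table_structure_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Approximate a TableStructure-style score as the Jaccard similarity
    of (column name, dtype) pairs between the two tables."""
    real_pairs = {(col, str(dtype)) for col, dtype in real.dtypes.items()}
    synth_pairs = {(col, str(dtype)) for col, dtype in synthetic.dtypes.items()}
    return len(real_pairs & synth_pairs) / len(real_pairs | synth_pairs)

# Toy example: same column names, but 'Class' drifts from int to float.
real = pd.DataFrame({"Amount": [1.0, 2.5], "Class": [0, 1]})
synthetic = pd.DataFrame({"Amount": [1.2, 3.1], "Class": [0.0, 1.0]})

print(table_structure_score(real, synthetic))  # 1 shared pair out of 3 -> ~0.333
```

Identical column names with one dtype mismatch is enough to pull the score below 1.0, which matches the symptom described above.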

From looking at your code, I think you are applying your own pre- and post-processing functions in the pipeline. You are running the diagnostic report to compare the original data (before pre-processing) with the synthetic data (after post-processing). So we don't know whether the root cause is in your pre/post-processing scripts or in the synthesizer itself.

I would recommend trying to run the diagnostic report directly on the data that you pass into the synthesizer and the synthetic data that you get out of the synthesizer. -- i.e. run the diagnostic using the metadata, data, and synthetic data you have in this file only.

This would help you pinpoint if CTGANSynthesizer is causing the issue or if it is some other section of the pre/post-processing script.

@markuoh
Author

markuoh commented Feb 20, 2025

Hi @npatki, thank you for the reply!

I would recommend trying to run the diagnostic report directly on the data that you pass into the synthesizer and the synthetic data that you get out of the synthesizer. -- i.e. run the diagnostic using the metadata, data, and synthetic data you have in this file only.

Did you consider that, at that stage, I have already done some pre-processing on the input dataset? The pre-process file is this.
Is it okay to do that even if the input is pre-processed? If so, I'll test the diagnostic report right after the end of the fit() method.

@npatki
Contributor

npatki commented Feb 20, 2025

Hi @markuoh yup, that is being considered.

The CTGANSynthesizer guarantees that whatever is input into the synthesizer (the input to fit) will match the structure of whatever is output (the output of sample). The diagnostic score should be 1.0 here. You are passing the pre-processed data into fit, so that is what should be used for the diagnostic report, e.g.:

ctgan = CTGANSynthesizer(metadata, epochs=250, batch_size=500, verbose=True)
ctgan.fit(data) # your pre-processed data is passed into fit
synthetic_data = ctgan.sample_from_conditions(
    ...)

# this should be run on whatever is passed into the synthesizer and outputted from it
run_diagnostic(data, synthetic_data, metadata)

Anything you do outside of this (pre-processing before, post-processing after) is not really within the scope of SDV or the diagnostic report.
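If the score is still below 1.0 when comparing the synthesizer's direct input and output, a quick pandas-only check (not part of SDV; the function and variable names here are placeholders) can show exactly which column's dtype diverges:

```python
import pandas as pd

def dtype_mismatches(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """List columns whose dtype differs between the two tables.

    Columns missing from one side also show up, as NaN never equals a dtype string.
    """
    real_dtypes = real.dtypes.astype(str)
    synth_dtypes = synthetic.dtypes.astype(str)
    diff = pd.concat([real_dtypes, synth_dtypes], axis=1, keys=["real", "synthetic"])
    return diff[diff["real"] != diff["synthetic"]]

# Toy data standing in for the real and synthetic tables.
real = pd.DataFrame({"Time": [0, 1], "Amount": [1.0, 2.5], "Class": [0, 1]})
synthetic = pd.DataFrame({"Time": [0, 1], "Amount": [1.2, 3.1], "Class": [0.0, 1.0]})
print(dtype_mismatches(real, synthetic))  # only the 'Class' row is listed
```

With the offending column identified, it becomes easier to tell whether the synthesizer or the surrounding pre/post-processing changed its type.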
