Diagnostic Report not 100% (Data Structure Score <100%) #2375
Comments
Hi there @markuoh, do you mind also returning the details on your DiagnosticReport object, specifically for 'Data Structure'?
If you'd like to share the code and it's spread out over multiple files, you could create GitHub gists and link to them: https://gist.github.com/ I will try to replicate without your code though.
That's the only output I get when running the
Hi there @markuoh and @srinify, chiming in here. As @markuoh mentions, the TableStructure metric looks for matches in both the column names and the dtypes to ensure that the synthetic data output has exactly the same structure as the real data input. So if the column names are the same, that would indeed imply that the dtypes are causing the problem.

From looking at your code, I think you are applying your own pre- and post-processing functions in the pipeline. You are running the diagnostic report to compare the original data (before pre-processing) and the synthetic data (after post-processing), so we don't know whether the root cause is in your pre/post-processing script or in the synthesizer itself.

I would recommend running the diagnostic report directly on the data that you pass into the synthesizer and the synthetic data that you get out of it, i.e., run the diagnostic using the metadata, data, and synthetic data you have in this file only. This would help you pinpoint whether CTGANSynthesizer is causing the issue or whether it is some other section of the pre/post-processing script.
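To make the column name and dtype comparison concrete, here is a minimal sketch of how one might list the differences by hand. It is not the SDV implementation: `structure_mismatches` is a hypothetical helper, and the dtypes below are made up for illustration. It works on plain `{column name: dtype string}` mappings, which with pandas can be obtained via `df.dtypes.astype(str).to_dict()`.

```python
def structure_mismatches(real_dtypes, synthetic_dtypes):
    """Compare two {column name: dtype string} mappings (hypothetical helper)."""
    real_cols = set(real_dtypes)
    synth_cols = set(synthetic_dtypes)
    return {
        # Columns present in the real data only
        "missing_in_synthetic": sorted(real_cols - synth_cols),
        # Columns present in the synthetic data only
        "extra_in_synthetic": sorted(synth_cols - real_cols),
        # Shared columns whose dtype differs: {column: (real, synthetic)}
        "dtype_mismatch": {
            col: (real_dtypes[col], synthetic_dtypes[col])
            for col in sorted(real_cols & synth_cols)
            if real_dtypes[col] != synthetic_dtypes[col]
        },
    }


# Made-up dtypes, for illustration only
real = {"Time": "float64", "Amount": "float64", "Class": "int64"}
synthetic = {"Time": "float64", "Amount": "float64", "Class": "float64"}

report = structure_mismatches(real, synthetic)
print(report["dtype_mismatch"])  # {'Class': ('int64', 'float64')}
```

Running this once on the data passed into the synthesizer and once on the data before pre-processing should show at which stage a dtype changes.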
Hi @npatki, thank you for the reply!
Did you consider the fact that, in that stage, I already did some pre-processing on the input dataset? The pre-process file is this.
Hi @markuoh, yup, that is being considered. The CTGANSynthesizer guarantees that whatever is input into the synthesizer (input in
ctgan = CTGANSynthesizer(metadata, epochs=250, batch_size=500, verbose=True)
ctgan.fit(data) # your pre-processed data is passed into fit
synthetic_data = ctgan.sample_from_conditions(
...)
# this should be run on whatever is passed into the synthesizer and outputted from it
run_diagnostic(data, synthetic_data, metadata)
Anything you do outside of this (pre-processing before, post-processing after) is not really within the scope of SDV or the diagnostic report.
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
I'm writing my thesis on Synthetic Data Generation for Fraud Detection Systems. I'm using Kaggle's credit card fraud detection dataset (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and SDV to evaluate the quality of the generated synthetic data, but when I execute "run_diagnostic" I get 100% on Data Validity but only 93.75% on Data Structure. I don't know whether the type of the data inside the cells of the dataset might be relevant (e.g., a float/integer mismatch), but the datasets have the same column names, which (as per the official SDV docs) is the only criterion I could find for the Data Structure Score.
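As a rough sanity check on the 93.75% figure, here is a small sketch of the arithmetic. It assumes the score is the overlap of (column name, dtype) pairs divided by their union; that formula is an assumption and is not confirmed in this thread. `structure_score` is a hypothetical helper, the 31-column layout is that of the public Kaggle file (the pre-processed data may differ), and the choice of `Class` as the mismatched column is purely illustrative.

```python
def structure_score(real_dtypes, synthetic_dtypes):
    """Overlap of (column, dtype) pairs over their union (assumed formula)."""
    real_pairs = set(real_dtypes.items())
    synth_pairs = set(synthetic_dtypes.items())
    union = real_pairs | synth_pairs
    if not union:
        return 1.0  # two empty tables trivially share the same structure
    return len(real_pairs & synth_pairs) / len(union)


# Column layout of the public Kaggle file: Time, V1..V28, Amount, Class
columns = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
real = {col: "float64" for col in columns}
real["Class"] = "int64"

# Illustrative assumption: exactly one column changes dtype
synthetic = dict(real)
synthetic["Class"] = "float64"

print(structure_score(real, synthetic))  # 30 shared pairs / 32 total = 0.9375
```

Under these assumptions, a single dtype mismatch among 31 identically named columns gives exactly 93.75%, which would be consistent with the float/integer suspicion above.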
Steps to reproduce
I put up a modular workflow in Python consisting of:
0. Analyze original dataset
I did not put the code in a Git repo, so I don't know what the most convenient way is to share the .py files I'm using for each phase.
Just let me know how I can be of any help.