KeyError: '0', Sampling terminated, GaussianCopulaSynthesizer #2376

Open
PieterKnops opened this issue Feb 20, 2025 · 2 comments
Labels
bug Something isn't working under discussion Issue is currently being discussed

Comments


PieterKnops commented Feb 20, 2025

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.17.4
  • Python version: 3.10.15
  • Operating System: Ubuntu 20.04.4 LTS

Error Description

I'm generating data with a GaussianCopulaSynthesizer, but during generation, it errors in the following line:

File rdt/transformers/categorical.py:188, in UniformEncoder._transform.<locals>.map_labels(label)
    187 def map_labels(label):
--> 188     return np.random.uniform(self.intervals[label][0], self.intervals[label][1])

KeyError: '0'

Steps to reproduce

I find this difficult to provide, as the data is confidential and I haven't succeeded in creating a minimal example. Furthermore, the error occurs on a different row every time I run this, even though the input does not change.

The input doesn't contain empty values and is a mix of integer, float, and categorical variables.
I run generation with sample_remaining_columns.

Has anyone experienced something like this? How can I best debug this?

Edits: Formatting

Full stack trace:

KeyError                                  Traceback (most recent call last)
File ~....venv/lib/python3.12/site-packages/sdv/single_table/base.py:1263, in BaseSingleTableSynthesizer.sample_remaining_columns(self, known_columns, max_tries_per_batch, batch_size, output_file_path)
   1262     progress_bar.set_description('Sampling remaining columns')
-> 1263     sampled = self._sample_with_conditions(
   1264         known_columns, max_tries_per_batch, batch_size, progress_bar, output_file_path
   1265     )
   1267 is_reject_sampling = hasattr(self, '_model') and not isinstance(
   1268     self._model, copulas.multivariate.GaussianMultivariate
   1269 )

File ~....venv/lib/python3.12/site-packages/sdv/single_table/base.py:1068, in BaseSingleTableSynthesizer._sample_with_conditions(self, conditions, max_tries_per_batch, batch_size, progress_bar, output_file_path)
   1067 try:
-> 1068     transformed_condition = self._data_processor.transform(
   1069         condition_df, is_condition=True
   1070     )
   1071 except ConstraintsNotMetError as error:

File ~....venv/lib/python3.12/site-packages/sdv/data_processing/data_processor.py:837, in DataProcessor.transform(self, data, is_condition)
    836 try:
--> 837     transformed = self._hyper_transformer.transform_subset(data)
    838 except (rdt.errors.NotFittedError, rdt.errors.ConfigNotSetError):

File ~....venv/lib/python3.12/site-packages/rdt/hyper_transformer.py:818, in HyperTransformer.transform_subset(self, data)
    808 """Transform a subset of the fitted data's columns.
    809 
    810 Args:
   (...)
    816         Transformed subset.
    817 """
--> 818 return self._transform(data, prevent_subset=False)

File ~....venv/lib/python3.12/site-packages/rdt/hyper_transformer.py:802, in HyperTransformer._transform(self, data, prevent_subset)
    801 for transformer in self._transformers_sequence:
--> 802     data = transformer.transform(data)
    804 transformed_columns = self._subset(self._output_columns, data.columns)

File ~....venv/lib/python3.12/site-packages/rdt/transformers/base.py:57, in random_state.<locals>.wrapper(self, *args, **kwargs)
     56 with set_random_states(self.random_states, method_name, self.set_random_state):
---> 57     return function(self, *args, **kwargs)

File ~....venv/lib/python3.12/site-packages/rdt/transformers/base.py:428, in BaseTransformer.transform(self, data)
    427 columns_data = self._get_columns_data(data, self.columns)
--> 428 transformed_data = self._transform(columns_data)
    429 data = data.drop(self.columns, axis=1)

File ~....venv/lib/python3.12/site-packages/rdt/transformers/categorical.py:190, in UniformEncoder._transform(self, data)
    188     return np.random.uniform(self.intervals[label][0], self.intervals[label][1])
--> 190 return data_with_none.map(map_labels).astype(float)

File ~....venv/lib/python3.12/site-packages/pandas/core/series.py:4700, in Series.map(self, arg, na_action)
   4625 """
   4626 Map values of Series according to an input mapping or function.
   4627 
   (...)
   4698 dtype: object
   4699 """
-> 4700 new_values = self._map_values(arg, na_action=na_action)
   4701 return self._constructor(new_values, index=self.index, copy=False).__finalize__(
   4702     self, method="map"
   4703 )

File ~....venv/lib/python3.12/site-packages/pandas/core/base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    919     return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File ~....venv/lib/python3.12/site-packages/pandas/core/algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:

File lib.pyx:2972, in pandas._libs.lib.map_infer()

File ~....venv/lib/python3.12/site-packages/rdt/transformers/categorical.py:188, in UniformEncoder._transform.<locals>.map_labels(label)
    187 def map_labels(label):
--> 188     return np.random.uniform(self.intervals[label][0], self.intervals[label][1])

KeyError: '0'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[5], line 51
     49 # For example, it goes from  .. to 28 minutes if adding educ_years_0, educ_years_1, House, OtherRealEstate, Business
     50 acs_psid_complete_sipp_first_two_to_generate_next_sipp_person = acs_psid_complete_sipp_first_two[columns_to_match_extra_sipp]
---> 51 extra_generated_persons_sipp = sipp_generator_groups.sample_remaining_columns(
     52     known_columns=acs_psid_complete_sipp_first_two_to_generate_next_sipp_person.dropna(),
     53     max_tries_per_batch=20,
     54     batch_size=20
     55 )
     57 extra_generated_persons_sipp[[col for col in extra_generated_persons_sipp.columns if col[-2:] == "_2"]]
     58 extra_generated_persons_sipp.index = extra_generated_persons_sipp.index.to_series().apply(lambda x: int(x))

File ~....venv/lib/python3.12/site-packages/sdv/single_table/base.py:1279, in BaseSingleTableSynthesizer.sample_remaining_columns(self, known_columns, max_tries_per_batch, batch_size, output_file_path)
   1271     check_num_rows(
   1272         num_rows=len(sampled),
   1273         expected_num_rows=len(known_columns),
   1274         is_reject_sampling=is_reject_sampling,
   1275         max_tries_per_batch=max_tries_per_batch,
   1276     )
   1278 except (Exception, KeyboardInterrup
-> 1279     handle_sampling_error(output_file_path, error)
   1281 return sampled

File ~....venv/lib/python3.12/site-packages/sdv/single_table/utils.py:106, in handle_sampling_error(output_file_path, sampling_error)
    100     error_msg = (
    101         'Error: Sampling terminated. No results were saved due to unspecified '
    102         '"output_file_path".'
    103     )
    105 if error_msg:
--> 106     raise type(sampling_error)(error_msg) from sampling_error
    108 raise sampling_error

KeyError: 'Error: Sampling terminated. No results were saved due to unspecified "output_file_path".'
@PieterKnops PieterKnops added bug Something isn't working new Automatic label applied to new issues labels Feb 20, 2025
Author

PieterKnops commented Feb 20, 2025

After more investigation, it seems that rows containing new categories (unseen during fitting) are prone to this error.
It happens with roughly 10% probability for these rows. Other rows are sampled fine.

Could new categories sometimes be incorrectly set to 0? That might explain the error raised in these lines of rdt/transformers/categorical.py:
    187 def map_labels(label):
--> 188     return np.random.uniform(self.intervals[label][0], self.intervals[label][1])

The error is KeyError: '0', which makes me think the label is being set to '0', possibly unexpectedly.
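
To make the suspected failure mode concrete, here is a minimal sketch of an interval lookup similar in shape to the line in the traceback. The `intervals` dictionary and `map_label` function are hypothetical stand-ins, not RDT's actual internals. It shows that a lookup table keyed by the integer 0 raises KeyError: '0' when asked for the string '0', and that NumPy turns integers into strings when a list mixes integers and strings. Whether such a coercion actually happens inside RDT is an assumption that this thread does not confirm.

```python
import numpy as np

# Hypothetical stand-in for a fitted encoder's label -> (low, high) table.
# The keys are the integers 0 and 1, as they might be after fitting on an
# integer-valued categorical column.
intervals = {0: (0.0, 0.6), 1: (0.6, 1.0)}


def map_label(label):
    """Draw a uniform value from the interval assigned to `label`."""
    low, high = intervals[label]
    return np.random.uniform(low, high)


value = map_label(0)  # works: the integer 0 is a key

try:
    map_label('0')  # the string '0' is a different key from the integer 0
    lookup_failed = False
    missing_key = None
except KeyError as error:
    lookup_failed = True
    missing_key = error.args[0]

# NumPy coerces a mixed int/str list to a string array, so an integer label
# that passes through such an array comes back as a string.
coerced = np.array([0, 'a']).tolist()
```

If the real column mixes integer and string categories, comparing `type(v)` for each category in the training data against the types in `known_columns` may be a quick first check.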

Contributor

npatki commented Feb 20, 2025

Hi @PieterKnops, nice to meet you. I understand that your data is too sensitive to share. For debugging purposes, it would help to see the code you're running to instantiate, fit, and sample from your GaussianCopulaSynthesizer.

For example, are you doing something like this? Are you adding any other customizations to your synthesizer (such as constraints, updating transformers, etc.)?

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample_remaining_columns(??)

Additionally, are you able to share your metadata (that just contains column/table names)? You can also anonymize your metadata before sharing. This would greatly speed up the debugging process.

it seems like rows containing new categories (unseen during fitting) are prone to this error.
It happens with maybe a 10% probability for these rows. Other rows are sampled fine.

This makes sense, as the conditions that you provide to your synthesizer should be within the bounds of whatever was passed in during fit. For categorical data, this means that it must be a category value that was present in fit. Otherwise, the synthesizer doesn't know what to do with the category value, as it has not learned any associated patterns about it.
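
As a rough way to check this before sampling, the sketch below compares the categorical values in the conditions against those in the training data using plain pandas. The function names, the example frames, and the `BASIC`/`DELUXE`/`SUITE` values are made up for illustration; only `room_type` and `TEST` come from the warning output in this thread. It does not call any SDV API.

```python
import pandas as pd


def find_unseen_categories(train_data, known_columns, categorical_columns):
    """Return {column: values present in known_columns but absent from train_data}."""
    unseen = {}
    for column in categorical_columns:
        if column not in known_columns.columns:
            continue
        seen = set(train_data[column].dropna().unique())
        candidates = set(known_columns[column].dropna().unique())
        new_values = candidates - seen
        if new_values:
            unseen[column] = new_values
    return unseen


def drop_rows_with_unseen(train_data, known_columns, categorical_columns):
    """Keep only the condition rows whose categorical values were all seen in training."""
    keep = pd.Series(True, index=known_columns.index)
    unseen = find_unseen_categories(train_data, known_columns, categorical_columns)
    for column, values in unseen.items():
        keep &= ~known_columns[column].isin(values)
    return known_columns[keep]


# Illustrative data only.
train = pd.DataFrame({'room_type': ['BASIC', 'DELUXE', 'SUITE'], 'code': [0, 1, 0]})
conditions = pd.DataFrame({'room_type': ['BASIC', 'TEST', 'SUITE'], 'code': [0, 1, '0']})

unseen = find_unseen_categories(train, conditions, ['room_type', 'code'])
filtered = drop_rows_with_unseen(train, conditions, ['room_type', 'code'])
```

Note that the string '0' counts as unseen here even though the integer 0 was in the training data, so this check also surfaces type mismatches between the two frames.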

However, I'm not sure if that's the root cause of your issue. If I pass in new category values, I just see some warnings -- but I do see that synthetic data for other (valid) rows is correctly sampled.

Sampling remaining columns:  33%|███▎      | 1/3 [00:08<00:16,  8.42s/it]
/usr/local/lib/python3.11/dist-packages/rdt/transformers/categorical.py:175: UserWarning: The data in column 'room_type' contains new categories that did not appear during 'fit' (TEST). Assigning them random values. If you want to model new categories, please fit the data again using 'fit'.
  warnings.warn(
Sampling remaining columns:  33%|███▎      | 1/3 [00:15<00:30, 15.17s/it]
/usr/local/lib/python3.11/dist-packages/sdv/single_table/utils.py:154: UserWarning: Only able to sample 1 rows for the given conditions. To sample more rows, try increasing `max_tries_per_batch` (currently: 100). Note that increasing this value will also increase the sampling time.
  warnings.warn(user_msg)
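
Since the failing row changes between runs, one way to narrow things down might be to pass the conditions one row at a time and record which rows raise. The sketch below assumes a hypothetical `sample_fn` callable; in practice it could wrap `synthesizer.sample_remaining_columns(known_columns=row)`, the call shown in the traceback. Here a fake function stands in for it so the example runs without SDV.

```python
import pandas as pd


def find_failing_rows(sample_fn, known_columns):
    """Call sample_fn on one-row frames; return {index label: repr of the error}."""
    failures = {}
    for position in range(len(known_columns)):
        row = known_columns.iloc[[position]]  # double brackets keep a DataFrame
        try:
            sample_fn(row)
        except Exception as error:
            failures[row.index[0]] = repr(error)
    return failures


def fake_sample(row):
    """Stand-in for the real sampling call; fails on the string label '0'."""
    if row['code'].iloc[0] == '0':
        raise KeyError('0')
    return row


conditions = pd.DataFrame({'code': [0, '0', 1]}, index=['a', 'b', 'c'])
failures = find_failing_rows(fake_sample, conditions)
```

Because the reported error only occurs some of the time, each row may need to be tried several times before it fails, so a single clean pass does not prove a row is safe.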

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Feb 20, 2025