diff --git a/README.md b/README.md index 99e6edb5..8202a718 100644 --- a/README.md +++ b/README.md @@ -28,10 +28,10 @@ Otherwise, if you want to install the UI version with streamlit, run: pip install syngen[ui] ``` -*Note*: see details of the UI usage in the [corresponding section](#ui-web-interface) +*Note:* see details of the UI usage in the [corresponding section](#ui-web-interface) -The training and inference processes are separated with two cli entry points. The training one receives paths to the original table, metadata json file or table name and used hyperparameters.
+The training and inference processes are split into two CLI entry points. The training entry point receives the path to the original table, the metadata json file or the table name, and the hyperparameters to use.
To start training with defaults parameters run: @@ -74,17 +74,33 @@ train --source PATH_TO_ORIGINAL_CSV \ --epochs INT \ --row_limit INT \ --drop_null BOOL \ - --print_report BOOL \ + --reports STR \ --batch_size INT ``` +*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times. +For example: +```bash +train --source PATH_TO_ORIGINAL_CSV \ + --table_name TABLE_NAME \ + --reports accuracy \ + --reports sample +``` +The accepted values for the parameter "reports": + - "none" (default) - no reports will be generated + - "accuracy" - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. This report is produced after the completion of the training process, during which a model learns to generate new data. The synthetic data generated for this report is of the same size as the original dataset to allow a more accurate comparison. + - "sample" - generates a sample report (if original data is sampled, the comparison of distributions of original data and sampled data is provided in the report) + - "metrics_only" - outputs the metrics information only to standard output without generation of an accuracy report + - "all" - generates both accuracy and sample reports
+Default value is "none". + To train one or more tables using a metadata file, you can use the following command: ```bash train --metadata_path PATH_TO_METADATA_YAML ``` -The parameters which you can set up for training process: +Parameters that you can set up for training process: - source – required parameter for training of single table, a path to the file that you want to use as a reference - table_name – required parameter for training of single table, an arbitrary string to name the directories @@ -92,7 +108,7 @@ The parameters which you can set up for training process: - row_limit – a number of rows to train over. A number less than the original table length will randomly subset the specified number of rows - drop_null – whether to drop rows with at least one missing value - batch_size – if specified, the training is split into batches. This can save the RAM -- print_report - whether to generate accuracy and sampling reports. Please note that the sampling report is generated only if the `row_limit` parameter is set. 
+- reports - controls the generation of quality reports, might require significant time for big tables (>10000 rows) - metadata_path – a path to the metadata file containing the metadata - column_types - might include the section categorical which contains the listed columns defined as categorical by a user @@ -103,7 +119,7 @@ Requirements for parameters of training process: * row_limit - data type - integer * drop_null - data type - boolean, default value - False * batch_size - data type - integer, must be equal to or more than 1, default value - 32 -* print_report - data type - boolean, default value is False +* reports - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Default value is "none". In the metadata file multiple values can be specified as a list of available options ("accuracy", "sample", "metrics_only") to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"] * metadata_path - data type - string * column_types - data type - dictionary with the key categorical - the list of columns (data type - string) @@ -117,9 +133,23 @@ infer --size INT \ --run_parallel BOOL \ --batch_size INT \ --random_seed INT \ - --print_report BOOL + --reports STR ``` +*Note:* To specify multiple options for the *--reports* parameter, you need to provide the *--reports* parameter multiple times. 
+For example: +```bash +infer --table_name TABLE_NAME \ + --reports accuracy \ + --reports metrics_only +``` +The accepted values for the parameter "reports": + - "none" (default) - no reports will be generated + - "accuracy" - generates an accuracy report that compares original and synthetic data patterns to verify the quality of the generated data + - "metrics_only" - outputs the metrics information only to standard output without generation of an accuracy report + - "all" - generates an accuracy report
+Default value is "none". + To generate one or more tables using a metadata file, you can use the following command: ```bash @@ -133,7 +163,7 @@ The parameters which you can set up for generation process: - run_parallel – whether to use multiprocessing (feasible for tables > 5000 rows) - batch_size – if specified, the generation is split into batches. This can save the RAM - random_seed – if specified, generates a reproducible result -- print_report – whether to generate accuracy and sampling reports. Please note that the sampling report is generated only if the row_limit parameter is set. +- reports - controls the generation of quality reports, might require significant time for big generated tables (>10000 rows) - metadata_path – a path to metadata file Requirements for parameters of generation process: @@ -142,13 +172,13 @@ Requirements for parameters of generation process: * run_parallel - data type - boolean, default value is False * batch_size - data type - integer, must be equal to or more than 1 * random_seed - data type - integer, must be equal to or more than 0 -* print_report - data type - boolean, default value is False +* reports - data type - if the value is passed through CLI - string, if the value is passed in the metadata file - string or list, accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Default value is "none". In the metadata file multiple values can be specified as a list of available options ("accuracy", "metrics_only") to generate multiple types of reports simultaneously * metadata_path - data type - string The metadata can contain any of the arguments above for each table. If so, the duplicated arguments from the CLI will be ignored. 
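As a rough illustration of the `reports` rules described above, the sketch below collects a repeatable `--reports` flag and normalizes a string or list value into the set of reports to produce. It is not syngen's implementation: `argparse` is used here only as a stand-in for the real CLI, the names `parse_reports` and `normalize_reports` are hypothetical, and rejecting "none" or "all" inside a multi-value list is an assumption based on the README listing only "accuracy", "sample" and "metrics_only" as combinable options.

```python
import argparse

# Values that may be combined, as listed in the README text above.
TRAIN_REPORTS = ("accuracy", "sample", "metrics_only")
INFER_REPORTS = ("accuracy", "metrics_only")


def normalize_reports(value, combinable):
    """Turn a string or a list of report names into a sorted list.

    Hypothetical helper for illustration; not part of syngen's API.
    `combinable` holds the names accepted for the given process.
    """
    values = [value] if isinstance(value, str) else list(value)
    if not values or values == ["none"]:
        # "none" (the default) means no reports are generated.
        return []
    if values == ["all"]:
        # Per the README, "all" covers the reports but not "metrics_only".
        return sorted(v for v in combinable if v != "metrics_only")
    # Assumption: "none" and "all" cannot be mixed with other values.
    unknown = [v for v in values if v not in combinable]
    if unknown:
        raise ValueError(f"Unsupported reports value(s): {unknown}")
    return sorted(set(values))


def parse_reports(argv):
    """Collect every occurrence of --reports (argparse stand-in for the CLI)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--reports", action="append", default=None)
    args = parser.parse_args(argv)
    # Fall back to the documented default when the flag is absent.
    return args.reports or ["none"]


if __name__ == "__main__":
    cli_values = parse_reports(["--reports", "accuracy", "--reports", "sample"])
    print(normalize_reports(cli_values, TRAIN_REPORTS))
```

With this sketch, a value such as `["sample"]` is rejected for the inference process, since the sample report is documented only for training.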
-Note: If you want to set the logging level, you can use the parameter log_level in the CLI call: +*Note:* If you want to set the logging level, you can use the parameter log_level in the CLI call: ```bash train --source STR --table_name STR --log_level STR @@ -159,7 +189,6 @@ infer --metadata_path STR --log_level STR where log_level might be one of the following values: TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. - ### Linked tables generation To generate one or more tables, you might provide metadata in yaml format. By providing information about the relationships @@ -167,7 +196,7 @@ between tables via metadata, it becomes possible to manage complex relationships You can also specify additional parameters needed for training and inference in the metadata file and in this case, they will be ignored in the CLI call. -Note: By using metadata file, you can also generate tables with absent relationships. +*Note:* By using metadata file, you can also generate tables with absent relationships. In this case, the tables will be generated independently. The yaml metadata file should match the following template: @@ -179,15 +208,14 @@ global: # Global settings. Optional paramete drop_null: False # Drop rows with NULL values. Optional parameter row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. 
Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"]. Might require significant time for big tables (>10000 rows). infer_settings: # Settings for infer process. Optional parameter size: 100 # Size for generated data. Optional parameter run_parallel: False # Turn on or turn off parallel training process. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows). batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter random_seed: null # If specified, generates a reproducible result. Optional parameter - get_infer_metrics: False # Whether to fetch metrics for the inference process. If the parameter 'print_report' is set to True, the 'get_infer_metrics' parameter will be ignored and metrics will be fetched anyway. Optional parameter CUSTOMER: # Table name. Required parameter train_settings: # Settings for training process. Required parameter @@ -196,7 +224,7 @@ CUSTOMER: # Table name. Required parameter drop_null: False # Drop rows with NULL values. 
Optional parameter row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"]. Might require significant time for big tables (>10000 rows). column_types: categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter - gender @@ -218,10 +246,10 @@ CUSTOMER: # Table name. Required parameter destination: "./files/generated_data_customer.csv" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in '.csv', '.avro' formats. Optional parameter size: 100 # Size for generated data. Optional parameter run_parallel: False # Turn on or turn off parallel training process. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. 
Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows). batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter random_seed: null # If specified, generates a reproducible result. Optional parameter - get_infer_metrics: False # Whether to fetch metrics for the inference process. If the parameter 'print_report' is set to True, the 'get_infer_metrics' parameter will be ignored and metrics will be fetched anyway. Optional parameter + keys: # Keys of the table. Optional parameter PK_CUSTOMER_ID: # Name of a key. Only one PK per table. type: "PK" # The key type. Supported: PK - primary key, FK - foreign key, TKN - token key @@ -261,20 +289,19 @@ ORDER: # Table name. Required parameter drop_null: False # Drop rows with NULL values. Optional parameter row_limit: null # Number of rows to train over. A number less than the original table length will randomly subset the specified rows number. Optional parameter batch_size: 32 # If specified, the training is split into batches. This can save the RAM. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates both accuracy and sample reports, "accuracy" - generates an accuracy report, "sample" - generates a sample report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously, e.g. ["metrics_only", "sample"].
Might require significant time for big tables (>10000 rows). column_types: - categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter - - gender - - marital_status + categorical: # Force listed columns to have categorical type (use dictionary of values). Optional parameter + - gender + - marital_status infer_settings: # Settings for infer process. Optional parameter destination: "./files/generated_data_order.csv" # The path where the generated data will be stored. If the information about 'destination' isn't specified, by default the synthetic data will be stored locally in '.csv'. Supported formats include local files in 'csv', '.avro' formats. Required parameter size: 100 # Size for generated data. Optional parameter run_parallel: False # Turn on or turn off parallel training process. Optional parameter - print_report: False # Turn on or turn off generation of the report. Optional parameter + reports: none # Controls the generation of quality reports. Optional parameter. Accepted values: "none" (default) - no reports will be generated, "all" - generates an accuracy report, "accuracy" - generates an accuracy report, "metrics_only" - outputs the metrics information only to standard output without generation of a report. Multiple values can be specified as a list to generate multiple types of reports simultaneously. Might require significant time for big generated tables (>10000 rows). batch_size: null # If specified, the generation is split into batches. This can save the RAM. Optional parameter random_seed: null # If specified, generates a reproducible result. Optional parameter - get_infer_metrics: False # Whether to fetch metrics for the inference process. If the parameter 'print_report' is set to True, the 'get_infer_metrics' parameter will be ignored and metrics will be fetched anyway. Optional parameter format: # Settings for reading and writing data in 'csv' format. 
Optional parameter sep: ',' # Delimiter to use. Optional parameter quotechar: '"' # The character used to denote the start and end of a quoted item. Optional parameter @@ -298,11 +325,11 @@ ORDER: # Table name. Required parameter - customer_id references: table: "CUSTOMER" - columns: + columns: - customer_id ``` -Note: +*Note:*