This tool performs data operations such as categorization, normalization, and enrichment using OpenAI's GPT-3.5-turbo model.
- Python 3.7 or above installed
- An OpenAI API key. You can get one by creating an account at https://beta.openai.com/signup/.
All required Python packages are listed in the requirements.txt file. You can install them all by running the following command in your terminal:
python3 -m pip install -r requirements.txt
The tool uses a YAML configuration file named config.yaml where you define the tasks you want to execute. The tasks include:
- categorize: Categorizes provided responses according to predefined categories
- normalize: Normalizes text data according to specified rules
- enrich: Enriches the data with additional relevant information
Each task should be specified in the following way:
tasks:
  - operation: "operation_name"       # one of "categorize", "normalize", "enrich"
    context: "context_for_operation"  # context or rules for the operation
    input_file: "input_file.csv"      # input file containing the data
    output_file: "output_file.csv"    # output file where results will be stored
    data_column: "column_name"        # name of the CSV column that contains the data to be processed
    model: "gpt-3.5-turbo-16k"        # OpenAI model to use
    batch_size: 50                    # number of rows to process at a time
    id_column: "id"                   # name of the ID column
    subcategories:                    # "categorize" operation only: list of possible categories
      - name: category_name
        description: category_description
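The batch_size setting above means the input CSV is processed in chunks of rows rather than all at once. As a rough illustration (not the script's actual code), here is how chunked reading might look with pandas; process_batch is a hypothetical stand-in for the call that sends one batch to the model:

import pandas as pd

BATCH_SIZE = 50  # mirrors batch_size in config.yaml

def process_batch(batch):
    # hypothetical placeholder for the OpenAI call on one batch of rows
    return batch

# pandas streams the input CSV in chunks of BATCH_SIZE rows
for batch in pd.read_csv("input_file.csv", chunksize=BATCH_SIZE):
    process_batch(batch)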
To run the tool, you will need to set the OPENAI_API_KEY environment variable to your OpenAI API key. You can do this in the terminal by running:
export OPENAI_API_KEY='your-api-key'
Or you can set it programmatically before running the script:
import os
# set the key before the OpenAI client is imported/initialized so it is picked up
os.environ["OPENAI_API_KEY"] = 'your-api-key'
To execute the tasks, run the script as follows:
python3 clean_data.py
The output of each task is written to the CSV file specified in that task's output_file parameter in config.yaml. Results are appended to any existing data in the file.
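For reference, append-style CSV writing with pandas might look like the sketch below (an illustration under the assumption that the script uses pandas; append_results is a hypothetical helper):

import os
import pandas as pd

def append_results(df, output_file):
    # write the header only if the file does not exist yet, then append rows
    write_header = not os.path.exists(output_file)
    df.to_csv(output_file, mode="a", header=write_header, index=False)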
In case of errors or exceptions, the script prints an error message to the console. The most common exceptions are OpenAI's ServiceUnavailableError (raised when the OpenAI API is temporarily unavailable) and KeyError (raised when the configuration references an operation name that does not exist).
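If ServiceUnavailableError comes up often, wrapping calls in a simple retry loop can help. A minimal sketch, assuming the pre-1.0 openai Python package (where the exception lives in openai.error); call_with_retries is a hypothetical helper, not part of the script:

import time
import openai

def call_with_retries(fn, max_retries=3, backoff=2.0):
    # retry a callable on transient OpenAI outages, backing off exponentially
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.error.ServiceUnavailableError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))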
Here's an example config.yaml that uses all three operations:
tasks:
  - operation: "categorize"
    context: "What is your job title?"
    input_file: "random_job_titles.csv"
    output_file: "job_title_processed.csv"
    data_column: "job_title"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
    subcategories:
      - name: Executive Leadership
        description: Roles at the senior-most level of an organization, responsible for overall strategic direction and decision-making
      - name: Other
        description: Any role that does not fit into one of the other categories mentioned above.
  - operation: "normalize"
    input_file: "denormalized_data.csv"
    output_file: "normalized2.csv"
    data_column: "text"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
  - operation: "enrich"
    input_file: "random_job_titles.csv"
    output_file: "enriched_data.csv"
    data_column: "job_title"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
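Internally, a configuration like this can be loaded with PyYAML and dispatched task by task. A rough sketch of the pattern (run_task is a hypothetical dispatcher, not the script's actual function):

import yaml

def run_task(task):
    # hypothetical dispatcher; the real script maps operation names to handlers
    print(f"Running {task['operation']} on {task['input_file']}")

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for task in config["tasks"]:
    run_task(task)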