This tool performs data operations such as categorization, normalization, and enrichment using OpenAI's GPT-3.5-turbo model.
- Python 3.7 or above installed
- An OpenAI API key. You can get one by creating an account at https://beta.openai.com/signup/.
All required Python packages are listed in the requirements.txt file. You can install them all by running the following command in your terminal:
python3 -m pip install -r requirements.txt
The tool uses a YAML configuration file named config.yaml where you define the tasks you want to execute. The tasks include:
- categorize: Categorizes provided responses according to predefined categories
- normalize: Normalizes text data according to specified rules
- enrich: Enriches the data with additional relevant information
Each task should be specified in the following way:
tasks:
  - operation: "operation_name"       # one of "categorize", "normalize", "enrich"
    context: "context_for_operation"  # context or rules for the operation
    input_file: "input_file.csv"      # input file containing the data
    output_file: "output_file.csv"    # output file where results will be stored
    data_column: "column_name"        # name of the CSV column that contains the data to be processed
    model: "gpt-3.5-turbo-16k"        # OpenAI model to use
    batch_size: 50                    # number of rows to process at a time
    id_column: "id"                   # name of the ID column
    subcategories:                    # "categorize" operation only: list of possible categories
      - name: category_name
        description: category_description
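The batch_size setting above means the input CSV is processed in chunks of rows rather than all at once. As a rough illustration (not the script's actual code), here is how chunked reading might look with pandas; process_batch is a hypothetical stand-in for the call that sends one batch to the model:

import pandas as pd

BATCH_SIZE = 50  # mirrors batch_size in config.yaml

def process_batch(batch):
    # hypothetical placeholder for the OpenAI call on one batch of rows
    return batch

# pandas streams the input CSV in chunks of BATCH_SIZE rows
for batch in pd.read_csv("input_file.csv", chunksize=BATCH_SIZE):
    process_batch(batch)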
To run the tool, you will need to set the OPENAI_API_KEY environment variable to your OpenAI API key. You can do this in the terminal by running:
export OPENAI_API_KEY='your-api-key'
Or you can set it programmatically before running the script:
import os
# set the key before the OpenAI client is imported/initialized so it is picked up
os.environ["OPENAI_API_KEY"] = 'your-api-key'
To execute the tasks, run the script as follows:
python3 clean_data.py
The output of each task is written to the CSV file specified in that task's output_file parameter in config.yaml. Results are appended to any existing data in the file.
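For reference, append-style CSV writing with pandas might look like the sketch below (an illustration under the assumption that the script uses pandas; append_results is a hypothetical helper):

import os
import pandas as pd

def append_results(df, output_file):
    # write the header only if the file does not exist yet, then append rows
    write_header = not os.path.exists(output_file)
    df.to_csv(output_file, mode="a", header=write_header, index=False)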
In case of errors or exceptions, the script prints an error message to the console. The most common exceptions are OpenAI's ServiceUnavailableError (raised when the OpenAI API is temporarily unavailable) and KeyError (raised when the configuration references an operation name that does not exist).
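If ServiceUnavailableError comes up often, wrapping calls in a simple retry loop can help. A minimal sketch, assuming the pre-1.0 openai Python package (where the exception lives in openai.error); call_with_retries is a hypothetical helper, not part of the script:

import time
import openai

def call_with_retries(fn, max_retries=3, backoff=2.0):
    # retry a callable on transient OpenAI outages, backing off exponentially
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.error.ServiceUnavailableError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))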
Here's an example config.yaml that uses all three operations:
tasks:
  - operation: "categorize"
    context: "What is your job title?"
    input_file: "random_job_titles.csv"
    output_file: "job_title_processed.csv"
    data_column: "job_title"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
    subcategories:
      - name: Executive Leadership
        description: Roles at the senior-most level of an organization, responsible for overall strategic direction and decision-making
      - name: Other
        description: Any role that does not fit into one of the other categories mentioned above.
  - operation: "normalize"
    input_file: "denormalized_data.csv"
    output_file: "normalized2.csv"
    data_column: "text"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
  - operation: "enrich"
    input_file: "random_job_titles.csv"
    output_file: "enriched_data.csv"
    data_column: "job_title"
    model: "gpt-3.5-turbo-16k"
    batch_size: 50
    id_column: "id"
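Internally, a configuration like this can be loaded with PyYAML and dispatched task by task. A rough sketch of the pattern (run_task is a hypothetical dispatcher, not the script's actual function):

import yaml

def run_task(task):
    # hypothetical dispatcher; the real script maps operation names to handlers
    print(f"Running {task['operation']} on {task['input_file']}")

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for task in config["tasks"]:
    run_task(task)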