Provides a LLMDataset
class for generating and adding data to .csv
datasets using LLMs (OpenAI API)
Install the following packages:
pip install openai==1.3.5 pandas==2.1.3 python-dotenv==1.0.0
1. Create a .env file in the root directory of the project and add your OpenAI API key to it:
OPENAI_API_KEY=<your-openai-api-key>
2. Create an empty dataset file using the create_dataset.py
script
You can skip this step if you already have a dataset file
3. Create an instance of the LLMDataset
class and provide a dataset_path
:
from llm_dataset_gen import LLMDataset
data_filepath = "./data/Dataset.csv"
dataset = LLMDataset(dataset_path=data_filepath)
4. Call the add_data
method by providing the context
and num_samples
parameters:
dataset_context="For Context, this dataset represents requirements engineering excerpts and their corresponding Language Construct (LC) and Language Quality (LQ) codings"
dataset.add_data(context=dataset_context, num_samples=20)
- The
add_data
method will automatically overwrite/save the dataset file after appending the new data - The
context
parameter is the prompt that will be used to generate the data - The
num_samples
parameter is the number of data samples to generate and add to the dataset
The LLMDataset
class is designed to manage a dataset and interact with the OpenAI API to generate new data entries. By using the JSON Mode of the OpenAI API and the gpt-4-1106-preview
or gpt-3.5-turbo-1106
model, it can generate new data entries (as JSON Objects) that match the structure of a given dataset, and easily append them to the dataset.
When calling the API, two messages are sent to the model: a dataset_description
, and a context
- The
dataset_description
is automatically generated by theLLMDataset
class and describes the column names in the dataset, the number of data entries to generate, and how to format the data entries. This ensures that the generated data is consistent with the structure of the dataset. - The
context
is the prompt that is used to describe the data entries. This is provided by the user as a parameter in theadd_data
method. - If the dataset contains an
ID
column, theLLMDataset
will ignore the LLM's generated ID and instead use the next available ID in the dataset.