llm-dataset-gen

Provides a LLMDataset class for generating and adding data to .csv datasets using LLMs (OpenAI API)

Installation

Install the following packages: pip install openai==1.3.5 pandas==2.1.3 python-dotenv==1.0.0

Usage

1. Create a .env file in the root directory of the project and add your OpenAI API key to it:

OPENAI_API_KEY=<your-openai-api-key>

2. Create an empty dataset file using the create_dataset.py script

You can skip this step if you already have a dataset file

3. Create an instance of the LLMDataset class and provide a dataset_path:

from llm_dataset_gen import LLMDataset
data_filepath = "./data/Dataset.csv"
dataset = LLMDataset(dataset_path=data_filepath)

4. Call the add_data method by providing the context and num_samples parameters:

dataset_context="For Context, this dataset represents requirements engineering excerpts and their corresponding Language Construct (LC) and Language Quality (LQ) codings"
dataset.add_data(context=dataset_context, num_samples=20)

The add_data method will automatically overwrite/save the dataset file after appending the new data
The context parameter is the prompt that will be used to generate the data
The num_samples parameter is the number of data samples to generate and add to the dataset

How It Works

The LLMDataset class is designed to manage a dataset and interact with the OpenAI API to generate new data entries. By using the JSON Mode of the OpenAI API and the gpt-4-1106-preview or gpt-3.5-turbo-1106 model, it can generate new data entries (as JSON Objects) that match the structure of a given dataset, and easily append them to the dataset.

When calling the API, two messages are sent to the model: a dataset_description, and a context

The dataset_description is automatically generated by the LLMDataset class and describes the column names in the dataset, the number of data entries to generate, and how to format the data entries. This ensures that the generated data is consistent with the structure of the dataset.
The context is the prompt that is used to describe the data entries. This is provided by the user as a parameter in the add_data method.
If the dataset contains an ID column, the LLMDataset will ignore the LLM's generated ID and instead use the next available ID in the dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
create_dataset.py		create_dataset.py
llm_dataset_gen.py		llm_dataset_gen.py
main.py		main.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llm-dataset-gen

Installation

Usage

How It Works

About

Releases

Packages

Languages

License

Brandon82/llm-dataset-gen

Folders and files

Latest commit

History

Repository files navigation

llm-dataset-gen

Installation

Usage

How It Works

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages