JSONL generator for discord logs, which can be used to fine-tune OpenAI models.
With the release of OpenAI API, it's now possible (and easier than ever) to use their models in your application. One use is creating an AI bot of yourself, this repository gathers training data from discord logs and cleans them into a usable JSONL file for openAI CLI data preparation tool.
OpenAI requires a JSONL file to finetune data, which looks something like this:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
This program uses the reply source as the "prompt"
, and your reply message as the "completion"
Currently data is only cleaned by word length (excluding @s), further optimization can be done to better select quality message for training.
- DiscordChatExporter
- openai
pip install openai
- Export as many discord text channels as you want with DiscordChatExporter as JSON files.
- Clone the repo
git clone https://github.com/quinnha/openAI-discord-data-generator.git
- Create a new folder called
/log
cd openAI-discord-data-generator mkdir log
- Drop your JSON files from step 1 into
/log
- Edit USER_NAME and DISCRIMINATOR in
parse.py
to your discord tag. For example, if your discord tag isquinnha#1234
, it should look like
USER_NAME = "quinnha"
DISCRIMINATOR = "1234"
- Run
parse.py
python parse.py
Output should be similar
Operation complete, find the log in output.jsonl
Replies Parsed: 1440
Words Parsed: 29896
Approximate Tokens: 39861
- Run OpenAI's CLI data preparation tool to confirm correctness
openai tools fine_tunes.prepare_data -f output.jsonl
Lastly, follow openAI Finetuning API documentation to create and use your model!
- Compare quality of completions of discord reply data to a hand-picked set of prompts and ansewrs
- Determine logic to better parse/clean data
- Create a GUI to use