Use an LLM to generate training data for fine-tuning an LLM
- Parse the input file (PDF, DOCX, TXT, etc.)
- Generate question-answer pairs with an LLM
- Save the data to a JSONL file
```jsonl
{"instruction": "instruction", "output": "output"}
{"instruction": "instruction", "output": "output"}
```
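The output format above is JSONL: one JSON object per line, each with an `instruction` and an `output` field. A minimal sketch of writing pairs in that format (the helper name and the sample pairs are illustrative, not taken from this project's code):

```python
import json

def save_jsonl(pairs, path):
    # Write one JSON object per line -- the JSONL format shown above
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

pairs = [
    {"instruction": "What is fine-tuning?", "output": "Adapting a pretrained model to a task."},
    {"instruction": "What is JSONL?", "output": "One JSON object per line."},
]
save_jsonl(pairs, "train.jsonl")
```

Most fine-tuning toolkits accept this format directly, which is why each record is kept on a single line.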
```shell
# for CPU
docker build -t ft-data-gen:cpu .
# for GPU
docker build -t ft-data-gen:gpu .
```
```shell
# for CPU
docker run -it --rm -v ${PWD}:/workspace ft-data-gen:cpu bash
# for GPU
docker run -it --rm -v ${PWD}:/workspace --gpus all ft-data-gen:gpu bash
```
```shell
cp .env.example .env
```

Modify the `.env` file with your own LLM model and API key. This project uses litellm to support multiple LLM providers; refer to the litellm documentation for details.
```shell
# example: local model served by Ollama
LLM_MODEL=ollama/llama3.1:70b
LLM_BASE_URL=http://localhost:11434

# example: OpenAI model
LLM_MODEL=openai/gpt-4o
LLM_API_KEY=sk-proj-.....
```
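The `.env` file is plain `KEY=VALUE` lines. A minimal sketch of loading it with the standard library only (the real script may use a package such as python-dotenv instead; this loader and the `example.env` filename are assumptions):

```python
import os

def load_env(path):
    # Parse KEY=VALUE lines, skipping blanks and # comments
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Write a sample file so the sketch is self-contained
with open("example.env", "w", encoding="utf-8") as f:
    f.write("LLM_MODEL=ollama/llama3.1:70b\nLLM_BASE_URL=http://localhost:11434\n")

cfg = load_env("example.env")
os.environ.update(cfg)  # litellm reads provider settings from the environment
```

Once the variables are in the environment, litellm routes the request to the provider named in the `LLM_MODEL` string (e.g. `ollama/...` or `openai/...`).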
Arguments:
- `--input_file`: input file path
- `--qa_num`: number of question-answer pairs to generate
- `--output_folder`: output folder

```shell
python generate_data.py --input_file data/test.txt --qa_num 2 --output_folder output
```
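`generate_data.py` itself is not reproduced here; a simplified sketch of the pipeline it implies (parse, generate, save), with the LLM call stubbed out. The function names, the stub questions, and the output filename are all assumptions for illustration:

```python
import json

def parse_input(path):
    # The real script also handles PDF and DOCX; plain text shown here
    with open(path, encoding="utf-8") as f:
        return f.read()

def generate_qa(text, qa_num):
    # Stub: the real script prompts the LLM to produce questions and answers
    return [
        {"instruction": f"Question {i + 1} about the text", "output": text[:80]}
        for i in range(qa_num)
    ]

def run(input_file, qa_num, output_file):
    text = parse_input(input_file)
    pairs = generate_qa(text, qa_num)
    # Save in the JSONL format described above
    with open(output_file, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Self-contained demo mirroring the example command's arguments
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("Sample document for QA generation.")
run("test.txt", qa_num=2, output_file="train_pipeline.jsonl")
```

With `--qa_num 2`, as in the example command, this produces two instruction/output records in the output file.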