A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT,LLaMA,Alpaca). We currently include three types of dataset:
- visual-instruction-tuning (e.g. image-instruction-answer)
- text-instruction-tuning datasets.
- red-teaming | Reinforcement Learning from Human Feedback (RLHF) Datasets
Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.
Lists of codebse to train your LLMs:
- nichtdax/awesome-totally-open-chatgpt: A codebase of totally open alternatives to ChatGPT
Size: The number of instruction tuning pairs
Lingual-Tags:
- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages
Task-Tags:
- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks
Generation-method:
- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Dataset contains both human and machine generated data
- COL: [Collection of Dataset] Dataset made from a collection of other datasets
- The template
- The Multi-modal Instruction Dataset
- The Instruction tuning Dataset
- (tatsu-lab/Alpaca)|52K|EN|MT|SI
- (gururise/Cleaned Alpaca)|52K|EN|MT|SI
- (XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI
- (JosephusCheung/GuanacoDataset)|534K|ML|MT|SI
- (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
- (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
- (allenai/prosocial-dialog)|58K|EN|MT|MIX
- (allenai/natural-instructions)|1.6K|ML|MT|HG
- (bigscience/xP3)|N/A|ML|MT|MIX
- (nomic-ai/gpt4all)|437k|EN|MT|COL
- (PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL
- (google-research/FLAN)|N/A|EN|MT|MIX
- (thunlp/UltraChat)|280k|EN|TS|MIX
- (cascip/ChatAlpaca)|10k|EN|MT|MIX
- (YeungNLP/firefly-train-1.1M)|1100k|CN|MT|COL
- (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
- (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI
- (databrickslabs/dolly)|15K|EN|MT|HG
- (OpenAssistant/oasst1)|161K|ML|MT|HG
- (RyokoAI/ShareGPT52K)|90K|ML|MT|SI
- Reinforcement Learning from Human Feedback (RLHF) Datasets
- License that Allows Commercial Use
Append the new project at the end of file
## [({owner}/{project-name)|Tags}]{https://github.com/link/to/project}
- summary:
- Data generation model:
- paper:
- License:
- Related: (if applicable)
- Summary: A high-quality, well-aligned (e.g. more detailed image desciption) image-text dataset created using conversation between two bots, similar to ChatCaptioner. This image-text dataset can then be used with some predefined instruction template for image-instruction-answer finetuning.
- Modality: Text, Image
- Data generation model: N/A
- paper: MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- License:
BSD 3-Clause
- Related:
- Summary: LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal towards GPT-4 vision/language capability.
- Modality: Text, Image
- Data generation model:
GPT-4-0314
- paper: Visual Instruction Tuning
- License:
CC BY-NC 4.0
- Summary:
52K
data generated from modifiedself-instruct
pipeline with human written175 seed task
. - Data generation model:
text-davinci-003
- paper: alpaca-blog
- License:
CC BY-NC 4.0
- Summary: A project that manually cleaned the Alpaca 52K Dataset
- Data generation model:
text-davinci-003
- paper: N/A
- License:
CC BY-NC 4.0
- Summary:
52K
data generated from modifiedself-instruct
pipeline with human written429 seed task
. - Data generation model:
text-davinci-003
- paper: N/A
- License: InstructWild dataset is intended for non-commercial research purpose only.
- Summary:
52K
instruction data generated from modifiedself-instruct
pipeline with human written429 seed task
. - Data generation model:
text-davinci-003
- License:
GPL-3.0
- Summary:The the first human-ChatGPT comparison corpus (English Version), named HC3 dataset
- Data generation model:
gpt-3.5
,human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- License:
CC BY-SA 4.0
- Summary:The the first human-ChatGPT comparison corpus (Chinese Version), named HC3 dataset
- Data generation model:
gpt-3.5
,human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- License:
CC BY-SA 4.0
- Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.
- Data generation model:
gpt-3.5
,human generated
- paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
- License:
CC BY 4.0
- Summary: A community effort to create a large collection of
1,616 diverse NLP tasks
and their natural language definitions/instructions. - Data generation model:
Human generated
- paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
- License:
Apache License 2.0
- Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 of languages & 16 NLP tasks.
- Data generation model: N/A
- paper: Crosslingual Generalization through Multitask Finetuning
- License:
Apache License 2.0
- Summary: A datset for Chain-of-Thoughts reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect and combine various instruction tuning datasets. Github Repo
- paper: N/A
- License:
Apache License 2.0
- Summary: gpt4all leverages three publicly available datasets: 1.laion/OIG, 2.pacovaldez/stackoverflow-questions 3. subset of bigscience/bloomz-p3
- Data generation model: N/A
- paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
- License:
MIT License
- Summary: A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer
- Data generation model:
GPT-4
- paper: N/A
- License:
MIT License
- Summary: The Flan Collection compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one place, formats them into a mix of zero-shot, few-shot and chain-of-thought templates
- Data generation model: N/A
- paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
- License:
Apache License 2.0
- Summary: UltraChat aims to construct an open-source, large-scale, and multi-round dialogue data. The first part of UltraChat (i.e., the Questions about the World sector) is released, which contains 280k diverse and informative dialogues. More dialogues about writing and creation, assistance on existing materials are to come.
- Data generation model:
GPT-3.5-turbo
- paper: N/A
- License:
CC BY-NC 4.0
- Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
- Data generation model:
GPT-3.5-turbo
- paper: N/A
- License:
Apache License 2.0
- Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
- Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
- Data generation model: N/A
- paper: N/A
- License: N/A
- Summary: 64K examples by prompting a language model with three seed examples of instructions and eliciting a fourth. Then the set is expanded to 240K by prompting the model to rephrase each instruction.
- Data generation model:
text-davinci-002
- paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
- License:
MIT License
- Summary: 52K instruction-following data generated by GPT-4 with the original Alpaca prompts & Alpaca prompts translated into Chinese by ChatGPT + 9K instruction-following data generated by GPT-4 with prompts in Unnatural Instruction.
- Data generation model:
GPT-4
- paper: Instruction Tuning with GPT-4
- License:
CC BY-NC 4.0
- Related:
- Summary: This datset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
- Data generation model: N/A
- paper: Free Dolly
- License:
CC BY-SA 3.0
- Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
- Data generation model: N/A
- paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
- License:
Apache License 2.0
- Summary: 90,000 conversations scraped via the ShareGPT API before it was shut down. These conversations include both user prompts and responses from OpenAI's ChatGPT.
- Data generation model:
GPT-4
,GPT-3.5
- paper: N/A
- License:
CC0 1.0 Universal
- Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
- Data generation model:
Anthropic RL-CAI 52B
- paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- License:
MIT License
- Related:
- Summary: Chinese safety prompts for evaluating and improving the safety of LLMs. This repository includes 100k Chinese security scene prompts and ChatGPT responses, covering various security scenarios and command attacks. It can be used for comprehensive evaluation and improvement of model security, as well as enhancing the model's knowledge of security, aligning model output with human values.
- Data generation model:
GPT-3.5
- paper: Safety Assessment of Chinese Large Language Models
- License:
Apache License 2.0
- Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
- Data generation model: N/A
- paper: A General Language Assistant as a Laboratory for Alignment
- License:
CC BY-SA 4.0
- Related:
- Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
- Data generation model: N/A
- paper: N/A
- License: N/A
- Summary: Ranked responses (Note: Data is evaluated by
GPT-4
model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" - Data generation model:
GPT-4
- paper: Instruction Tuning with GPT-4
- License:
CC BY-NC 4.0
- Related:
- summary: This dataset contains questions and answers from the subreddits r/explainlikeimfive, r/askhistorians and r/askscience.
- Data generation model: N/A
- paper: N/A
- License: N/A
- Related: eli5 dataset a transformation of the eli5 dataset in a format similar to stack-exchange-paired.
Note: While these licenses permit commercial use, they may have different requirements for attribution, distribution, or modification. Be sure to review the specific terms of each license before using it in a commercial project.
Commercial use licenses:
Apache License 2.0
MIT License
BSD 3-Clause License
BSD 2-Clause License
GNU Lesser General Public License v3.0 (LGPLv3)
GNU Affero General Public License v3.0 (AGPLv3)
Mozilla Public License 2.0 (MPL-2.0)
Eclipse Public License 2.0 (EPL-2.0)
Microsoft Public License (Ms-PL)
Creative Commons Attribution 4.0 International (CC BY 4.0)
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
zlib License
Boost Software License 1.0