A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)

tattrongvu/awesome-instruction-dataset

 
 

awesome-text/visual-instruction-tuning-dataset

A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca). We currently include three types of datasets:

  1. visual-instruction-tuning (e.g. image-instruction-answer)
  2. text-instruction-tuning datasets.
  3. red-teaming | Reinforcement Learning from Human Feedback (RLHF) Datasets

Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.

Lists of codebases to train your LLMs:

Size: The number of instruction tuning pairs

Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
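
The tags above are combined into a single label per entry, for example `(tatsu-lab/Alpaca)|52K|EN|MT|SI`. The sketch below shows one way such a label could be parsed and validated. It assumes the field order is project, size, lingual tag, task tag, generation method, which is inferred from that one example; the `DatasetTags` class and `parse_tags` function are hypothetical names and not part of this repo.

```python
from dataclasses import dataclass

# Tag vocabularies as defined in the lists above.
LINGUAL_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
GENERATION_TAGS = {"HG", "SI", "MIX", "COL"}


@dataclass
class DatasetTags:
    project: str     # "owner/project-name"
    size: str        # kept as text, e.g. "52K"
    lingual: str     # one of LINGUAL_TAGS
    task: str        # one of TASK_TAGS
    generation: str  # one of GENERATION_TAGS


def parse_tags(label: str) -> DatasetTags:
    """Parse a label such as '(tatsu-lab/Alpaca)|52K|EN|MT|SI'.

    Assumes exactly five '|'-separated fields in the order shown above.
    """
    parts = [p.strip() for p in label.split("|")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {label!r}")
    project, size, lingual, task, generation = parts
    if not (project.startswith("(") and project.endswith(")")):
        raise ValueError(f"project must be wrapped in parentheses: {project!r}")
    if lingual not in LINGUAL_TAGS:
        raise ValueError(f"unknown lingual tag: {lingual!r}")
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task tag: {task!r}")
    if generation not in GENERATION_TAGS:
        raise ValueError(f"unknown generation-method tag: {generation!r}")
    return DatasetTags(project[1:-1], size, lingual, task, generation)


tags = parse_tags("(tatsu-lab/Alpaca)|52K|EN|MT|SI")
print(tags)
```

The size is left as a string on purpose, since the label only gives a rounded figure such as `52K`.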

Table of Contents

  1. The template
  2. The Multi-modal Instruction Datasets
  3. The Instruction-following Datasets
  4. Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets
  5. License that Allows Commercial Use

The template

Append new projects at the end of the file.

## [({owner}/{project-name})|Tags](https://github.com/link/to/project)

- summary:
- Data generation model:
- paper:
- License:
- Related: (if applicable)
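
As an illustration, an entry in this format could be generated with a small helper like the one below. This is only a sketch: it assumes the heading is meant to be a Markdown link of the form `[({owner}/{project-name})|Tags](url)`, and the function name `render_entry`, its parameters, and the example owner and project are hypothetical.

```python
from typing import Optional


def render_entry(
    owner: str,
    project: str,
    tags: str,
    url: str,
    summary: str,
    model: str = "N/A",
    paper: str = "N/A",
    license_name: str = "N/A",
    related: Optional[str] = None,
) -> str:
    """Render one list entry following the template above."""
    lines = [
        # Heading: project and tags as link text, project URL as target.
        f"## [({owner}/{project})|{tags}]({url})",
        "",
        f"- summary: {summary}",
        f"- Data generation model: {model}",
        f"- paper: {paper}",
        f"- License: {license_name}",
    ]
    # "Related" is optional in the template, so emit it only when given.
    if related:
        lines.append(f"- Related: {related}")
    return "\n".join(lines) + "\n"


# Placeholder values only; this is not a real project.
print(
    render_entry(
        owner="example-owner",
        project="example-dataset",
        tags="52K|EN|MT|SI",
        url="https://github.com/link/to/project",
        summary="A placeholder summary.",
        license_name="Apache License 2.0",
    )
)
```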

The Multi-modal Instruction Datasets

  • Summary: LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models that approach GPT-4 vision/language capability.
  • Modality: Text, Image
  • Data generation model: GPT-4-0314
  • paper: Visual Instruction Tuning
  • License: CC BY-NC 4.0

The Instruction-following Datasets

  • Summary: 52K examples generated from a modified self-instruct pipeline with 175 human-written seed tasks.
  • Data generation model: text-davinci-003
  • paper: alpaca-blog
  • License: CC BY-NC 4.0

  • Summary: A project that manually cleaned the Alpaca 52K dataset.
  • Data generation model: text-davinci-003
  • paper: N/A
  • License: CC BY-NC 4.0

  • Summary: 52K examples generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • paper: N/A
  • License: The InstructWild dataset is intended for non-commercial research purposes only.

  • Summary: 52K instruction examples generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • License: GPL-3.0

  • Summary: A dataset for chain-of-thought reasoning based on LLaMA and Alpaca. Note: The repository continuously collects and combines various instruction tuning datasets. GitHub Repo
  • paper: N/A
  • License: Apache License 2.0

  • Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
  • Data generation model: GPT-4
  • paper: N/A
  • License: MIT License

  • Summary: UltraChat aims to construct an open-source, large-scale, multi-round dialogue dataset. The first part of UltraChat (i.e., the Questions about the World sector) is released and contains 280k diverse and informative dialogues. More dialogues about writing and creation and about assistance on existing materials are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • License: CC BY-NC 4.0

  • Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and a Chinese-translated version are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • License: Apache License 2.0
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI

  • Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
  • Data generation model: N/A
  • paper: N/A
  • License: N/A

  • Summary: This dataset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
  • Data generation model: N/A
  • paper: Free Dolly
  • License: CC BY-SA 3.0

  • Summary: 90,000 conversations scraped via the ShareGPT API before it was shut down. These conversations include both user prompts and responses from OpenAI's ChatGPT.
  • Data generation model: GPT-4, GPT-3.5
  • paper: N/A
  • License: CC0 1.0 Universal

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets

  • Summary: Chinese safety prompts for evaluating and improving the safety of LLMs. This repository includes 100k Chinese security scene prompts and ChatGPT responses, covering various security scenarios and command attacks. It can be used for comprehensive evaluation and improvement of model security, for enhancing the model's knowledge of security, and for aligning model output with human values.
  • Data generation model: GPT-3.5
  • paper: Safety Assessment of Chinese Large Language Models
  • License: Apache License 2.0

  • Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is collectively preferred by Reddit users over the other.
  • Data generation model: N/A
  • paper: N/A
  • License: N/A

  • Summary: Responses to Alpaca prompts from three models (GPT-4, GPT-3.5, and OPT-IML), ranked by asking GPT-4 to rate their quality. Note: The data is evaluated by GPT-4, not by humans. The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses".
  • Data generation model: GPT-4
  • paper: Instruction Tuning with GPT-4
  • License: CC BY-NC 4.0
  • Related:

License that Allows Commercial Use

Note: While these licenses permit commercial use, they may have different requirements for attribution, distribution, or modification. Be sure to review the specific terms of each license before using it in a commercial project.

Commercial use licenses:

  1. Apache License 2.0
  2. MIT License
  3. BSD 3-Clause License
  4. BSD 2-Clause License
  5. GNU Lesser General Public License v3.0 (LGPLv3)
  6. GNU Affero General Public License v3.0 (AGPLv3)
  7. Mozilla Public License 2.0 (MPL-2.0)
  8. Eclipse Public License 2.0 (EPL-2.0)
  9. Microsoft Public License (Ms-PL)
  10. Creative Commons Attribution 4.0 International (CC BY 4.0)
  11. Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
  12. zlib License
  13. Boost Software License 1.0
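
To screen the dataset entries against this list, a simple lookup is often enough. The sketch below is a hypothetical helper (`is_listed_commercial` is not part of this repo) that matches a license string against the thirteen names above, accepting either the long name or the abbreviation in parentheses. A `False` result only means the license is not on this list; it does not by itself establish that commercial use is forbidden.

```python
import re

# The thirteen licenses listed above, copied verbatim.
COMMERCIAL_USE_LICENSES = [
    "Apache License 2.0",
    "MIT License",
    "BSD 3-Clause License",
    "BSD 2-Clause License",
    "GNU Lesser General Public License v3.0 (LGPLv3)",
    "GNU Affero General Public License v3.0 (AGPLv3)",
    "Mozilla Public License 2.0 (MPL-2.0)",
    "Eclipse Public License 2.0 (EPL-2.0)",
    "Microsoft Public License (Ms-PL)",
    "Creative Commons Attribution 4.0 International (CC BY 4.0)",
    "Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)",
    "zlib License",
    "Boost Software License 1.0",
]


def _normalize(name: str) -> str:
    """Lower-case and collapse whitespace so comparisons are forgiving."""
    return " ".join(name.lower().split())


def _aliases(item: str) -> list[str]:
    """Split 'Long Name (ABBR)' into both forms; otherwise return the item."""
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", item)
    return [match.group(1), match.group(2)] if match else [item]


_KNOWN = {_normalize(a) for item in COMMERCIAL_USE_LICENSES for a in _aliases(item)}


def is_listed_commercial(license_name: str) -> bool:
    """Return True if the license appears in the list above (exact name match)."""
    return _normalize(license_name) in _KNOWN


for name in ["Apache License 2.0", "MIT License", "CC BY-NC 4.0", "GPL-3.0"]:
    print(f"{name}: {is_listed_commercial(name)}")
```

Matching is deliberately exact after normalization, so a different version of a listed license (for example CC BY-SA 3.0 versus CC BY-SA 4.0) is reported as not listed and should be reviewed by hand.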
