To enable more open-source research on instruction-following large language models, we generate 52K instruction-following demonstrations using OpenAI's text-davinci-003 model.
- Rohan Taori
- Ishaan Gulrajani
- Tianyi Zhang
- Yann Dubois
- Xuechen Li
- Carlos Guestrin
- Percy Liang
- Tatsunori B. Hashimoto
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The instruction-following demonstrations are bootstrapped from the seed set released by the self-instruct project. Given that the dataset is generated, it is difficult to pinpoint who or what the instances represent.
In total, there are 52,002 instances in the dataset.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
not applicable.
- instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
- input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- output: str, the answer to the instruction as generated by text-davinci-003.
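The three fields above can be sketched as a small typed record with a validity check. This is only an illustration: the names AlpacaRecord and is_valid_record are hypothetical, the example record is made up, and representing a missing input as an empty string is an assumption rather than something stated here.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    """One instance, using the three fields described above."""
    instruction: str  # task the model should perform
    input: str        # optional context; ASSUMED to be "" when absent
    output: str       # answer generated by text-davinci-003


REQUIRED_FIELDS = ("instruction", "input", "output")


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the three string fields
    and a non-empty instruction."""
    if set(record) != set(REQUIRED_FIELDS):
        return False
    if not all(isinstance(record[name], str) for name in REQUIRED_FIELDS):
        return False
    return record["instruction"].strip() != ""


# Hypothetical record for illustration only; it is not taken from the dataset.
example: AlpacaRecord = {
    "instruction": "Summarize the following article",
    "input": "A short article body would go here.",
    "output": "A one-sentence summary would go here.",
}
```

A check like this only confirms the shape of a record; it cannot tell whether an instruction is sensible or whether the output is a good response.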
no.
Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?
not applicable.
The finetuning target is the response generated by text-davinci-003.
The Alpaca models (both the demo model and the ones that will be released) are trained on all 52K examples. There is no recommended data split for the dataset.
All 52K instructions are unique. However, some generated instructions may not be sensible; that is, there may not exist any good response to the instruction.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
the dataset is self-contained.
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
no.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
The generated data may contain a few inappropriate responses. In our preliminary testing, we have not encountered any offensive responses.
The GitHub repository contains the code used to generate the dataset.
The dataset is used to train the Alpaca models, both those used for the demo and those released.
Please see https://github.com/tatsu-lab/stanford_alpaca
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is generated using OpenAI's API. Therefore, it cannot be used for commercial purposes that compete with OpenAI.
The dataset should not be used for commercial purposes that compete with OpenAI.
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
The dataset can be freely downloaded.
The dataset can be downloaded from the GitHub repository as a JSON file.
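Once the JSON file has been downloaded, a minimal sketch for loading it and checking the composition figures quoted earlier might look like the following. It assumes the file is a single top-level JSON array of objects with instruction, input, and output keys; the function names load_records and summarize are hypothetical, and the file path is left as a parameter because no file name is given here.

```python
import json
from pathlib import Path


def load_records(path):
    """Load the dataset, ASSUMED to be one JSON array of objects."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    return records


def summarize(records):
    """Count instances, unique instructions, and the share of
    records with a non-empty input field."""
    total = len(records)
    with_input = sum(1 for r in records if r.get("input", "").strip())
    unique_instructions = len({r["instruction"] for r in records})
    return {
        "total": total,
        "unique_instructions": unique_instructions,
        "fraction_with_input": with_input / total if total else 0.0,
    }
```

Calling summarize(load_records(path)) on the downloaded file gives a quick way to compare the instance count, the number of distinct instructions, and the share of examples with an input against the figures reported in this datasheet.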
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
This dataset is distributed under the ODC-By license.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
no
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
no
The dataset is hosted on GitHub, and the GitHub repository is maintained by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.
Please open an issue in the GitHub repository.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
We do not plan to update the dataset.