This repository contains the Unnatural Instructions dataset, a collection of instructions automatically generated by a large language model. For full details, see the paper "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor".
The data folder contains two files: core_data.jsonl, containing the Unnatural Instructions core dataset of 68,478 instruction-input-output triplets, and full_data.jsonl, containing the full 240,670 Unnatural Instructions examples. The full data was constructed by expanding the core data with automatically generated instruction paraphrases.
Each line in core_data.jsonl is a JSON object with two fields: instruction, which is a natural language instruction describing a task, and instances, an array of JSON objects, each containing:
input: An input for the task described by the instruction
instruction_with_input: The instruction concatenated with the input
constraints: The task's output space constraints
output: The output of executing instruction with the given input
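As a rough illustration of the record layout described above, the following sketch reads a core-format JSONL file and flattens it into one dictionary per instruction-instance pair. The function name iter_core_examples and the choice to skip blank lines are assumptions made for this example; only the field names come from the description above.

```python
import json
from typing import Iterator


def iter_core_examples(path: str) -> Iterator[dict]:
    """Yield one flat dict per (instruction, instance) pair.

    Hypothetical helper: assumes each non-empty line is a JSON object
    with the fields `instruction` and `instances` described above.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines (an assumption, not a documented property)
            record = json.loads(line)
            for instance in record["instances"]:
                yield {
                    "instruction": record["instruction"],
                    "input": instance["input"],
                    "instruction_with_input": instance["instruction_with_input"],
                    "constraints": instance["constraints"],
                    "output": instance["output"],
                }
```

A call such as `for example in iter_core_examples("data/core_data.jsonl"): ...` would then stream examples without loading the whole file into memory; the path shown here assumes the data folder sits in the current working directory.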
full_data.jsonl has the same structure as core_data.jsonl, but with one additional field: reformulations. reformulations is an array of JSON objects, each corresponding to an automatically generated paraphrase of the given instruction. Each reformulation contains the fields:
instruction: A paraphrase of the original instruction
input: An input for the task described by the instruction
instruction_with_input: The paraphrased instruction concatenated with the input
output: The output of executing instruction with the given input
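The sketch below shows one way to turn the file that carries reformulations (full_data.jsonl, per the file list above) into prompt-output pairs covering both the original instances and their paraphrases. It assumes that reformulations sits at the top level of each record, alongside instances; the description above does not state this explicitly. The function name iter_full_pairs is hypothetical.

```python
import json
from typing import Iterator, Tuple


def iter_full_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (instruction_with_input, output) pairs.

    Hypothetical helper: assumes `reformulations` is a top-level field
    of each record, next to `instances`. Records without it are still
    accepted, so the same function also works on core-format files.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Original instruction applied to each of its instances.
            for instance in record["instances"]:
                yield instance["instruction_with_input"], instance["output"]
            # Automatically generated paraphrases of the instruction.
            for reformulation in record.get("reformulations", []):
                yield reformulation["instruction_with_input"], reformulation["output"]
```

Keeping originals and paraphrases in a single stream is a design choice for this example; a caller that needs to tell them apart could yield a third element marking the source.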
If you make use of Unnatural Instructions, please cite the following paper:
@misc{honovich2022unnatural,
title = {Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor},
author = {Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo},
url = {https://arxiv.org/abs/2212.09689},
publisher = {arXiv},
year = {2022}
}