# Create a document dataset

This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## PdfFolder

The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

> [!TIP]
> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.

To add metadata, include a `metadata.csv` file next to your PDFs (for example with a `file_name,additional_feature` header and one row per PDF) or a `metadata.jsonl` file:

```jsonl
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
```
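
If it helps to see these pieces together, here is a minimal sketch of loading such a folder; `/path/to/folder` is a placeholder directory assumed to contain the three PDFs and the `metadata.jsonl` shown above:

```py
>>> from datasets import load_dataset

>>> # "/path/to/folder" is a placeholder; it is assumed to contain 0001.pdf, 0002.pdf,
>>> # 0003.pdf and the metadata.jsonl shown above
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["additional_feature"]
'This is a first value of a text feature you added to your PDFs'
```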

Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.

It's possible to point to more than one PDF in each row of your dataset, for example if both your input and output are PDFs:

```jsonl
{"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
{"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
{"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
```
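
The exact column names these `*_file_name` fields produce after loading are easiest to confirm by inspecting the dataset itself; a short sketch, under the same placeholder-folder assumption as above:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
>>> # Expect one document column per *_file_name field; inspect the names rather than hard-coding them
>>> dataset.column_names
```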

You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

```jsonl
{"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0003_part1.pdf", "0002_part2.pdf"], "label": "normal"}
```
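
Once loaded, metadata fields such as `label` behave like any other dataset column, so you can filter or sort on them; another sketch, again assuming the placeholder folder:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
>>> urgent = dataset.filter(lambda example: example["label"] == "urgent")
>>> urgent.num_rows  # 2 of the 3 example rows above are labelled "urgent"
2
```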

### OCR (Optical Character Recognition)

OCR datasets pair each PDF with the text it contains. An example `metadata.csv` may look like:

```csv
file_name,text
0003.pdf,Attention is all you need. Abstract. The ...
```

Load the dataset with `PdfFolder`, and it will create a `text` column containing the text of each PDF:

```py
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
```
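
Continuing from the loaded dataset, you can read the text for each example straight from the `text` column; the `pdf` column name below is an assumption (by analogy with `ImageFolder`'s `image` column), so check `dataset.column_names` if it differs:

```py
>>> dataset.column_names  # likely something like ['pdf', 'text']; the 'pdf' name is an assumption
>>> dataset[0]["text"]    # the text you provided in metadata.csv
```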