Commit fb445ff

docs: document_dataset PDFs & OCR (#7812)
* Update document_dataset.mdx
* Update document_dataset.mdx OCR
1 parent 74c7154 commit fb445ff

1 file changed (+10, -10 lines changed)

docs/source/document_dataset.mdx

Lines changed: 10 additions & 10 deletions
@@ -1,13 +1,13 @@
 # Create a document dataset

-This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

 > [!TIP]
 > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

 ## PdfFolder

-The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

 > [!TIP]
 > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,32 +72,32 @@ file_name,additional_feature
 or using `metadata.jsonl`:

 ```jsonl
-{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
-{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
-{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
+{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
+{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
+{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
 ```

 Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.

-It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
+It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are pdfs:

 ```jsonl
 {"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
 {"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
 {"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
 ```

-You can also define lists of pdfs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
+You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

 ```jsonl
 {"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
 {"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
 {"pdfs_file_names": ["0003_part1.pdf", "0002_part2.pdf"], "label": "normal"}
 ```

-### OCR (Optical character recognition)
+### OCR (Optical Character Recognition)

-OCR datasets have the text contained in a pdf. An example `metadata.csv` may look like:
+OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:

 ```csv
 file_name,text
@@ -106,7 +106,7 @@ file_name,text
 0003.pdf,Attention is all you need. Abstract. The ...
 ```

-Load the dataset with `PdfFolder`, and it will create a `text` column for the pdf captions:
+Load the dataset with `PdfFolder`, and it will create a `text` column for the PDF captions:

 ```py
 >>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
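
The `file_name` rule quoted in the diff context (a path relative to the directory that holds the metadata file) can be sketched with the standard library alone. This is a minimal illustration and not part of the commit: `write_metadata_jsonl` and the `texts` mapping are hypothetical names, and writing an empty string for PDFs that have no OCR text is an assumption made for the example.

```python
import json
from pathlib import Path
from typing import Dict


def write_metadata_jsonl(root: Path, texts: Dict[str, str]) -> Path:
    """Write `metadata.jsonl` into `root`, one row per PDF found under it.

    `texts` (hypothetical) maps a PDF path relative to `root` to its OCR text.
    PDFs missing from `texts` get an empty string (an assumption of this sketch).
    """
    out = root / "metadata.jsonl"
    with out.open("w", encoding="utf-8") as f:
        # Sort for a stable row order; rglob also picks up PDFs in subfolders.
        for pdf in sorted(root.rglob("*.pdf")):
            # file_name is relative to the directory containing the metadata file.
            rel = pdf.relative_to(root).as_posix()
            row = {"file_name": rel, "text": texts.get(rel, "")}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out
```

The folder written this way could then be passed as `data_dir` to the `load_dataset("pdffolder", ...)` call shown in the last hunk.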
