docs: document_dataset PDFs & OCR (#7812)

ethanknights · web-flow · commit fb445ff7979b · 2025-10-20T16:03:52.000+02:00
* Update document_dataset.mdx

* Update document_dataset.mdx OCR
diff --git a/docs/source/document_dataset.mdx b/docs/source/document_dataset.mdx
@@ -1,13 +1,13 @@
 # Create a document dataset
 
-This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.
 
 > [!TIP]
 > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
 
 ## PdfFolder
 
-The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.
 
 > [!TIP]
 > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,32 +72,32 @@ file_name,additional_feature
 or using `metadata.jsonl`:
 
 ```jsonl
-{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
-{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
-{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
+{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
+{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
+{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
 ```
 
 Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.
 
-It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
+It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are PDFs:
 
 ```jsonl
 {"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
 {"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
 {"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
 ```
 
-You can also define lists of pdfs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
+You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
 
 ```jsonl
 {"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
 {"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
 {"pdfs_file_names": ["0003_part1.pdf", "0003_part2.pdf"], "label": "normal"}
 ```
 
-### OCR (Optical character recognition)
+### OCR (Optical Character Recognition)
 
-OCR datasets have the text contained in a pdf. An example `metadata.csv` may look like:
+OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:
 
 ```csv
 file_name,text
@@ -106,7 +106,7 @@ file_name,text
 0003.pdf,Attention is all you need. Abstract. The ...
 ```
 
-Load the dataset with `PdfFolder`, and it will create a `text` column for the pdf captions:
+Load the dataset with `PdfFolder`, and it will create a `text` column for the PDF captions:
 
 ```py
 >>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")