From adccbf70129fcf6dec7338b97891d9ea619a3cfe Mon Sep 17 00:00:00 2001
From: Ethan Knights
Date: Fri, 10 Oct 2025 00:18:04 +0100
Subject: [PATCH 1/2] Update document_dataset.mdx

---
 docs/source/document_dataset.mdx | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/source/document_dataset.mdx b/docs/source/document_dataset.mdx
index 30cc1bd3121..ced8026af21 100644
--- a/docs/source/document_dataset.mdx
+++ b/docs/source/document_dataset.mdx
@@ -1,13 +1,13 @@
 # Create a document dataset
 
-This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.
 
 > [!TIP]
 > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
 
 ## PdfFolder
 
-The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.
 
 > [!TIP]
 > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,14 +72,14 @@ file_name,additional_feature
 or using `metadata.jsonl`:
 
 ```jsonl
-{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
-{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
-{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
+{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
+{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
+{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
 ```
 
 Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.
 
-It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
+It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are pdfs:
 
 ```jsonl
 {"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
@@ -87,7 +87,7 @@ It's possible to point to more than one pdf in each row in your dataset, for exa
 {"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
 ```
 
-You can also define lists of pdfs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
+You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:
 
 ```jsonl
 {"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
@@ -97,7 +97,7 @@ You can also define lists of pdfs. In that case you need to name the field `file
 
 ### OCR (Optical character recognition)
 
-OCR datasets have the text contained in a pdf. An example `metadata.csv` may look like:
+OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:
 
 ```csv
 file_name,text
@@ -106,7 +106,7 @@ file_name,text
 0003.pdf,Attention is all you need. Abstract. The ...
 ```
 
-Load the dataset with `PdfFolder`, and it will create a `text` column for the pdf captions:
+Load the dataset with `PdfFolder`, and it will create a `text` column for the PDF captions:
 
 ```py
 >>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")

From c98ad8397177f0a5640d0dac6a0c5222f06c25d6 Mon Sep 17 00:00:00 2001
From: Ethan Knights
Date: Fri, 10 Oct 2025 00:19:24 +0100
Subject: [PATCH 2/2] Update document_dataset.mdx OCR

---
 docs/source/document_dataset.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/document_dataset.mdx b/docs/source/document_dataset.mdx
index ced8026af21..bc2a8a229ef 100644
--- a/docs/source/document_dataset.mdx
+++ b/docs/source/document_dataset.mdx
@@ -95,7 +95,7 @@ You can also define lists of PDFs. In that case you need to name the field `file
 {"pdfs_file_names": ["0003_part1.pdf", "0002_part2.pdf"], "label": "normal"}
 ```
 
-### OCR (Optical character recognition)
+### OCR (Optical Character Recognition)
 
 OCR datasets have the text contained in a PDF. An example `metadata.csv` may look like:
 