docs/source/document_dataset.mdx (10 additions, 10 deletions)
@@ -1,13 +1,13 @@
 # Create a document dataset

-This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

 > [!TIP]
 > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

 ## PdfFolder

-The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

 > [!TIP]
 > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
@@ -72,32 +72,32 @@ file_name,additional_feature
 or using `metadata.jsonl`:

 ```jsonl
-{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your pdfs"}
-{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your pdfs"}
-{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your pdfs"}
+{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
+{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
+{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
 ```

 Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.

-It's possible to point to more than one pdf in each row in your dataset, for example if both your input and output are pdfs:
+It's possible to point to more than one PDF in each row in your dataset, for example if both your input and output are pdfs:
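The `file_name` rule quoted in the second hunk (a path relative to the directory that contains the metadata file) can be illustrated with a small standard-library sketch. This is not part of the commit or of the `datasets` library: the helper names `build_metadata_rows` and `write_metadata` and the placeholder text stored in `additional_feature` are assumptions made for illustration only.

```python
import json
from pathlib import Path


def build_metadata_rows(root: Path) -> list[dict]:
    """Return one metadata row per PDF found under `root` (hypothetical helper).

    `file_name` is the path of the PDF relative to `root`, the directory
    that will hold metadata.jsonl, written with forward slashes.
    """
    rows = []
    for pdf_path in sorted(root.rglob("*.pdf")):
        rows.append(
            {
                "file_name": pdf_path.relative_to(root).as_posix(),
                # Placeholder value; replace with your own text feature.
                "additional_feature": f"Placeholder text for {pdf_path.name}",
            }
        )
    return rows


def write_metadata(root: Path) -> Path:
    """Write the rows as JSON Lines to `root/metadata.jsonl` and return its path."""
    metadata_path = root / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for row in build_metadata_rows(root):
            f.write(json.dumps(row) + "\n")
    return metadata_path
```

Calling `write_metadata(Path("path/to/folder"))` on a folder of PDFs would produce a `metadata.jsonl` in the same shape as the example in the diff, with nested files recorded as paths such as `sub/0002.pdf`.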