SCUT-DLVCLab/HisDoc1B

HisDoc1B Dataset

The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset of Chinese historical documents, exceeding existing datasets by more than 200 times in both document image count and character count (see Table 1). It is also the only dataset with complete book-level annotations and punctuation annotations.

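| Dataset | #Books | #Document images | #Characters | #Character categories | Text punctuation |
|---|---|---|---|---|---|
| MTHv1[1] | - | 1,500 | 521,370 | 4,058 | × |
| MTHv2[2] | - | 3,199 | 1,081,678 | 6,733 | × |
| IC19 HDRC[3] | - | <ins>11,715</ins> | 2,482,994 | 8,353 | × |
| M5HisDoc[4] | - | 8,000 | <ins>4,367,360</ins> | <ins>16,151</ins> | × |
| CASIA-AHCDB[5] | - | - | 2,276,740 | 10,350 | × |
| HisDoc1B (Ours) | **40,281** | **3,163,330** (270×) | **1,082,544,808** (248×) | **30,615** (1.9×) | ✓ |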

Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
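
The multipliers in the HisDoc1B row appear to be ratios against the second-highest value in each column. This is an inference from the numbers rather than something the table states. A minimal Python sketch that reproduces them from the figures in Table 1 (the dictionary keys are illustrative names, not identifiers from the dataset):

```python
# Values copied from Table 1. The "runner_up" entries are the
# second-highest value in each column (an assumption about how the
# multipliers were derived).
hisdoc1b = {
    "document_images": 3_163_330,
    "characters": 1_082_544_808,
    "character_categories": 30_615,
}
runner_up = {
    "document_images": 11_715,       # IC19 HDRC
    "characters": 4_367_360,         # M5HisDoc
    "character_categories": 16_151,  # M5HisDoc
}

# Ratio of HisDoc1B to the runner-up in each column.
ratios = {key: hisdoc1b[key] / runner_up[key] for key in hisdoc1b}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
# document_images: 270.0x
# characters: 247.9x
# character_categories: 1.9x
```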

Download

OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq

Directory Format

The dataset is organized in the following directory format:

HisDoc1B
├── books
│   ├── xxx.pdf/.djvu
│   └── ...
├── annos
│   ├── xxx.json
│   └── ...
├── readme.md
├── book2im.py
└── read_anno.py
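
The layout above suggests that each book file in `books` has an annotation file in `annos`. A minimal Python sketch for pairing them follows. It assumes a book and its annotation share the same file stem (`xxx.pdf` or `xxx.djvu` with `xxx.json`), which the tree implies but does not confirm. The function name `pair_books_with_annos` is hypothetical and is not part of the bundled `book2im.py` or `read_anno.py` scripts. The JSON schema is not described here, so the sketch only locates files and does not parse them.

```python
from pathlib import Path

# Book formats listed in the directory tree.
BOOK_SUFFIXES = {".pdf", ".djvu"}


def pair_books_with_annos(root):
    """Return (book_path, anno_path_or_None) pairs for a HisDoc1B root.

    Assumes annotations are matched to books by file stem; this is a
    guess based on the directory layout, not a documented rule.
    """
    root = Path(root)
    # Index annotation files by stem for constant-time lookup.
    annos = {p.stem: p for p in (root / "annos").glob("*.json")}
    pairs = []
    for book in sorted((root / "books").iterdir()):
        if book.suffix.lower() in BOOK_SUFFIXES:
            # None marks a book with no matching annotation file.
            pairs.append((book, annos.get(book.stem)))
    return pairs
```

Calling `pair_books_with_annos("HisDoc1B")` on an extracted copy returns one tuple per book; entries whose second element is `None` would indicate a book without a matching annotation under this naming assumption.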

Inference code used to generate the dataset

Contact

For any questions about the dataset, please contact the authors by sending an email to [email protected].
