📜 Archival Transcription with LLMs: Can LLMs facilitate the OCR/HTR tasks for historical records? 📚

This repository accompanies our study on using Large Language Models (LLMs) such as GPT-4o and Claude Sonnet 3.5 to transcribe historical handwritten French documents in a tabular format. We compare their performance against traditional OCR/HTR systems: EasyOCR, Keras OCR, Pytesseract, and TrOCR.

📜 Paper

https://arxiv.org/abs/2501.11623

🔍 Key Highlights

Models Tested: GPT-4o, Claude Sonnet 3.5, EasyOCR, Keras OCR, Pytesseract, TrOCR.
Metrics: Character Error Rate (CER) and Bilingual Evaluation Understudy (BLEU).
Findings:
- GPT-4o is best for line-by-line transcription.
- Claude Sonnet 3.5 excels at whole-scan transcription.
- In both line-by-line and whole-scan experiments, the two-shot approach yielded the best output.
- BLEU captures transcription quality better than CER.
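
For readers unfamiliar with the two metrics, the following is a minimal, dependency-free sketch of how CER and a sentence-level BLEU can be computed. It is not the evaluation code used in this repository or the paper. The function names (`levenshtein`, `cer`, `bleu`), the whitespace tokenization, the single-reference setup, and the absence of smoothing are all assumptions made for illustration, and the sample strings are made up.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: one reference, uniform weights, no smoothing
    (illustrative assumptions, not necessarily the paper's configuration)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty n-gram match gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Made-up example strings, not taken from the dataset.
    truth = "Jean Dupont né le 3 mars 1923"
    guess = "Jean Dupond né le 3 mars 1923"
    print(f"CER:  {cer(truth, guess):.3f}")
    print(f"BLEU: {bleu(truth, guess):.3f}")
```

Lower CER is better, while higher BLEU is better. Because this BLEU variant works on whitespace-separated tokens, a single wrong character invalidates every n-gram containing that word, which is one reason the two metrics can rank transcriptions differently.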

🚀 Quick Start

Clone the repo:

git clone https://github.com/jbaudru/llm-ocr-htr-historical-text-recognition
cd llm-ocr-htr-historical-text-recognition

Install dependencies:

cd main/
pip install -r requirements.txt

Run

For the whole-scan experiments, run the scripts.
For the line-by-line experiments, consult the notebook.

📁 Data

The repository includes:

Historical document scans (images) and transcriptions.
Ground truth transcriptions for evaluation.
Sample model predictions.

📜 License

This project is licensed under the MIT License. (https://opensource.org/license/mit)

