This repository accompanies our study on using Large Language Models (LLMs) like GPT-4o and Claude Sonnet 3.5 to transcribe historical handwritten documents (French) in a tabular format. We compare their performance against traditional OCR/HTR systems: EasyOCR, Keras OCR, Pytesseract, and TrOCR.
https://arxiv.org/abs/2501.11623
- Models Tested: GPT-4o, Claude Sonnet 3.5, EasyOCR, Keras OCR, Pytesseract, TrOCR.
- Metrics: Character Error Rate (CER) and Bilingual Evaluation Understudy (BLEU).
- Findings:
- GPT-4o is best for line-by-line transcription.
- Claude Sonnet 3.5 excels at whole-scan transcription.
- In both line-by-line and whole-scan experiments, the two-shot approach yielded the best output.
- BLEU better captures transcription quality compared to CER.
-
Clone the repo:
git clone https://github.com/jbaudru/llm-ocr-htr-historical-text-recognition cd llm-ocr-htr-historical-text-recognition
-
Install dependencies:
cd main/ pip install -r requirements.txt
-
Run
- For the whole-scan experiments: run the scripts.
- For the line-by-line experiments, consult the notebook.
The repository includes:
- Historical document scans (images) and transcriptions.
- Ground truth transcriptions for evaluation.
- Sample model predictions.
This project is licensed under the MIT License. (https://opensource.org/license/mit)