We explore the ability of two LLMs, GPT-4o and Claude Sonnet 3.5, to transcribe historical handwritten documents in a tabular format and compare their performance to traditional OCR/HTR systems: EasyOCR, Keras OCR, Pytesseract, and TrOCR.

jbaudru/llm-ocr-htr-historical-text-recognition

📜 Archival Transcription with LLMs: Can LLMs facilitate the OCR/HTR tasks for historical records? 📚

MIT License Python Jupyter Notebook OpenAI Claude Keras

This repository accompanies our study on using Large Language Models (LLMs) such as GPT-4o and Claude Sonnet 3.5 to transcribe historical handwritten French documents in a tabular format. We compare their performance against traditional OCR/HTR systems: EasyOCR, Keras OCR, Pytesseract, and TrOCR.

📜 Paper

https://arxiv.org/abs/2501.11623

🔍 Key Highlights

  • Models Tested: GPT-4o, Claude Sonnet 3.5, EasyOCR, Keras OCR, Pytesseract, TrOCR.
  • Metrics: Character Error Rate (CER) and Bilingual Evaluation Understudy (BLEU).
  • Findings:
    • GPT-4o is best for line-by-line transcription.
    • Claude Sonnet 3.5 excels at whole-scan transcription.
    • In both line-by-line and whole-scan experiments, the two-shot approach yielded the best output.
    • BLEU captures transcription quality better than CER.
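
To make the two metrics above concrete, here is a minimal, dependency-free sketch of how CER and a sentence-level BLEU score can be computed for one transcribed line. It is not the evaluation code used in the study: the function names (`edit_distance`, `cer`, `bleu`), the whitespace tokenization, the unsmoothed 4-gram BLEU, and the sample strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(ref: str, hyp: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU, one reference, uniform weights."""
    ref_tok, hyp_tok = ref.split(), hyp.split()
    if not hyp_tok:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp_tok, n)
        ref_counts = ngrams(ref_tok, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty n-gram order gives 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp_tok) >= len(ref_tok):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tok) / len(hyp_tok))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical ground truth and model output for one table row.
    reference = "Dupont Jean cultivateur 45 ans"
    prediction = "Dupond Jean cultivateur 45 ans"
    print(f"CER:  {cer(reference, prediction):.4f}")
    print(f"BLEU: {bleu(reference, prediction):.4f}")
```

In this sketch a single wrong character barely moves CER, while BLEU drops more sharply because the affected word breaks every n-gram that contains it. The two metrics therefore reward different things, which is worth keeping in mind when comparing them.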

🚀 Quick Start

  1. Clone the repo:

    git clone https://github.com/jbaudru/llm-ocr-htr-historical-text-recognition
    cd llm-ocr-htr-historical-text-recognition
    
  2. Install dependencies:

    cd main/
    pip install -r requirements.txt

  3. Run the experiments:

    • For the whole-scan experiments, run the scripts.
    • For the line-by-line experiments, open the notebook.

📁 Data

The repository includes:

  • Historical document scans (images) and transcriptions.
  • Ground truth transcriptions for evaluation.
  • Sample model predictions.

📜 License

This project is licensed under the MIT License (https://opensource.org/license/mit).
