Upload multiple directories as separate documents to a Transkribus collection using its REST API

cconzen/TranskribusBatchUpload

Transkribus Document Uploader

This Python script automates uploading multiple documents, each consisting of image files and PAGE XML files, to a Transkribus collection through the Transkribus REST API. It is tailored to bulk-uploading the output of the Loghi pipeline.

Requirements

  • Python 3.11
  • requests library

You can install requests using pip:

pip install requests

Setup

Clone this repository:

git clone https://github.com/cconzen/TranskribusBatchUpload.git

Navigate to the project directory:

cd TranskribusBatchUpload

Edit main.py to include your Transkribus account details and the ID of the destination collection:

collection_id = "YOUR_COLLECTION_ID"
username = "YOUR_EMAIL"
password = "YOUR_PASSWORD"

Set the base directory for processing documents:

base_dir = 'PATH/TO/DIRECTORY'
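
If you would rather not store your password in main.py, one option is to read the same four values from environment variables. The following is only a sketch: the function load_settings and the variable names TRANSKRIBUS_COLLECTION_ID, TRANSKRIBUS_USER, TRANSKRIBUS_PASSWORD and TRANSKRIBUS_BASE_DIR are hypothetical and are not used by the script as shipped, so main.py would need a small change to call it.

```python
import os

# Hypothetical environment variable names; main.py as shipped does not read them.
ENV_NAMES = {
    "collection_id": "TRANSKRIBUS_COLLECTION_ID",
    "username": "TRANSKRIBUS_USER",
    "password": "TRANSKRIBUS_PASSWORD",
    "base_dir": "TRANSKRIBUS_BASE_DIR",
}


def load_settings(env=os.environ):
    """Return the four settings as a dict, or fail early if any is missing."""
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in ENV_NAMES.items()}
```

The returned dictionary could then replace the hard-coded assignments, for example collection_id = load_settings()["collection_id"].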

Usage

Run the script by executing the following command:

python main.py

The script will:

  • Log in to your Transkribus account.
  • Process the specified directory and its subdirectories to find image files and their corresponding XML files.
  • Create a new job for each directory, uploading it as a document in your Transkribus collection.
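
The directory-scanning step can be pictured roughly as follows. This is a minimal sketch under the layout described in this README (images directly in each document directory, XML files in a page subdirectory with matching base names); the function name find_documents is an assumption for illustration and is not necessarily how main.py does it. The login and upload steps are left out because they depend on the Transkribus REST API.

```python
from pathlib import Path


def find_documents(base_dir):
    """Map each document directory name to a list of (image, xml) path pairs.

    The xml entry is None when no matching file exists in the page/ subdirectory.
    """
    documents = {}
    for doc_dir in sorted(p for p in Path(base_dir).iterdir() if p.is_dir()):
        pairs = []
        # Pages are ordered by file name, as described in the Notes section.
        for image in sorted(doc_dir.glob("*.jpg")):
            xml = doc_dir / "page" / f"{image.stem}.xml"
            pairs.append((image, xml if xml.is_file() else None))
        if pairs:  # skip directories that contain no .jpg images
            documents[doc_dir.name] = pairs
    return documents
```

Calling find_documents(base_dir) before uploading is also a quick way to spot images that have no XML counterpart.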

Directory Structure

The directory you process should follow this structure:

base_dir/
│
└───document_name/
    │
    ├───page001.jpg
    ├───page002.jpg
    └───page/
        ├───page001.xml
        └───page002.xml

  • document_name: The directory name is used as the document name in Transkribus.
  • page001.jpg, page002.jpg: Image files, one per page of the document.
  • page/page001.xml, page/page002.xml: PAGE XML files, one per image.

Notes

  • Ensure that your images are in .jpg format and that each XML file has the same base name as its image (for example, page001.jpg and page/page001.xml).
  • The XML files must be located in a subdirectory named page inside the directory that contains the images.
  • Pages are sorted by file name. Use a naming convention that sorts correctly, such as zero-padded page numbers (page001, page002), or adjust the sorting in the code to match your file names.
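
If your file names are not zero-padded, plain name sorting places page10.jpg before page2.jpg. One way to adjust the ordering is a natural sort key that compares the digit runs as numbers. This is a sketch only; natural_key is a hypothetical helper and is not part of the script.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that 2 sorts before 10."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indices always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["page10.jpg", "page2.jpg", "page1.jpg"]
print(sorted(names))                   # ['page1.jpg', 'page10.jpg', 'page2.jpg']
print(sorted(names, key=natural_key))  # ['page1.jpg', 'page2.jpg', 'page10.jpg']
```

Passing key=natural_key wherever the page list is sorted would give the numeric order.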

License

This project is licensed under the MIT License.
