This repo contains a CLI application that extracts text from PDF and HTML documents and translates the extracted text into a target language (English by default) if the source language differs from it.
HTML Text Extraction:
- HTML webpages are processed by making a request to the webpage and extracting text from the HTML content using a combination of the `news-please` and `readability` Python packages.
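For illustration, a minimal sketch of how these two packages might be combined; the fallback ordering and the `extract_html_text` helper are assumptions, not the repo's exact code:

```python
# Sketch: prefer news-please's article extraction, fall back to readability.
import requests
from newsplease import NewsPlease
from readability import Document


def extract_html_text(url: str) -> str:
    """Extract the main text of a webpage, preferring news-please."""
    article = NewsPlease.from_url(url)
    if article is not None and article.maintext:
        return article.maintext

    # Fall back to readability, which returns the main content as
    # simplified HTML rather than plain text.
    response = requests.get(url, timeout=30)
    return Document(response.text).summary()
```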
PDF Text Extraction:
- PDF documents are processed by downloading the PDF from the CDN (a Content Delivery Network accessible via an endpoint) and using the Azure Form Recognizer API to extract text from the PDF.
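For illustration, a minimal sketch of this step using the `azure-ai-formrecognizer` package; the environment variable names and the `prebuilt-read` model choice are assumptions rather than the repo's configuration:

```python
# Sketch: run Azure Form Recognizer over a downloaded PDF.
# The env var names and "prebuilt-read" model are assumptions.
import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract the full text content of a PDF via Azure Form Recognizer."""
    client = DocumentAnalysisClient(
        endpoint=os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_FORM_RECOGNIZER_KEY"]),
    )
    poller = client.begin_analyze_document("prebuilt-read", document=pdf_bytes)
    return poller.result().content
```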
Translation:
- Text is translated using the Google Cloud Translation API.
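For illustration, a minimal sketch using the `google-cloud-translate` client (v2 API); the `translate_if_needed` helper and the detect-then-translate flow are assumptions:

```python
# Sketch: translate only when the detected language differs from the target.
# Assumes Google Cloud credentials are configured, e.g. via
# GOOGLE_APPLICATION_CREDENTIALS.
from google.cloud import translate_v2 as translate


def translate_if_needed(text: str, target: str = "en") -> str:
    """Translate text into the target language unless it is already in it."""
    client = translate.Client()
    if client.detect_language(text)["language"] == target:
        return text
    return client.translate(text, target_language=target)["translatedText"]
```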
To operate and run the CLI, the repo provides useful commands in the `Makefile`. The `Makefile` reads environment variables from a `.env` file. Create this file locally by running the following command and then enter the relevant values:

```shell
make setup
```
Once this is done you can run the commands in the `Makefile`. These split into two main groups: running directly on your machine, or in a Docker container.
To run locally, run the following commands to install dependencies using Poetry and to set up Playwright and pre-commit, then run the CLI. Note that the Python version in your virtual environment must match the project's version; running within a virtual environment is also recommended.

```shell
make install
make run_local
```
To run in Docker, with the `data/raw` folder as input and `data/processed` as output, run:

```shell
make run_docker
```
The CLI operates on an input folder of tasks defined by JSON files in the following format, as defined in the `cpr_sdk` library dependency:
```python
class ParserInput(BaseModel):
    """Base class for input to a parser."""

    document_id: str
    document_metadata: BackendDocument
    document_name: str
    document_description: str
    document_source_url: Optional[AnyHttpUrl]
    document_cdn_object: Optional[str]
    document_content_type: Optional[str]
    document_md5_sum: Optional[str]
    document_slug: str
```
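For illustration, a sketch of loading these task files with pydantic; the `cpr_sdk.parser_models` import path and the pydantic v2 method name are assumptions:

```python
# Sketch: parse every JSON task file in the input folder.
# The import path below is an assumption and may differ between versions.
from pathlib import Path

from cpr_sdk.parser_models import ParserInput


def load_tasks(input_dir: Path) -> list[ParserInput]:
    """Validate each JSON file in the input folder as a ParserInput task."""
    return [
        ParserInput.model_validate_json(path.read_text())
        for path in sorted(input_dir.glob("*.json"))
    ]
```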
It outputs JSON files named `${id}.json`, with `id` being the `document_id` of each input document, to an output folder in the following format:
```python
class ParserOutput(BaseModel):
    """Base class for an output to a parser."""

    document_id: str
    document_metadata: BackendDocument
    document_name: str
    document_description: str
    document_source_url: Optional[AnyHttpUrl]
    document_cdn_object: Optional[str]
    document_content_type: Optional[str]
    document_md5_sum: Optional[str]
    document_slug: str

    languages: Optional[Sequence[str]] = None
    translated: bool = False
    html_data: Optional[HTMLData] = None
    pdf_data: Optional[PDFData] = None
```
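For illustration, a sketch of writing one output file per task; it assumes pydantic v2 (`model_dump_json`) and the same assumed import path as above:

```python
# Sketch: write a parser result to ${document_id}.json in the output folder.
from pathlib import Path

from cpr_sdk.parser_models import ParserOutput


def write_output(output: ParserOutput, output_dir: Path) -> None:
    """Serialise the result, naming the file after its document_id."""
    out_path = output_dir / f"{output.document_id}.json"
    out_path.write_text(output.model_dump_json(indent=2))
```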