Given a list of documents as text files, perform named entity recognition and disambiguation using C2DH Nerd and prepare the documents as a collection of resources with associated entities in the Histograph create resource format.
Organise your documents as a collection of files in a folder, one file per document. This app extracts document metadata from the names of the files. Therefore file names should follow the pattern:
<iso_datetime_of_document>_title_<slug_of_the_document>.txt
Where:
iso_datetime_of_document - ISO 8601 date and time the document is associated with.
slug_of_the_document - a unique readable ID of the document. It should only contain ASCII letters, numbers, and the - and _ symbols. The slug is also converted into the document title by replacing _ symbols with spaces. E.g. the slug once_upon_a_time_page_29 will be converted into the title "once upon a time page 29".
The app verifies that all files follow this pattern and raises an error if any of them do not.
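The naming rule can be illustrated with a short sketch. This is not the app's own validation code: the function name parse_document_filename and the regular expression are hypothetical, and the sketch assumes that _title_ is a literal separator and that the date part is in a form accepted by Python's datetime.fromisoformat.

```python
import re
from datetime import datetime

# Assumption: "_title_" is a literal separator between the date and the slug.
# The slug may only contain ASCII letters, digits, "-" and "_".
FILENAME_RE = re.compile(r"^(?P<date>.+?)_title_(?P<slug>[A-Za-z0-9_-]+)\.txt$")


def parse_document_filename(filename):
    """Return (date, slug, title) for a file name, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the pattern: {filename}")
    # fromisoformat raises ValueError itself if the date part is not valid.
    date = datetime.fromisoformat(match.group("date"))
    slug = match.group("slug")
    # The title is the slug with underscores replaced by spaces.
    title = slug.replace("_", " ")
    return date, slug, title


date, slug, title = parse_document_filename(
    "1945-05-08T12:00:00_title_once_upon_a_time_page_29.txt"
)
print(date.isoformat(), slug, title)
```

Running a check like this over the folder before starting the app can help spot misnamed files early.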
python -m hg_resource_creator <arguments>
Build container:
make build
Run container:
docker run --rm -it \
-v <path_to_documents>:/hg_documents \
-v <path_to_output_folder>:/hg_output \
histograph-resource-creator \
python -m hg_resource_creator \
--path /hg_documents \
--outpath /hg_output/mycorpus.jsons \
<optional_arguments>
Arguments are:
--path <folder> - path to the folder with document files (see Preparing documents section). REQUIRED
--outpath <resource_jsons_filename> - path to the file where Histograph resources will be saved. REQUIRED
--ner-method <method> - named entity recognition method to be used. See this file for reference. spacy_small_multi is used by default.
--ned-method <method> - named entity recognition and disambiguation method to be used. See this file for reference.
--custom-entities-url <url> - URL of the custom entities CSV file to use if the ner/ned method chosen above supports it.
--language <code> - two-letter code of the language the documents are written in. en by default.
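To summarise which arguments are required and which have defaults, the list above can be mirrored in an argparse declaration. This is only a sketch and not the app's actual parser: build_parser is a hypothetical name, and the None defaults for --ned-method and --custom-entities-url are assumptions, since the text does not state them.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="hg_resource_creator")
    parser.add_argument("--path", required=True,
                        help="folder with document files")
    parser.add_argument("--outpath", required=True,
                        help="file where Histograph resources will be saved")
    parser.add_argument("--ner-method", default="spacy_small_multi",
                        help="named entity recognition method")
    # Assumption: no default is documented for the two options below.
    parser.add_argument("--ned-method", default=None,
                        help="recognition and disambiguation method")
    parser.add_argument("--custom-entities-url", default=None,
                        help="URL of a custom entities CSV file")
    parser.add_argument("--language", default="en",
                        help="two-letter language code of the documents")
    return parser


# Only the two required arguments are given; the rest fall back to defaults.
args = build_parser().parse_args(
    ["--path", "/hg_documents", "--outpath", "/hg_output/mycorpus.jsons"]
)
print(args.ner_method, args.language)
```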