Histograph resource creator

Given a list of documents as text files, perform named entity recognition and disambiguation using C2DH Nerd, and prepare the documents as a collection of resources with associated entities in the Histograph "create resource" format.

Preparing documents

Organise your documents as a collection of files in a folder, one file per document. This app extracts document metadata from the file names. Therefore, file names should follow the pattern:

<iso_datetime_of_document>_title_<slug_of_the_document>.txt

Where:

iso_datetime_of_document - ISO 8601 date and time the document is associated with
slug_of_the_document - a unique readable ID of the document. It should contain only ASCII letters, numbers, and the - and _ symbols. The slug is also converted into the document title by replacing _ symbols with spaces. For example, the slug once_upon_a_time_page_29 will be converted into the title "once upon a time page 29".

The app verifies that all files follow this pattern and raises an error if any do not.
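
The naming rule above can be illustrated with a minimal sketch. This is not the app's actual code: the function name parse_document_filename, the DocumentMeta type, and the choice to treat _title_ as a literal separator are assumptions made for illustration, and the app may accept a wider range of ISO 8601 forms than datetime.fromisoformat does.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumption: "_title_" is a literal separator between the date and the slug.
SEPARATOR = "_title_"
# Slug rule from the README: ASCII letters, numbers, "-" and "_" only.
SLUG_RE = re.compile(r"[A-Za-z0-9_-]+")


class DocumentMeta(NamedTuple):
    """Hypothetical container for metadata taken from a file name."""
    date: datetime
    slug: str
    title: str


def parse_document_filename(filename: str) -> DocumentMeta:
    """Extract date, slug and title from a document file name.

    Raises ValueError if the name does not follow the expected pattern.
    """
    if not filename.endswith(".txt"):
        raise ValueError(f"{filename!r}: expected a .txt file")
    stem = filename[: -len(".txt")]

    # Split at the first separator; everything after it is the slug.
    date_part, sep, slug = stem.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"{filename!r}: missing {SEPARATOR!r} separator")

    try:
        date = datetime.fromisoformat(date_part)
    except ValueError as exc:
        raise ValueError(f"{filename!r}: invalid ISO 8601 date/time") from exc

    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"{filename!r}: slug contains invalid characters")

    # The title is the slug with underscores replaced by spaces.
    return DocumentMeta(date=date, slug=slug, title=slug.replace("_", " "))
```

For example, parse_document_filename("1945-05-08T12:00:00_title_once_upon_a_time_page_29.txt") yields the slug once_upon_a_time_page_29 and the title "once upon a time page 29". Note that some file systems do not allow colons in file names, so a date-only prefix such as 1945-05-08 may be more portable.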

Running

In virtualenv

python -m hg_resource_creator <arguments>

In docker

Build container:

make build

Run container:

docker run --rm -it \
  -v <path_to_documents>:/hg_documents \
  -v <path_to_output_folder>:/hg_output \
  histograph-resource-creator \
  python -m hg_resource_creator \
    --path /hg_documents \
    --outpath /hg_output/mycorpus.jsons \
    <optional_arguments>

Arguments are:

--path <folder> - path to the folder with document files (see Preparing documents section). REQUIRED
--outpath <resource_jsons_filename> - path to the file where histograph resources will be saved. REQUIRED
--ner-method <method> - named entity recognition method to be used. See this file for reference. spacy_small_multi is used by default.
--ned-method <method> - named entity recognition and disambiguation method to be used. See this file for reference.
--custom-entities-url <url> - URL of the custom entities CSV file to use if ner/ned method chosen above supports it.
--language <code> - two letter code of the language the documents are written in. en by default.
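
To make the argument list above concrete, here is a hedged sketch of an argparse parser that mirrors it. It is not the app's actual implementation: the function name build_parser is hypothetical, and because the README states no default for --ned-method or --custom-entities-url, the sketch assumes they are unset (None) when omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(prog="hg_resource_creator")
    parser.add_argument("--path", required=True,
                        help="folder with document files")
    parser.add_argument("--outpath", required=True,
                        help="file where Histograph resources will be saved")
    parser.add_argument("--ner-method", default="spacy_small_multi",
                        help="named entity recognition method")
    # Assumption: no default is documented, so None means "not set".
    parser.add_argument("--ned-method", default=None,
                        help="named entity recognition and disambiguation method")
    parser.add_argument("--custom-entities-url", default=None,
                        help="URL of a custom entities CSV file")
    parser.add_argument("--language", default="en",
                        help="two-letter language code of the documents")
    return parser


if __name__ == "__main__":
    # Print the parsed arguments, with defaults filled in for omitted ones.
    print(build_parser().parse_args())
```

Parsing only the two required arguments leaves the NER method at spacy_small_multi and the language at en, matching the defaults documented above.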
