A web API for tagging text, based on the DocuScope Ity tagger.
For any questions regarding the overall project or the language model used, please contact [email protected]
The project code is supported and maintained by the Eberly Center at Carnegie Mellon University. For help with this fork, project, or service, please contact [email protected].
- Neo4J database.
- A DocuScope dictionary stored in the Neo4J database, generated using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-rules> docuscope-rule-neo4j tool and a DocuScope language model.
  - `common-dict.json` file that specifies a hierarchical organization of clusters. (JSON Schema)
  - `wordclasses.json` file, which is the JSON version of a DocuScope language model's `_wordclasses.txt` file converted using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-rules> docuscope-wordclasses tool.
  - `${DICTIONARY}_tones.json.gz` file, which is the compressed JSON version of a DocuScope `_tones.txt` file converted using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-tones> ds-tones tool.
- MySQL database for storing CMU_Sidecar/docuscope-classroom> documents and performance measures.
- Optional: Memcached
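As a rough illustration of how the runtime files listed above relate to the `DICTIONARY` and `DICTIONARY_HOME` settings, the following sketch builds the expected file paths and reports which ones are missing. The helper names (`expected_dictionary_files`, `missing_dictionary_files`, `load_tones`) are hypothetical and not part of the tagger, and the sketch assumes all three files sit directly in the base directory, which this README does not confirm.

```python
import gzip
import json
from pathlib import Path


def expected_dictionary_files(dictionary_home, dictionary="default"):
    """Return the runtime file paths named in the requirements list.

    Assumption: all three files live directly under ``dictionary_home``.
    """
    home = Path(dictionary_home)
    return [
        home / "common-dict.json",
        home / "wordclasses.json",
        home / f"{dictionary}_tones.json.gz",
    ]


def missing_dictionary_files(dictionary_home, dictionary="default"):
    """List the expected files that are not present on disk."""
    return [
        path
        for path in expected_dictionary_files(dictionary_home, dictionary)
        if not path.is_file()
    ]


def load_tones(dictionary_home, dictionary="default"):
    """Read the gzip-compressed tones JSON into Python objects."""
    path = Path(dictionary_home) / f"{dictionary}_tones.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Running `missing_dictionary_files` before starting the service is one way to catch a misconfigured dictionary directory early; the structure of the loaded tones data is not described here, so `load_tones` returns whatever the JSON contains.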
The following environment variables should be set so that the DocuScope tagger can access the required services. The defaults are reasonable for a development environment where everything is hosted locally; they do not reflect values that should be used in a production environment.
Variable | Description | Default |
---|---|---|
DICTIONARY | String used in formulating tag labels and used to load the correct dictionary files. | default |
DICTIONARY_HOME | Path to base directory of necessary runtime dictionary files specified above. | <Application's base directory>/dictionary |
DB_HOST | Hostname of the MySQL database for storing processed documents. | 127.0.0.1 |
DB_PORT | Port of the MySQL document database. | 3306 |
DB_PASSWORD | Password for accessing the document database. [1] | None [2] |
DB_USER | Username for accessing the document database. [1] | docuscope |
MEMCACHED_URL | Hostname for the optional caching service. | localhost |
MEMCACHED_PORT | Port of the caching service. | 11211 |
MYSQL_DATABASE | Identifier for document database. | docuscope |
NEO4J_DATABASE | Identifier for dictionary database. | neo4j |
NEO4J_PASSWORD | Password for accessing the dictionary database. [1] | None [2] |
NEO4J_USER | Username for accessing the dictionary database. [1] | neo4j |
NEO4J_URI | URI of the dictionary database. | neo4j://localhost:7687/ [3] |
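To make the lookup behavior concrete, here is a minimal sketch of how a setting might be resolved from the environment using the defaults in the table. It is not the application's actual code: `get_setting` and `DEFAULTS` are hypothetical names, `DICTIONARY_HOME` is omitted because its default depends on the application's base directory, and the sketch assumes that the `_FILE` affix is appended as a suffix (for example `DB_PASSWORD_FILE`) and that a file value takes precedence over the plain variable.

```python
import os
from pathlib import Path

# Defaults copied from the table; passwords default to None.
DEFAULTS = {
    "DICTIONARY": "default",
    "DB_HOST": "127.0.0.1",
    "DB_PORT": "3306",
    "DB_PASSWORD": None,
    "DB_USER": "docuscope",
    "MEMCACHED_URL": "localhost",
    "MEMCACHED_PORT": "11211",
    "MYSQL_DATABASE": "docuscope",
    "NEO4J_DATABASE": "neo4j",
    "NEO4J_PASSWORD": None,
    "NEO4J_USER": "neo4j",
    "NEO4J_URI": "neo4j://localhost:7687/",
}


def get_setting(name, environ=None):
    """Resolve a setting from NAME_FILE, then NAME, then the default.

    Assumption: the ``_FILE`` affix is a suffix and wins over NAME.
    """
    environ = os.environ if environ is None else environ
    file_path = environ.get(f"{name}_FILE")
    if file_path:
        # Docker secrets are mounted as files; strip the trailing newline.
        return Path(file_path).read_text(encoding="utf-8").strip()
    return environ.get(name, DEFAULTS.get(name))
```

Passing a plain dictionary as `environ` keeps the function easy to exercise without touching the real process environment.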
- Build docker image:
  `docker build -t <tag> .`
  When deployed, the service is bound to port 80 of the docker container.
- Run locally:
  `pipenv run hypercorn app.main:app --bind 0.0.0.0:8000`
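After starting the service locally, a quick way to confirm that something is listening on the bound port is a plain TCP connection attempt. The sketch below uses only the standard library; `is_listening` is a hypothetical helper, and it checks only that the port accepts connections, not that the tagger itself is healthy.

```python
import socket


def is_listening(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

The default port of 8000 matches the `--bind 0.0.0.0:8000` argument in the local run command; for the Docker deployment, the port to check would be whichever host port is mapped to the container's port 80.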
This service is meant to work in conjunction with CMU_Sidecar/docuscope-classroom>, which is designed for visualizing and analyzing the results in a classroom setting, and with DocuScope Write & Audit.
This project was partially funded by the A.W. Mellon Foundation, Carnegie Mellon University's Simon Initiative Seed Grant, and the Berkman Faculty Development Fund.
Footnotes
1. It is recommended to use Docker secrets to get these values. The application is able to retrieve values from specified files if the environment variable has the `_FILE` affix added.
2. Passwords intentionally default to None value for security reasons.
3. See Neo4J Python Driver information for more details on the various valid protocols.
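As an illustration of the protocol choices mentioned for `NEO4J_URI`, the sketch below parses a URI with the standard library and reports what its scheme implies. The scheme list reflects the URI schemes accepted by recent versions of the Neo4j Python driver as far as I am aware; treat it as an assumption and check the driver documentation for the version in use. `describe_neo4j_uri` is a hypothetical helper and does not open a connection.

```python
from urllib.parse import urlparse

# Assumed set of schemes accepted by the Neo4j Python driver.
KNOWN_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def describe_neo4j_uri(uri):
    """Summarize a Neo4j connection URI without connecting to it."""
    parsed = urlparse(uri)
    if parsed.scheme not in KNOWN_SCHEMES:
        raise ValueError(f"Unrecognized scheme: {parsed.scheme!r}")
    return {
        "scheme": parsed.scheme,
        "host": parsed.hostname,
        # 7687 is the default Bolt port when none is given.
        "port": parsed.port or 7687,
        # neo4j:// schemes use routing; bolt:// connects to one server.
        "routing": parsed.scheme.startswith("neo4j"),
        # "+s" and "+ssc" variants request an encrypted connection.
        "encrypted": "+" in parsed.scheme,
    }
```

For the default value `neo4j://localhost:7687/`, this reports a routed, unencrypted connection to `localhost` on port 7687, which suits local development but is unlikely to be what a production deployment wants.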