mrkkollo/evkk-api

Installation

  1. Clone this repository and move inside it with a terminal.
  2. Install gcc and g++ system-wide. On Debian-based systems, sudo apt update && sudo apt install gcc g++ should be sufficient.
  3. Run conda env create -f environment.yaml to create the Python environment.
  4. Create a directory called "models" inside the root of this repository.
  5. If you possess a different model you wish to use, add it to the models/ directory and change the "model_name" value inside conf.json. Otherwise, if no model is present, one is downloaded automatically.
  6. Users can also directly edit the file with the pre-made token replacements under texts/dependencies/word_mapping.

Usage

  1. If using a custom JamSpell model, create a "models" directory, place the model binary in that folder, and change the "model_name" value inside conf.json. If this is not done, the model is downloaded from the web unless it is already inside the folder. The downloaded model is placed in an automatically created "models" folder.
  2. Activate the environment with conda activate evkk.
  3. Inside the environment, run python manage.py migrate. This will populate a simple SQLite database with the DB schema.
  4. Run the application with python manage.py runserver. Please note that this is just a plain development server and is not appropriate for a full production setting.
  5. Open http://localhost:8000 to view the API. From there you can use the visual Browsable API to make requests.
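
As a rough illustration of step 5, the sketch below checks from Python whether the development server is answering. The function name api_is_up is hypothetical and not part of this repository; only the base URL comes from the steps above.

```python
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy: the probe targets a local server.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def api_is_up(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at base_url, False otherwise.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    try:
        with _opener.open(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure (URLError is an OSError).
        return False


if __name__ == "__main__":
    print("API reachable:", api_is_up())
```

Because the first requests may be slow while the model loads, a generous timeout is probably sensible when probing an actual correction endpoint.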

NB! The models are loaded into memory in the global namespace on the first requests, which might take up to a few minutes. Subsequent requests are processed much faster.

NB! Since the model is loaded into memory inside the global namespace, any web server configurations that create multiple workers should take into account increased memory usage.

Jamspell

Source: https://habr.com/en/post/346618/ (In Russian)

This application uses JamSpell for text correction. JamSpell is a heavily modified version of the SymSpell algorithm. It uses 3-grams for language modeling, Kneser–Ney smoothing so that a never-before-seen 3-gram does not produce zero candidates and zero probability, perfect hashing to keep RAM usage at usable levels, and a Bloom filter to further reduce the model's memory footprint.
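
To make the Bloom filter idea concrete, here is a generic textbook sketch. It is not JamSpell's implementation; the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Compact set membership: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


words = BloomFilter()
for word in ("tere", "maailm", "õpilane"):
    words.add(word)
```

The filter stores only bits, never the words themselves, which is why it saves memory. The trade-off is that a lookup for a word that was never added can occasionally return True.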

This application downloads a JamSpell model trained on a subset of the Estonian National Corpus 2019 (etnc19_balanced_corpus.prevert). Sadly, the full national corpus was too memory-heavy for my machine, so I had to limit training to a subset of 126,109,243 words (counted with the Linux wc -w command), which is only 8% of the corpus's word count.
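
How the subset was actually selected is not documented here. One possible approach, shown purely as an assumption, is to keep whole lines until a word budget is reached. The sketch counts whitespace-separated words, which is close to, but not guaranteed identical to, what wc -w reports.

```python
from typing import Iterable, Iterator


def take_until_word_budget(lines: Iterable[str], max_words: int) -> Iterator[str]:
    """Yield whole lines until adding the next one would exceed max_words."""
    used = 0
    for line in lines:
        n_words = len(line.split())  # whitespace-separated tokens
        if used + n_words > max_words:
            break
        used += n_words
        yield line


sample = ["a b c", "d e", "f g h i", "j"]
subset = list(take_until_word_budget(sample, max_words=5))
```

Because the function is a generator, it can be fed an open file handle and will stop reading once the budget is used up, so the full corpus never has to fit in memory.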

Training Jamspell model

  1. Place the newline-delimited full-text training material inside model_training/training_texts.txt.
  2. Inside this repository run docker build -f docker/jamspell_dockerfile -t jamspell . to build the Jamspell image.
  3. Run the command docker run -it jamspell bash to start a container from the image and open a shell inside it.
  4. Run the command ./main/jamspell train alphabet_et.txt training_texts.txt {{model name}}.bin, using a model name of your choosing (a timestamp in the name helps with bookkeeping).
  5. Wait until the model is done training.
  6. Outside the JamSpell container, run docker ps in a terminal to find the randomly generated name of the container you started.
  7. In an outside terminal, execute docker cp {{container name}}:/JamSpell/build/{{name you set for the model}} ./{{name you set for the model}}. This will copy the model from the Docker container to the directory where you ran the command.
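
The commands in steps 4 and 7 can be assembled programmatically, for example to stamp the model name with the training time. This is a small sketch: the helper names and the "estonian" prefix are hypothetical, while the command arguments are taken from the steps above.

```python
from datetime import datetime, timezone
from typing import List, Optional


def timestamped_model_name(prefix: str = "estonian",
                           now: Optional[datetime] = None) -> str:
    """Build a model file name such as estonian_20240102_030405.bin."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.bin"


def train_command(model_name: str,
                  alphabet: str = "alphabet_et.txt",
                  texts: str = "training_texts.txt") -> List[str]:
    """Argument list for the training call, run inside the container (step 4)."""
    return ["./main/jamspell", "train", alphabet, texts, model_name]


def copy_command(container_name: str, model_name: str) -> List[str]:
    """Argument list for copying the model out of the container (step 7)."""
    return ["docker", "cp",
            f"{container_name}:/JamSpell/build/{model_name}",
            f"./{model_name}"]
```

The returned lists can be passed to subprocess.run, which avoids shell quoting issues if the model name contains unusual characters.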

If creating the model for this API:

  1. Create a models dir inside this repository.
  2. Move the model into said directory.
  3. Change the "model_name" inside conf.json to match the file name of the model.
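
The three steps above could be scripted roughly as follows. The sketch assumes that conf.json is a JSON object with a top-level "model_name" key holding just the file name; the function name install_model is hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def install_model(model_file, repo_root) -> Path:
    """Move model_file into <repo_root>/models and point conf.json at it."""
    repo_root = Path(repo_root)
    model_file = Path(model_file)

    models_dir = repo_root / "models"
    models_dir.mkdir(exist_ok=True)  # step 1: create the models directory

    target = models_dir / model_file.name
    shutil.move(str(model_file), str(target))  # step 2: move the model

    # Step 3: update "model_name" (assumed to be a top-level key).
    conf_path = repo_root / "conf.json"
    conf = json.loads(conf_path.read_text(encoding="utf-8"))
    conf["model_name"] = model_file.name
    conf_path.write_text(
        json.dumps(conf, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory so no real repository is touched.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "conf.json").write_text('{"model_name": "old.bin"}', encoding="utf-8")
        (root / "new_model.bin").write_bytes(b"demo")
        print(install_model(root / "new_model.bin", root))
```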

API Docker

Because this package needs external system dependencies to compile, installation is very platform dependent. I've created a Dockerfile that users can use to guarantee a working environment.

  1. Inside this repository run docker build -f docker/api_dockerfile -t jamspell-rest . to build the image.
  2. To start a container from the image, run docker run -p 8000:8000 jamspell-rest. This runs the application in an isolated container and publishes port 8000 to the host.
  3. You can then access the page through http://localhost:8000.
