A project combining NLP, AI and linguistics to build a Danish grammar assistant.
Text could be corrected at GrammatikTAK.com, which is no longer live (see the code for it here). Models and datasets are not included in this repo.
The backend is no longer hosted. You can run it locally and change the website code to point to your locally hosted backend.
The backend uses trained models. To run the backend without the models, change the first line in GrammatiktakBackend/main.py to `local_models_avaliable = False`,
then `cd GrammatiktakBackend`
and host with `flask --app main run`.
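Once the backend is running, a client can query it over HTTP. The sketch below shows one way to do that; the endpoint path (`/correct`) and JSON field names are assumptions, so check the routes in GrammatiktakBackend/main.py for the actual API.

```python
# Minimal sketch of querying a locally hosted GrammatikTAK backend.
# The endpoint path ("/correct") and payload shape are assumptions;
# see GrammatiktakBackend/main.py for the real route and field names.
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # default address used by `flask run`

def build_request(text: str, endpoint: str = "/correct") -> urllib.request.Request:
    """Build a POST request carrying the text to be checked."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def check_text(text: str) -> dict:
    """Send the text to the backend and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Only the standard library is used here, so the snippet runs without extra dependencies once the backend is up.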
This project could definitely be better documented. If you need any assistance, want a walkthrough of the project, or want my datasets/models for further experiments, feel free to contact me.
The rise of NLP and AI has greatly benefited popular languages, their grammar assistants and NLP tooling. The Nordic languages, especially Danish, are sadly far behind. This repo will hopefully help cover some basic NLP needs and make a great Danish, and potentially Nordic, grammar assistant.
I focus on making GrammatikTAK:
- Simple: Built in modules; a module can easily be replaced, reworked or even deleted without affecting other modules.
- Adaptable: Although speed is important, I have prioritized adaptability and readability over speed.
- Well-tested: I have tested my models extensively to ensure high accuracy.
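The modular design above can be sketched as a simple pipeline of interchangeable checkers. The names here are illustrative, not the actual classes in GrammatiktakBackend; the point is that each module shares one interface, so any of them can be swapped or dropped without touching the others.

```python
# Hypothetical sketch of the modular design: every checker is a function
# with the same signature, collected into a pipeline. These toy modules
# are illustrative only, not the real GrammatiktakBackend modules.
from typing import Callable, List

# A module takes a sentence and returns a list of correction messages.
Module = Callable[[str], List[str]]

def punctuation_module(sentence: str) -> List[str]:
    """Toy module: flag a missing final period."""
    return [] if sentence.endswith(".") else ["Missing final period."]

def capitalization_module(sentence: str) -> List[str]:
    """Toy module: flag a lowercase first letter."""
    return [] if sentence[:1].isupper() else ["Sentence should start with a capital letter."]

def run_pipeline(sentence: str, modules: List[Module]) -> List[str]:
    """Run every module independently and collect their corrections."""
    corrections: List[str] = []
    for module in modules:
        corrections.extend(module(sentence))
    return corrections
```

Replacing or deleting a module is then just editing the list passed to `run_pipeline`, which is the property the "Simple" goal describes.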
Here is a small overview of the most important directories within the Development folder:
- BackendAssistants: Scripts for analysing the backend performance & complexity.
- DataProcessing: Scripts & notebooks for converting text to datasets.
- FineTuneModels: Scripts for finetuning models and logging performance.
- GoogleDocsAddOn: Scripts for the GrammatikTAK Google Docs Add-on
- GoogleExtension: Scripts for the GrammatikTAK Google Extension (not finished)
- GrammatiktakBackend: Development of backend 2.0; main.py is the backend entry point. Currently in use.
- Other: PowerPoint presentations.
- TestingOtherModels: Scripts for testing third-party models to use or compare against.
This is the frontend script for the website at GrammatikTAK.com.
- See measurements based on each school; maybe this should be available for schools to check at all times. @https://grammatiktak.com/data
- Be able to enter a code in the frontend to unlock better features.
- Send school id to backend
- School code should work (green if correct, red if not and then reset, should load from cookie)
- Authentication token (GitHub secret) to backend
A collection of human-annotated datasets for everyone to use, distributed under the CC BY-SA 4.0 license.
This is not a set of training data. There already exist a number of huge Nordic datasets (1, 2 & 3) with different kinds of data for training.
I felt that the Danish NLP community could benefit from having a high-quality, human-annotated dataset to use when testing NLP models, or for other purposes.
I strive to make the datasets:
- Of extremely high quality (if you find a mistake, let us know).
- Large, so that the test size is big enough for you to get a reasonably accurate estimate of your model's performance.
- Broad, to capture different themes, sentence constructions and lengths.
You are free to use whatever file you want. Here is a description of how they are constructed:
- Every file is focused on one specific error in the selected language (currently only Danish; we would like to expand this to other Nordic languages).
- To learn more about each file and how to load it, see the README in the specific folder.
Data sources:
- Wikipedia
- Made up text
- Internal documents
- Danish gigaword
- Danish Named Entity Recognition
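A typical use of these files is scoring a model against the annotated sentences. The sketch below shows one way to do that; the CSV layout and the column names (`sentence`, `correction`) are assumptions made for illustration, so check the README in each dataset folder for the actual format.

```python
# Sketch of evaluating a model on one of the test files. The column names
# ("sentence", "correction") are assumptions; see each folder's README
# for the real file format.
import csv
from typing import Callable, List, Tuple

def load_dataset(path: str) -> List[Tuple[str, str]]:
    """Load (input sentence, expected correction) pairs from a CSV file."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        return [(row["sentence"], row["correction"]) for row in reader]

def accuracy(pairs: List[Tuple[str, str]], model: Callable[[str], str]) -> float:
    """Fraction of sentences the model corrects exactly as annotated."""
    if not pairs:
        return 0.0
    hits = sum(1 for sentence, expected in pairs if model(sentence) == expected)
    return hits / len(pairs)
```

Since the datasets are meant for testing rather than training, an exact-match accuracy like this gives a quick, reproducible benchmark across models.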