- Doccano annotation server with spacy backend
fiete@ubu:~/Documents/programming/spacy/doccano_spacy$ tree -L 2 --dirsfirst
.
├── custom-model # contains the spacy model (training) files
│ ├── model-best # trained model (best)
│ ├── model-last # trained model (last)
│ ├── base_config.cfg
│ ├── config.cfg
│ └── train.spacy
├── data # contains the source data
│ ├── exported
│ ├── captum.csv
│ ├── captum.txt
│ └── label_config.json
├── spacy-server # spacy backend server
│ ├── app
│ ├── Dockerfile
│ └── run.sh
├── convert.py # convert reports csv to doccano format
├── docker-compose.yaml
├── exporter.py # contains helper functions
├── generate_train_file.py # generate data file used for training spacy
└── README.md
9 directories, 13 files
For proper authentication, you'll need to create a .env file with the following content in the root of this project:
SPACY_USER=admin
SPACY_PASSWORD=password
You can change the credentials to your liking, but make sure to also adjust the Authorization
headers, as described in the Server README and the Set parameters step below.
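For reference, here is a minimal sketch of how an Authorization header value could be derived from the credentials in .env. It assumes the spaCy server uses HTTP Basic authentication, which this README does not confirm, so check the Server README for the scheme actually in use. The function name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value (assumed scheme)."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Default credentials from the .env example above.
    print(basic_auth_header("admin", "password"))
```

If you change SPACY_USER or SPACY_PASSWORD, recompute the value and update the header wherever it is configured.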
docker-compose up -d
Doccano should now be available on http://localhost:8000 in your browser (Credentials: admin, password)
Shut down:
docker-compose stop
To start annotating, convert your CSV file (in my case data/captum.csv) into the format doccano requires for imports.
python convert.py
Note that this requires the spaCy en_core_web_md model, which can be obtained by running python -m spacy download en_core_web_md.
Open the web UI at http://localhost:8000.
Files you import must have a specific format. You may use convert.py to convert from a pandas dataframe to a textline file.
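A textline file holds one document per line. The sketch below illustrates that conversion with the standard csv module instead of pandas. It is not the actual convert.py, and the column name "text" and the function name csv_to_textlines are assumptions.

```python
import csv


def csv_to_textlines(csv_path: str, out_path: str, text_column: str = "text") -> int:
    """Write one document per line and return the number of lines written."""
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse internal whitespace (including newlines) so that each
            # document occupies exactly one line in the output.
            text = " ".join((row.get(text_column) or "").split())
            if text:  # skip rows without any text
                dst.write(text + "\n")
                written += 1
    return written
```

For example, csv_to_textlines("data/captum.csv", "data/captum.txt") would apply if the CSV had a "text" column; pass a different text_column otherwise.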
You can create labels in the Labels section (side menu). Labels can also be imported and exported (see data/label_config.json).
Important: Make sure you have created your custom labels before setting this up!
Navigate to Settings and select the Auto Labeling tab. Hit Create and select Custom REST template.
In the next step, we specify the request properties. This includes setting the Content-Type and Authorization headers and the request Body. For details on how to obtain the correct Authorization header, also check the Server README.
If all is configured correctly, the test should return a valid response.
Here we can customize the mapping between the response we get from the annotation backend (in this case the spaCy server) and doccano. The mapping is written as a Jinja2 template.
Finally, we have to provide the mapping between the labels returned by the spaCy backend and the ones present in doccano. This appears to be required even when they are identical.
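To make the two mapping steps more concrete, here is a rough sketch of the same transformation in plain Python rather than Jinja2. It assumes the backend returns JSON with an "entities" list of start/end/label items (a hypothetical shape, so check the actual server response) and that doccano expects span items with label, start_offset and end_offset. LABEL_MAP and to_doccano_spans are illustrative names only.

```python
from typing import Dict, List

# Hypothetical label mapping: backend label -> doccano label.
# Identical labels are listed explicitly as well.
LABEL_MAP: Dict[str, str] = {"ORG": "ORG", "PERSON": "PER"}


def to_doccano_spans(response: dict, label_map: Dict[str, str]) -> List[dict]:
    """Convert an assumed backend response into doccano-style span dicts."""
    spans = []
    for ent in response.get("entities", []):
        label = label_map.get(ent["label"])
        if label is None:
            # This sketch drops labels that have no mapping entry.
            continue
        spans.append(
            {
                "label": label,
                "start_offset": ent["start"],
                "end_offset": ent["end"],
            }
        )
    return spans
```

In doccano itself, the first step is expressed in the Jinja2 template and the second in the label mapping form; the sketch only shows the data flow.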
pip install -r requirements
Download the spacy model
python -m spacy download en_core_web_md
In Doccano, go to the Datasets page and export the dataset. This creates a zip file containing the annotations per user (i.e. admin.jsonl) and unknown.jsonl, which contains all the sections that have not been annotated yet.
In the data folder, create an exported folder and copy over the admin.jsonl file.
The training file is used by spaCy in the spacy train command. Run the generate_train_file.py script to generate the file based on admin.jsonl.
python generate_train_file.py
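As an illustration of what such a conversion involves, the sketch below parses exported JSONL records and writes them to spaCy's binary training format. It is not the actual generate_train_file.py. It assumes each record looks like {"text": ..., "label": [[start, end, "LABEL"], ...]}, which may differ between doccano versions, and the function names are hypothetical.

```python
import json
from typing import Iterable, List, Tuple

Example = Tuple[str, List[Tuple[int, int, str]]]


def read_doccano_jsonl(lines: Iterable[str]) -> List[Example]:
    """Parse JSONL lines into (text, [(start, end, label), ...]) tuples."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        spans = [(int(s), int(e), str(l)) for s, e, l in record.get("label", [])]
        examples.append((record["text"], spans))
    return examples


def write_docbin(examples: List[Example], out_path: str, lang: str = "en") -> None:
    """Serialize the examples to a .spacy file usable by `spacy train`."""
    # Imported lazily so the parser above works without spaCy installed.
    import spacy
    from spacy.tokens import DocBin
    from spacy.util import filter_spans

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, spans in examples:
        doc = nlp.make_doc(text)
        ents = [doc.char_span(s, e, label=l) for s, e, l in spans]
        # char_span returns None when offsets do not align with token
        # boundaries; filter_spans removes overlapping spans.
        doc.ents = filter_spans([ent for ent in ents if ent is not None])
        doc_bin.add(doc)
    doc_bin.to_disk(out_path)
```

With the layout above, one would open data/exported/admin.jsonl, pass the file object to read_doccano_jsonl, and hand the result to write_docbin with custom-model/train.spacy as the output path.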
python -m spacy train custom-model/config.cfg --output ./custom-model
SciSpacy
python -m spacy train custom-model/scispacy/config.cfg --output ./custom-model/scispacy/ --paths.train ./custom-model/train.spacy --paths.dev ./custom-model/train.spacy
If everything was successful, you should now have a model-best and a model-last folder in the custom-model directory.
If the containers are still running, use docker-compose stop to stop them. Now we can recreate them with:
docker-compose up --build