Upgrade model training dependencies #42

Merged
merged 10 commits on Dec 16, 2020
16 changes: 8 additions & 8 deletions README.md
@@ -20,10 +20,10 @@ Create a new virtual environment with an environment manager of your choice. The
pip install deidentify
```

We use the spaCy tokenizer. For good compatibility with the pre-trained models, we recommend using the same spaCy tokenization models that were used at de-identification model training time:
We use the spaCy tokenizer. For good compatibility with the pre-trained models, we recommend installing the same spaCy model version that we used to train the de-identification models:

```sh
pip install https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.2.1/nl_core_news_sm-2.2.1.tar.gz#egg=nl_core_news_sm==2.2.1
pip install https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.3.0/nl_core_news_sm-2.3.0.tar.gz#egg=nl_core_news_sm==2.3.0
```
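
For reference, a quick sanity check that the pinned model version is the one that actually gets loaded (a minimal sketch, assuming a standard spaCy 2.x installation):

```python
import spacy

# Load the pinned Dutch model and print its version; for this release it should be 2.3.0.
nlp = spacy.load('nl_core_news_sm')
print(nlp.meta['version'])
```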

### Example Usage
@@ -48,7 +48,7 @@ documents = [
]

# Select downloaded model
model = 'model_bilstmcrf_ons_fast-v0.1.0'
model = 'model_bilstmcrf_ons_fast-v0.2.0'

# Instantiate tokenizer
tokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=("tagger", "ner"))
@@ -149,12 +149,12 @@ We provide a number of pre-trained models for the Dutch language. The models wer

| Name | Tagger | Lang | Dataset | F1* | Precision* | Recall* | Tags |
|------|--------|----------|---------|----|-----------|--------|--------|
| [DEDUCE (Menger et al., 2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365)** | `DeduceTagger` | NL | NUT | 0.7564 | 0.9092 | 0.6476 | [8 PHI Tags](https://github.com/nedap/deidentify/blob/168ad67aec586263250900faaf5a756d3b8dd6fa/deidentify/methods/deduce/run_deduce.py#L17) |
| [model_crf_ons_tuned-v0.1.0](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.1.0) | `CRFTagger` | NL | NUT | 0.9048 | 0.9632 | 0.8530 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.1.0) |
| [model_bilstmcrf_ons_fast-v0.1.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.1.0) | `FlairTagger` | NL | NUT | 0.9461 | 0.9591 | 0.9335 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.1.0) |
| [model_bilstmcrf_ons_large-v0.1.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.1.0) | `FlairTagger` | NL | NUT | 0.9505 | 0.9683 | 0.9333 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.1.0) |
| [DEDUCE (Menger et al., 2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365)** | `DeduceTagger` | NL | NUT | 0.6649 | 0.8192 | 0.5595 | [8 PHI Tags](https://github.com/nedap/deidentify/blob/168ad67aec586263250900faaf5a756d3b8dd6fa/deidentify/methods/deduce/run_deduce.py#L17) |
| [model_crf_ons_tuned-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) | `CRFTagger` | NL | NUT | 0.8511 | 0.9337 | 0.7820 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) |
| [model_bilstmcrf_ons_fast-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) | `FlairTagger` | NL | NUT | 0.8914 | 0.9101 | 0.8735 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) |
| [model_bilstmcrf_ons_large-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) | `FlairTagger` | NL | NUT | 0.8990 | 0.9240 | 0.8754 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) |

*\*All scores are token-level (tag-blind) precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.*
*\*All scores are micro-averaged entity-level precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.*

*\*\*DEDUCE was developed on a dataset of psychiatric nursing notes and treatment plans. The numbers reported here were obtained by applying DEDUCE to our NUT dataset. For more information on the development of DEDUCE, see the paper by [Menger et al. (2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365).*
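
The entity-level scores above can be reproduced with the package's `Evaluator` (a minimal sketch; `gold` and `predicted` are placeholders for lists of `Document` objects over the same texts):

```python
from deidentify.evaluation import evaluator

# gold and predicted are assumed to be List[Document] with identical texts.
scores = evaluator.Evaluator(gold, predicted, language='nl').entity_level()
print(scores.precision(), scores.recall(), scores.f_score())
```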

36 changes: 21 additions & 15 deletions deidentify/evaluation/evaluator.py
@@ -1,8 +1,8 @@
import warnings
from collections import namedtuple
from typing import List

import numpy as np
import spacy
from loguru import logger
from sklearn.metrics import confusion_matrix
from spacy.gold import biluo_tags_from_offsets
@@ -13,32 +13,34 @@
Entity = namedtuple('Entity', ['doc_name', 'start', 'end', 'tag'])
ENTITY_TAG = 'ENT'

# Silence spaCy warning regarding misaligned entity boundaries. It will show up multiple times
# because the message changes with the input text.
# More info on the warning: https://github.com/explosion/spaCy/issues/5727
warnings.filterwarnings('ignore', message=r'.*W030.*')

def flatten(lists):
return [e for l in lists for e in l]


class Evaluator:

def __init__(self, gold: List[Document], predicted: List[Document], language='nl',
tokenizer=None):
def __init__(self, gold: List[Document], predicted: List[Document], language='nl'):
self.gold = gold
self.predicted = predicted

self.tags = sorted(list(set(ann.tag for doc in gold for ann in doc.annotations)))

if tokenizer:
self.tokenize = tokenizer
else:
if language not in self.supported_languages():
logger.warning(
'Unknown language {} for evaluation. Fallback to "en"'.format(language))
language = 'en'
if language not in self.supported_languages():
logger.warning(
'Unknown language {} for evaluation. Fallback to "en"'.format(language))
language = 'en'

if language == 'nl':
self.tokenize = spacy.load('nl_core_news_sm')
else:
self.tokenize = spacy.load('en_core_web_sm')
if language == 'nl':
from deidentify.tokenizer.tokenizer_ons import TokenizerOns
self.tokenizer = TokenizerOns(disable=('tagger', 'parser', 'ner'))
else:
from deidentify.tokenizer.tokenizer_en import TokenizerEN
self.tokenizer = TokenizerEN(disable=('tagger', 'parser', 'ner'))

@staticmethod
def supported_languages():
@@ -108,7 +110,7 @@ def token_level_blind(self):
return metric

def token_annotations(self, doc, tag_blind=False, entity_tag=ENTITY_TAG):
parsed = self.tokenize(doc.text, disable=("tagger", "parser", "ner"))
parsed = self.tokenizer.parse_text(doc.text)
entities = [(int(ann.start), int(ann.end), ann.tag) for ann in doc.annotations]
biluo_tags = biluo_tags_from_offsets(parsed, entities)

@@ -122,6 +124,10 @@ def token_annotations(self, doc, tag_blind=False, entity_tag=ENTITY_TAG):
#
# https://spacy.io/api/goldparse#biluo_tags_from_offsets
tags.append('O')
warnings.warn(
'Some entities could not be aligned in the text. Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment.',
UserWarning
)
elif tag_blind:
tags.append(entity_tag)
else:
10 changes: 3 additions & 7 deletions deidentify/evaluation/significance_testing.py
@@ -18,20 +18,16 @@ def _load_yaml(yaml_file):
return config


def noop():
return None


def micro_f1(gold: List[Document], predicted: List[Document]):
return evaluator.Evaluator(gold, predicted, tokenizer=noop).entity_level().f_score()
return evaluator.Evaluator(gold, predicted).entity_level().f_score()


def micro_precision(gold: List[Document], predicted: List[Document]):
return evaluator.Evaluator(gold, predicted, tokenizer=noop).entity_level().precision()
return evaluator.Evaluator(gold, predicted).entity_level().precision()


def micro_recall(gold: List[Document], predicted: List[Document]):
return evaluator.Evaluator(gold, predicted, tokenizer=noop).entity_level().recall()
return evaluator.Evaluator(gold, predicted).entity_level().recall()


class SignificanceReport:
11 changes: 10 additions & 1 deletion deidentify/methods/tagging_utils.py
@@ -1,6 +1,6 @@
"""Utility methods to convert between standoff and BIO format.
"""

import warnings
from collections import defaultdict, namedtuple
from typing import List, Tuple

@@ -15,6 +15,11 @@
Token = namedtuple('Token', ['text', 'pos_tag', 'label', 'ner_tag'])
ParsedDoc = namedtuple('ParsedDoc', ['spacy_doc', 'name', 'text'])

# Silence spaCy warning regarding misaligned entity boundaries. It will show up multiple times
# because the message changes with the input text.
# More info on the warning: https://github.com/explosion/spaCy/issues/5727
warnings.filterwarnings('ignore', message=r'.*W030.*')


def standoff_to_sents(docs: List[Document],
tokenizer: Tokenizer,
@@ -220,6 +225,10 @@ def _doc_to_bio(parsed_doc: spacy.tokens.Doc, annotations: List[Annotation]):
# Returned by spacy if token boundaries mismatch entity boundaries.
# https://spacy.io/api/goldparse#biluo_tags_from_offsets
tags.append('O')
warnings.warn(
'Some entities could not be aligned in the text. Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment.',
UserWarning
)
else:
tags.append(biluo_to_bio[tag[0:2]] + tag[2:])

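
The new warning points at spaCy's alignment helper. A minimal sketch of that diagnostic (spaCy 2.x API; the text and character offsets below are made up for illustration):

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank('nl')
text = 'Bel 0612345678!'
# The character span ends inside the phone-number token, so it cannot be
# aligned to token boundaries.
entities = [(4, 13, 'Phone')]

tags = biluo_tags_from_offsets(nlp.make_doc(text), entities)
print(tags)  # misaligned tokens are tagged '-', e.g. ['O', '-', 'O']
```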
5 changes: 5 additions & 0 deletions deidentify/tokenizer/tokenizer_ons.py
@@ -75,6 +75,11 @@ def _metadata_sentence_segmentation(doc):
NLP.tokenizer.add_special_case(case.lower(), [{ORTH: case.lower()}])


infixes = NLP.Defaults.infixes + [r'\(', r'\)', r'(?<=[\D])\/(?=[\D])']
infix_regex = spacy.util.compile_infix_regex(infixes)
NLP.tokenizer.infix_finditer = infix_regex.finditer


class TokenizerOns(Tokenizer):

def parse_text(self, text: str) -> spacy.tokens.doc.Doc:
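
The effect of the added infix rules can be reproduced on a bare pipeline (a sketch assuming spaCy 2.3.x; the expected outputs follow the new tokenizer tests in this PR):

```python
import spacy

nlp = spacy.blank('nl')
# Same extra infix patterns as in tokenizer_ons.py: split on parentheses and on
# slashes between non-digits, so dates such as 13/01/2020 stay in one token.
infixes = list(nlp.Defaults.infixes) + [r'\(', r'\)', r'(?<=[\D])\/(?=[\D])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

print([t.text for t in nlp('GRZ(12-12-2020).')])  # ['GRZ', '(', '12-12-2020', ')', '.']
print([t.text for t in nlp('Groot/Kempers')])     # ['Groot', '/', 'Kempers']
print([t.text for t in nlp('13/01/2020')])        # ['13/01/2020']
```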
2 changes: 1 addition & 1 deletion demo.py
@@ -15,7 +15,7 @@
]

# Select downloaded model
model = 'model_bilstmcrf_ons_fast-v0.1.0'
model = 'model_bilstmcrf_ons_fast-v0.2.0'

# Instantiate tokenizer
tokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=("tagger", "ner"))
28 changes: 14 additions & 14 deletions environment.yml
@@ -2,25 +2,25 @@ name: deidentify
channels:
- conda-forge
dependencies:
- python=3.7.2
- pip=19.1
- tqdm=4.29.1
- pandas=0.23.4
- matplotlib=3.0.2
- seaborn=0.9.0
- scikit-learn=0.20.3
- python=3.7.9
- pip=20.3.1
- tqdm=4.54.1
- pandas=1.1.3
- matplotlib=3.3.2
- seaborn=0.11.0
- scikit-learn=0.23.2
- unidecode=1.0.23
- pyyaml=5.1
- joblib=0.13.2
- pip:
- spacy==2.2.1
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm==2.2.0
- https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.2.1/nl_core_news_sm-2.2.1.tar.gz#egg=nl_core_news_sm==2.2.1
- spacy==2.3.5
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz#egg=en_core_web_sm==2.3.1
- https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.3.0/nl_core_news_sm-2.3.0.tar.gz#egg=nl_core_news_sm==2.3.0
- deduce==1.0.2
- py-dateinfer==0.4.5
- loguru==0.4.0
- nameparser==1.0.2
- loguru==0.5.3
- nameparser==1.0.6
- sklearn-crfsuite==0.3.6
- flair==0.6.0.post1
- flair==0.7
- requests
- torch==1.6.0
- torch==1.7.1
36 changes: 24 additions & 12 deletions tests/methods/test_flair_utils.py
@@ -12,32 +12,44 @@ def test_standoff_to_flair_sents():
docs = corpus.train
sents, parsed_docs = flair_utils.standoff_to_flair_sents(docs, tokenizer)

assert len(sents) == 10
assert len(parsed_docs) == 10
assert len(sents) == 14
assert len(parsed_docs) == 14

bio_tags = [token.get_tag('ner').value for token in sents[0]]
token_texts = [token.text for token in sents[0]]

assert token_texts == [
'Linders',
',',
'Xandro',
'<',
'[email protected]',
'<'
]
assert bio_tags == [
'B-Name',
'I-Name',
'I-Name',
'O'
]

bio_tags = [token.get_tag('ner').value for token in sents[1]]
token_texts = [token.text for token in sents[1]]
assert token_texts == [
'[email protected]'
]
assert bio_tags == [
'B-Email'
]

bio_tags = [token.get_tag('ner').value for token in sents[2]]
token_texts = [token.text for token in sents[2]]
assert token_texts == [
'>',
'<SPACE>',
'07',
'apr',
'.',
'<SPACE>',
'<SPACE>'
]

assert bio_tags == [
'B-Name',
'I-Name',
'I-Name',
'O',
'B-Email',
'O',
'O',
'B-Date',
2 changes: 1 addition & 1 deletion tests/taggers/test_crf_tagger.py
@@ -3,7 +3,7 @@
from deidentify.tokenizer import TokenizerFactory

tokenizer = TokenizerFactory().tokenizer(corpus='ons')
tagger = CRFTagger(model='model_crf_ons_tuned-v0.1.0', tokenizer=tokenizer)
tagger = CRFTagger(model='model_crf_ons_tuned-v0.2.0', tokenizer=tokenizer)


def test_annotate():
2 changes: 1 addition & 1 deletion tests/taggers/test_flair_tagger.py
@@ -3,7 +3,7 @@
from deidentify.tokenizer import TokenizerFactory

tokenizer = TokenizerFactory().tokenizer(corpus='ons')
tagger = FlairTagger(model='model_bilstmcrf_ons_fast-v0.1.0', tokenizer=tokenizer)
tagger = FlairTagger(model='model_bilstmcrf_ons_fast-v0.2.0', tokenizer=tokenizer)


def test_annotate():
38 changes: 31 additions & 7 deletions tests/tokenizer/test_tokenizer_ons.py
@@ -2,30 +2,54 @@

tokenizer = TokenizerOns()


def test_tokenizer():
text = '=== Answer: 1234 ===\ntest a b c d.\n=== Report: 1234 ===\nMw. test test test'
doc = tokenizer.parse_text(text)

tokens = [t.text for t in doc]

assert tokens == ['=== Answer: 1234 ===\n', 'test', 'a', 'b', 'c',
'd.', '\n', '=== Report: 1234 ===\n', 'Mw.', 'test', 'test', 'test']
assert tokens == [
'=== Answer: 1234 ===\n', 'test', 'a', 'b', 'c', 'd.', '\n', '=== Report: 1234 ===\n',
'Mw.', 'test', 'test', 'test'
]


def test_sentence_segmentation():
text = '=== Answer: 1234 ===\ntest a b c d.\n=== Report: 1234 ===\nMw. test test test'
text = '=== Answer: 1234 ===\nDit is een zin.\n=== Report: 1234 ===\nMw. heeft goed gegeten.'
doc = tokenizer.parse_text(text)
sents = [sent.text for sent in doc.sents]

assert sents == [
'=== Answer: 1234 ===\n',
'test a b c d.\n',
'Dit is een zin.\n',
'=== Report: 1234 ===\n',
'Mw. test test test'
'Mw. heeft goed gegeten.'
]

sents = list(doc.sents)
assert [token.text for token in sents[0]] == ['=== Answer: 1234 ===\n']
assert [token.text for token in sents[1]] == ['test', 'a', 'b', 'c', 'd.', '\n']
assert [token.text for token in sents[1]] == ['Dit', 'is', 'een', 'zin', '.', '\n']
assert [token.text for token in sents[2]] == ['=== Report: 1234 ===\n']
assert [token.text for token in sents[3]] == ['Mw.', 'test', 'test', 'test']
assert [token.text for token in sents[3]] == ['Mw.', 'heeft', 'goed', 'gegeten', '.']


def test_infix_split_on_parenthesis():
text = 'GRZ(12-12-2020).'
doc = tokenizer.parse_text(text)
tokens = [t.text for t in doc]
assert tokens == 'GRZ ( 12-12-2020 ) .'.split()


def test_infix_split_on_forward_slash():
text = 'Groot/Kempers'
doc = tokenizer.parse_text(text)
tokens = [t.text for t in doc]
assert tokens == 'Groot / Kempers'.split()


def test_infix_split_on_forward_slash_exclude_dates():
text = '13/01/2020'
doc = tokenizer.parse_text(text)
tokens = [t.text for t in doc]
assert tokens == ['13/01/2020']