Fix whitespace token issue with newer flair versions #28

Merged
merged 2 commits into from
Sep 4, 2020
8 changes: 7 additions & 1 deletion deidentify/methods/bilstmcrf/flair_utils.py
@@ -60,7 +60,13 @@ def standoff_to_flair_sents(docs: List[Document],
     for sent in sents:
         flair_sent = Sentence()
         for token in sent:
-            tok = Token(token.text)
+            if token.text.isspace():
+                # spaCy preserves consecutive whitespaces, while flair ignores them.
+                # This would make a round-trip standoff -> token -> standoff impossible.
+                # To accommodate whitespace tokens with flair, we add a special token.
+                tok = Token('<SPACE>')
+            else:
+                tok = Token(token.text)
             tok.add_tag(tag_type='ner', tag_value=token.label)
             flair_sent.add_token(tok)
         flair_sents.append(flair_sent)
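
The substitution in this hunk can be tried out without installing flair or spaCy. The following is a minimal sketch that assumes tokens are plain strings; `SPACE_PLACEHOLDER`, `to_flair_text` and `convert_tokens` are hypothetical names used for illustration and are not part of deidentify.

```python
from typing import List

# Hypothetical constant; the diff above uses the literal '<SPACE>' directly.
SPACE_PLACEHOLDER = '<SPACE>'


def to_flair_text(token_text: str) -> str:
    """Return the string to wrap in a flair Token for one spaCy token text."""
    # str.isspace() is True only for non-empty strings made up entirely of
    # whitespace, so '\n', '\n\n' and ' ' all map to the placeholder.
    if token_text.isspace():
        return SPACE_PLACEHOLDER
    return token_text


def convert_tokens(token_texts: List[str]) -> List[str]:
    """Apply the substitution to a sentence, keeping one output per input."""
    return [to_flair_text(text) for text in token_texts]


if __name__ == '__main__':
    sentence = ['07', 'apr', '.', '\n\n']
    print(convert_tokens(sentence))
```

Because every input token yields exactly one output token, the token count is unchanged, which is the property the round trip standoff -> token -> standoff relies on.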
2 changes: 1 addition & 1 deletion setup.py
@@ -46,7 +46,7 @@ def run(self):
     python_requires='>=3.7',
     install_requires=[
         'requests',
-        'flair>=0.4.3,!=0.4.4,<0.5',
+        'flair>=0.4.3',
         'torch>=1.1.0,<1.4.0',
         'spacy>=2.2.1',
         'tqdm>=4.29',
6 changes: 3 additions & 3 deletions tests/methods/test_flair_utils.py
@@ -25,11 +25,11 @@ def test_standoff_to_flair_sents():
         '<',
         '[email protected]',
         '>',
-        '\n',
+        '<SPACE>',
         '07',
         'apr',
         '.',
-        '\n\n',
+        '<SPACE>',
     ]

     assert bio_tags == [
@@ -108,7 +108,7 @@ def test_flair_sentence_with_whitespace_tokens():
     # spaCy adds consecutive whitespace tokens as a single whitespace. These should be retained
     # in the Flair sentence, otherwise it's not possible to reconstruct the original document from
     # the tokenized representation.
-    assert [token.text for token in flair_sents[0]] == ['Mw', 'geniet', 'zichtbaar', '.', ' ']
+    assert [token.text for token in flair_sents[0]] == ['Mw', 'geniet', 'zichtbaar', '.', '<SPACE>']

     spacy_doc = docs[0].spacy_doc
     spacy_sents = list(spacy_doc.sents)
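
The comment in this test says whitespace tokens must be retained so the original document can be reconstructed. One way to see why the one-to-one token correspondence matters is the sketch below, which maps tags from placeholder tokens back onto the original token texts by position. It assumes plain string tokens; `align_tags` is a hypothetical helper for illustration and is not a function in deidentify or flair.

```python
from typing import List, Tuple

# Same literal as in the diff; the constant name itself is hypothetical.
SPACE_PLACEHOLDER = '<SPACE>'


def align_tags(original_texts: List[str],
               flair_texts: List[str],
               tags: List[str]) -> List[Tuple[str, str]]:
    """Pair each original token text with the tag of its flair counterpart."""
    if not len(original_texts) == len(flair_texts) == len(tags):
        raise ValueError('token sequences and tags must have equal length')

    aligned = []
    for original, flair_text, tag in zip(original_texts, flair_texts, tags):
        # A placeholder is only a valid match for an all-whitespace original.
        is_placeholder = flair_text == SPACE_PLACEHOLDER and original.isspace()
        if not is_placeholder and flair_text != original:
            raise ValueError(f'token mismatch: {original!r} vs {flair_text!r}')
        aligned.append((original, tag))
    return aligned


if __name__ == '__main__':
    original = ['Mw', 'geniet', 'zichtbaar', '.', ' ']
    flair = ['Mw', 'geniet', 'zichtbaar', '.', '<SPACE>']
    print(align_tags(original, flair, ['O'] * len(original)))
```

If flair silently dropped the whitespace token instead, the two sequences would differ in length and this positional alignment would fail, which is the situation the placeholder is meant to avoid.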