
Annotation with any language #7

Closed
glacierck opened this issue Jul 4, 2022 · 10 comments

@glacierck

The regular expression in dtd_parser restricts annotations to a limited set of characters. I tried extending it to allow other characters and my tests worked fine. Why are these restrictions necessary?

@hehuan2112
Collaborator

Thank you so much for your feedback.
The restrictions on the dtd file are mainly for compatibility with MAE and with the DTD format itself (https://en.wikipedia.org/wiki/Document_type_definition). Our DTD parsing implementation covers only a minimal subset of the format.

The dtd file defines the annotation schema, and its elements are used to create XML tags, so we usually stick to ASCII characters in the dtd file to avoid encoding issues.
As far as we know, if characters from other languages are used in the dtd for values (e.g., list values, strings, etc.) rather than for element names, they should work in the annotation XML files.
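For example, a minimal hand-written schema along these lines should work (this is only an illustration for discussion, not one of our released sample schemas): the element and attribute names stay ASCII while the attribute values are Cyrillic.

```dtd
<!ENTITY name "law_task">
<!ELEMENT LAW ( #PCDATA ) >
<!ATTLIST LAW type ( закон | наредба | правилник ) #IMPLIED >
```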

It would be great if you could share your sample annotation files and dtd schema here, so we can test and fix any issues.

@varna9000

varna9000 commented Jul 6, 2022

Yes, @glacierck, please share your fix. I have training texts in Cyrillic and the parser doesn't catch the annotations correctly. I noticed this when I tried to export in BIO format: e.g., the annotated term is чл. 78а ал. 1 от НК, but in the export I got:

чл	B-LAW
.	I-LAW

с	O

EDIT: Actually it might be just the BIO exporter. Other exporters catch the full term correctly. @hehuan2112 can you please advise?

@glacierck
Author

glacierck commented Jul 6, 2022

[screenshot: the modified regular expression in dtd_parser]
As shown in the screenshot, this modification lets me define DTD schemas using other characters.
@varna9000

This should be a global configuration that users can override with their own settings. @hehuan2112
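Roughly, the idea is to widen the ASCII-only character class the parser accepts for names. A quick Python illustration of the before/after (the patterns below are only a sketch, not the actual JavaScript in dtd_parser):

```python
import re

# ASCII-only name pattern, similar in spirit to the original restriction
# (an assumption for illustration, not the real dtd_parser regex).
ascii_name = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Unicode-aware alternative: in Python 3, \w matches letters from any script.
unicode_name = re.compile(r"^[^\W\d]\w*$")

for name in ["LAW", "закон", "法律"]:
    print(name, bool(ascii_name.match(name)), bool(unicode_name.match(name)))
# LAW True True
# закон False True
# 法律 False True
```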

@hehuan2112
Collaborator

hehuan2112 commented Jul 6, 2022

Thank you for your feedback, @varna9000! Yes, I think this issue is caused by the default sentencizer (sentence tokenization) algorithm, which splits a document on ".".
As the BIO/IOB2 format requires tokens and contextual sentences, we need to find the sentences around the annotated tokens. If the sentence cannot be correctly identified, the converted results won't be correct.

In your case, I guess чл. and ал. behave like abbreviations such as Dr., Mr., or Mon. The best fix would be to update the sentencizer algorithm to handle such cases, but as you know, there can be many corner cases. So what I suggest is a new feature that lets users customize a list of punctuation characters or words that mark sentence ends, or that indicate a period is not a sentence end.

For example, we could add a config panel for entering these words, or add them to the annotation schema.
In fact, we also plan to upgrade the annotation schema to a JSON format, which is easier to modify and update.
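To illustrate the exception-list idea concretely, here is a minimal sketch of the proposed behavior (plain Python for discussion only, not the current MedTator sentencizer; the exception set and example sentences are just placeholders):

```python
# Naive sentence splitter that respects a user-supplied list of
# abbreviations which should NOT end a sentence (illustration only).
NON_SENTENCE_END = {"чл.", "ал.", "Dr.", "Mr.", "Mon."}

def split_sentences(text, exceptions=NON_SENTENCE_END):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in exceptions:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Наказва се по чл. 78а ал. 1 от НК. Следва ново изречение."))
# ['Наказва се по чл. 78а ал. 1 от НК.', 'Следва ново изречение.']
```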

Any suggestions?

@hehuan2112
Collaborator


@glacierck Thank you so much! I see your point. Yes, the current DTD regex parsing supports only a very limited set of characters in the schema, as well as in the suggested value list. That is a major limitation of the current schema format. A better solution would be to upgrade the annotation schema to a JSON format; then we wouldn't need to restrict the allowed characters for element names or values.

Although we use the DTD format at present, MedTator in fact works with a JSON object of the schema during annotation and other tasks, which is loaded and converted by the dtd_parser.
So we could have the annotation schema in JSON format directly, which would make it easier to define tag names and values in other languages.
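To make the idea concrete, a JSON schema could look roughly like this (the key names are only a placeholder for discussion, not the internal object that dtd_parser produces today):

```json
{
  "name": "law_task",
  "tags": [
    {
      "name": "НОРМА",
      "type": "entity",
      "attrs": [
        { "name": "вид", "vtype": "list", "values": ["закон", "наредба"] }
      ]
    }
  ]
}
```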

Any comments?

@varna9000

varna9000 commented Jul 6, 2022

@hehuan2112 Yes, a config with sentence-end exceptions would be great. Many languages have different exceptions. For example, spaCy has an explicit tokenizer_exceptions.py file for every language, which could be reused, or even better, just let the user provide a JSON or plain-text file (newline-delimited) with the sentence-end exceptions.
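Loading a newline-delimited file would be trivial; something like the sketch below (the file name is just an example, not an existing MedTator feature):

```python
# Read user-supplied sentence-end exceptions, one entry per line.
with open("sentence_end_exceptions.txt", encoding="utf-8") as f:
    exceptions = {line.strip() for line in f if line.strip()}
```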

@glacierck
Author

Yes, using a JSON format is a good solution, but would the workload be too heavy? If so, I suggest using the YAML format, which keeps the schema easy to read offline.
@hehuan2112
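For comparison, the same hypothetical schema as the JSON sketch above, written in YAML (illustrative keys only, not a proposal for the final format):

```yaml
name: law_task
tags:
  - name: НОРМА
    type: entity
    attrs:
      - name: вид
        vtype: list
        values: [закон, наредба]
```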

@hehuan2112
Collaborator

@varna9000 Thank you for the example. I will look into spaCy's implementation and see how we can improve our algorithm.

@hehuan2112
Collaborator

@glacierck I agree, YAML is also a great option since it is easy to edit and share. I will add it to the feature roadmap.

@hehuan2112
Collaborator

Sorry for my late reply. Last year we added YAML format support in the 1.3.0 release, and all of our sample datasets now provide both DTD and YAML format schemas.
