
Annotation with any language #7

Closed
glacierck opened this issue Jul 4, 2022 · 10 comments

@glacierck

The regular expression in dtd_parser restricts annotations to a limited set of characters. I tried extending it to allow other characters and my tests worked fine. Why are these restrictions necessary?

@hehuan2112
Collaborator

Thank you so much for your feedback.
The restrictions on the dtd file are mainly for compatibility with MAE and with the DTD format itself (https://en.wikipedia.org/wiki/Document_type_definition). Our DTD parsing implementation covers only a minimal subset of the format.

The dtd file defines the annotation schema, and its elements are used to create XML tags, so we usually stick to ASCII characters in the dtd file to avoid encoding issues.
As far as we know, if characters from other languages are used in the dtd for values (e.g., list values, strings, etc.) rather than for element names, they should work in the annotation XML files.
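For example, a minimal hand-written schema along these lines should work (this is only an illustration for discussion, not one of our released sample schemas): the element and attribute names stay ASCII while the attribute values are Cyrillic.

```dtd
<!ENTITY name "law_task">
<!ELEMENT LAW ( #PCDATA ) >
<!ATTLIST LAW type ( закон | наредба | правилник ) #IMPLIED >
```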

It would be great if you could share your sample annotation files and dtd schema here, so we can test and fix any issues.

@varna9000

varna9000 commented Jul 6, 2022

Yes, @glacierck, please share your fix. I have training texts in Cyrillic and the parser doesn't catch the annotations correctly. I noticed this when I tried to export in BIO format: e.g., the annotated term is чл. 78а ал. 1 от НК, but in the export I got:

чл	B-LAW
.	I-LAW

с	O

EDIT: Actually it might be just the BIO exporter. Other exporters catch the full term correctly. @hehuan2112 can you please advise?

@glacierck
Author

glacierck commented Jul 6, 2022

[screenshot: the modified regular expression in dtd_parser]
As shown in the screenshot, this modification lets me define DTD schemas using other characters.
@varna9000

This should be a global configuration that users can override with their own settings. @hehuan2112
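Roughly, the idea is to widen the ASCII-only character class the parser accepts for names. A quick Python illustration of the before/after (the patterns below are only a sketch, not the actual JavaScript in dtd_parser):

```python
import re

# ASCII-only name pattern, similar in spirit to the original restriction
# (an assumption for illustration, not the real dtd_parser regex).
ascii_name = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Unicode-aware alternative: in Python 3, \w matches letters from any script.
unicode_name = re.compile(r"^[^\W\d]\w*$")

for name in ["LAW", "закон", "法律"]:
    print(name, bool(ascii_name.match(name)), bool(unicode_name.match(name)))
# LAW True True
# закон False True
# 法律 False True
```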

@hehuan2112
Collaborator

hehuan2112 commented Jul 6, 2022

Thank you for your feedback, @varna9000! Yes, I think this issue is caused by the default sentencizer (sentence tokenization) algorithm, which splits a document on ".".
As the BIO/IOB2 format requires tokens and contextual sentences, we need to find the sentences around the annotated tokens. If the sentence cannot be correctly identified, the converted results won't be correct.

In your case, I guess чл. and ал. behave like abbreviations such as Dr., Mr., or Mon. The best fix would be to update the sentencizer algorithm to handle such cases, but as you know, there can be many corner cases. So what I suggest is a new feature that lets users customize a list of punctuation characters or words that mark sentence ends, or that indicate a period is not a sentence end.

For example, we could add a config panel for entering these words, or add them to the annotation schema.
In fact, we also plan to upgrade the annotation schema to a JSON format, which is easier to modify and update.
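To illustrate the exception-list idea concretely, here is a minimal sketch of the proposed behavior (plain Python for discussion only, not the current MedTator sentencizer; the exception set and example sentences are just placeholders):

```python
# Naive sentence splitter that respects a user-supplied list of
# abbreviations which should NOT end a sentence (illustration only).
NON_SENTENCE_END = {"чл.", "ал.", "Dr.", "Mr.", "Mon."}

def split_sentences(text, exceptions=NON_SENTENCE_END):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in exceptions:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Наказва се по чл. 78а ал. 1 от НК. Следва ново изречение."))
# ['Наказва се по чл. 78а ал. 1 от НК.', 'Следва ново изречение.']
```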

Any suggestions?

@hehuan2112
Collaborator


@glacierck Thank you so much! I see your point. Yes, the current DTD regex parsing supports only a very limited set of characters in the schema, as well as in the suggested value list. That is a major limitation of the current schema format. A better solution would be to upgrade the annotation schema to a JSON format; then we wouldn't need to restrict the allowed characters for element names or values.

Although we use the DTD format at present, MedTator in fact works with a JSON object of the schema during annotation and other tasks, which is loaded and converted by the dtd_parser.
So we could have the annotation schema in JSON format directly, which would make it easier to define tag names and values in other languages.
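To make the idea concrete, a JSON schema could look roughly like this (the key names are only a placeholder for discussion, not the internal object that dtd_parser produces today):

```json
{
  "name": "law_task",
  "tags": [
    {
      "name": "НОРМА",
      "type": "entity",
      "attrs": [
        { "name": "вид", "vtype": "list", "values": ["закон", "наредба"] }
      ]
    }
  ]
}
```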

Any comments?

@varna9000

varna9000 commented Jul 6, 2022

@hehuan2112 Yes, a config with sentence-end exceptions would be great. Many languages have different exceptions. For example, spaCy has an explicit tokenizer_exceptions.py file for every language, which could be reused, or even better, just let the user provide a JSON or plain-text file (newline-delimited) with the sentence-end exceptions.
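Loading a newline-delimited file would be trivial; something like the sketch below (the file name is just an example, not an existing MedTator feature):

```python
# Read user-supplied sentence-end exceptions, one entry per line.
with open("sentence_end_exceptions.txt", encoding="utf-8") as f:
    exceptions = {line.strip() for line in f if line.strip()}
```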

@glacierck
Author

Yes, using a JSON format is a good solution, but would the workload be too heavy? If so, I suggest using the YAML format, which keeps the schema easy to read offline.
@hehuan2112
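For comparison, the same hypothetical schema as the JSON sketch above, written in YAML (illustrative keys only, not a proposal for the final format):

```yaml
name: law_task
tags:
  - name: НОРМА
    type: entity
    attrs:
      - name: вид
        vtype: list
        values: [закон, наредба]
```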

@hehuan2112
Collaborator

@varna9000 Thank you for the example. I will look into spaCy's implementation and see how we can improve our algorithm.

@hehuan2112
Collaborator

@glacierck I agree, YAML is also a great option since it is easy to edit and share. I will add it to the feature roadmap.

@hehuan2112
Collaborator

Sorry for my late reply. Last year we added YAML format support in the 1.3.0 release, and all of our sample datasets now provide both DTD and YAML format schemas.
