Annotation with any language #7
Comments
Thank you so much for your feedback. The dtd file defines the annotation schema, and its elements are used for creating XML tags. To avoid encoding issues, we usually use ASCII characters in the dtd file. It would be great if you could share your sample annotation files and dtd schema here, so we can test and fix any issues.
yes, @glacierck please share your fix. I have training texts in Cyrillic letters and the parser doesn't catch the annotations correctly. I found this when I tried to export in BIO format e.g. the annotated term is
EDIT: Actually it might be just the BIO exporter. Other exporters catch the full term correctly. @hehuan2112 can you please advise?
This should be a global configuration that supports user injection @hehuan2112
Thank you for your feedback @varna9000! Yes, I think this issue is because of the default sentencizer (sentence tokenization) algorithm, which splits a document based on "." In your case, I guess the For example, we can add a config panel to input these words, or add them to the annotation schema. Any suggestions?
@glacierck Thank you so much! I see your point. Yes, the current DTD regex parsing can only support very limited characters in the schema, as well as the suggested value list. I think that's a major limitation of the current schema format. I think a better solution is to upgrade the annotation schema format by using JSON format, then we don't need to specify the value range for element names or values. Although we use the DTD format at present, MedTator in fact uses a JSON object of the schema during annotation and other tasks, which is loaded and converted by the Any comments?
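To illustrate the character limitation discussed above, here is a minimal Python sketch. It is not MedTator's actual `dtd_parser` code; the two patterns and the `element_names` helper are hypothetical and only contrast an ASCII-only character class with a Unicode-aware one when reading `<!ELEMENT ...>` declarations.

```python
import re

# Hypothetical patterns for illustration only; MedTator's real regex may differ.
# ASCII-only: element names limited to Latin letters, digits, and underscore.
ASCII_ELEMENT = re.compile(r"<!ELEMENT\s+([A-Za-z0-9_]+)\s")
# Unicode-aware: in Python 3, \w on a str pattern matches letters of any script.
UNICODE_ELEMENT = re.compile(r"<!ELEMENT\s+(\w+)\s")


def element_names(dtd_text, pattern):
    """Return the element names captured by the given pattern, in order."""
    return pattern.findall(dtd_text)


SAMPLE_DTD = (
    "<!ELEMENT Диагноза ( #PCDATA ) >\n"
    "<!ELEMENT Drug ( #PCDATA ) >\n"
)

print(element_names(SAMPLE_DTD, ASCII_ELEMENT))
print(element_names(SAMPLE_DTD, UNICODE_ELEMENT))
```

With the ASCII-only class the Cyrillic element is silently skipped, which matches the symptom described in this thread; the Unicode-aware class picks up both names.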
@hehuan2112 yes, a config with sentence end exceptions would be great. Many languages have different exceptions. For example, spaCy has an explicit tokenizer_exceptions.py config file for every language, which could be used, or even better, just allow the user to provide a JSON or plain text file (newline delimited) with the sentence end exceptions.
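A rough sketch of that idea in Python, assuming a newline-delimited exception list as suggested above. The function names `load_exceptions` and `split_sentences` are hypothetical and this is not MedTator's sentencizer; it only shows how a period-based splitter could skip user-supplied abbreviations.

```python
import re


def load_exceptions(text):
    """Parse a newline-delimited exception list, ignoring blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def split_sentences(doc, exceptions):
    """Split on '.' followed by whitespace or end of text,
    unless the token ending in that '.' is a listed exception."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.(?=\s|$)", doc):
        end = match.end()
        # Last whitespace-separated token of the candidate sentence, e.g. "ул."
        token = doc[start:end].split()[-1]
        if token.lower() in exceptions:
            continue  # abbreviation, not a sentence boundary
        sentences.append(doc[start:end].strip())
        start = end
    tail = doc[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Example exception file content (one abbreviation per line).
EXCEPTIONS = load_exceptions("г.\nул.\nDr.\n")
TEXT = "Живея в г. София. Работя на ул. Раковски."

for sentence in split_sentences(TEXT, EXCEPTIONS):
    print(sentence)
```

Passing an empty exception set falls back to plain period splitting, so the same function covers both the current behaviour and the configurable one.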
Yes, using JSON format is a good solution, but will the workload be too heavy? If so, I suggest using YAML format and retaining its easy offline reading feature.
@varna9000 Thank you for the example, I will check spaCy's implementation and how to improve our algorithm.
@glacierck I agree, YAML is also a great solution and it is easy to edit and share. I plan to add it to the feature roadmap.
Sorry for my late reply. Last year, we added YAML format support in the 1.3.0 release, and all our sample datasets now provide the schema in both DTD and YAML formats.
The regular expression in dtd_parser limits which characters can be used in annotations. I tried extending it to other characters, and the tests ran normally. Why are these restrictions necessary?