Replace ANTLR with CoreNLP in text frontend #622

Merged
merged 9 commits into from
Sep 6, 2022

Conversation

SirYwell
Contributor

Targets #556.

With this PR, text is now tokenized by CoreNLP.

This has some more or less obvious effects:

  • CoreNLP does not provide per-token line numbers. Therefore, we need to go through the file contents ourselves to keep track of line numbers and line break indexes to calculate columns. The behavior of supporting CR, LF, and CRLF stays.
  • Without further adjustments, CoreNLP recognizes almost all non-whitespace symbols as tokens. Also, the splitting behavior differs (e.g. 1.5 was tokenized as |1|5 before, now it's just |1.5). As a very simple way to get a somewhat similar behavior, I added the isWord method that checks if the token contains any alphanumeric symbols.
  • CoreNLP also seems to recognize XML. In the javadoc test, it therefore creates one token for each tag.
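
For readers who want to see how the line and column bookkeeping from the first point can work, here is a minimal sketch. It is not the code from this PR: the class `LineIndex` and its methods are hypothetical names, and it assumes that each token comes with a 0-based character offset into the file contents. It records the start offset of every line (treating CR, LF, and CRLF each as one line break) and then maps an offset to a 1-based line and column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the PR): maps character offsets to 1-based line/column. */
final class LineIndex {
    // Offsets at which each line begins; the first line always starts at 0.
    private final int[] lineStarts;

    LineIndex(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // CRLF counts as a single line break, so skip the LF that follows a CR.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                starts.add(i + 1);
            } else if (c == '\n') {
                starts.add(i + 1);
            }
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Returns the 1-based line number containing the given 0-based offset. */
    int lineOf(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        if (idx < 0) {
            // Not an exact line start: binarySearch returns -(insertionPoint) - 1,
            // and the containing line is the one just before the insertion point.
            idx = -idx - 2;
        }
        return idx + 1;
    }

    /** Returns the 1-based column of the given 0-based offset within its line. */
    int columnOf(int offset) {
        return offset - lineStarts[lineOf(offset) - 1] + 1;
    }
}
```

With an index built once per file, `lineOf` and `columnOf` can be called with each token's start offset, so the file contents only need to be scanned a single time.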

There are likely more differences on other inputs.

A diff for the current output of the javadoc test can be found here: https://gist.github.com/SirYwell/93c22c41af2adfdd549e56564c3260f2/revisions (I pasted the original output and then replaced it with the current output, so the revision shows the difference)

I haven't changed the existing test yet, as I would first like to get some feedback on the current implementation.

(cc @tsaglam, I started working on this as discussed)

@tsaglam
Member

tsaglam commented Aug 26, 2022

@SirYwell just FYI I am out of the office until 5.8, I will have a detailed look at your work after that.

> Without further adjustments, CoreNLP recognizes almost all non-whitespace symbols as tokens. Also, the splitting behavior differs (e.g. 1.5 was tokenized as |1|5 before, now it's just |1.5). As a very simple way to get a somewhat similar behavior, I added the isWord method that checks if the token contains any alphanumeric symbols.

Whitespace should (obviously) not be a token, but I would say numbers could be a single token (e.g. |1.5), as this is an actual improvement.
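
As an illustration of the kind of filter discussed here, the following is a minimal sketch of an `isWord`-style check. It is an assumption about how such a method could look, not the implementation from this PR; the class name `WordFilter` and the helper `keepWords` are made up for the example. A token is kept if it contains at least one letter or digit, so `1.5` stays a single token while pure punctuation is dropped.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch (not the PR's code) of a filter that keeps only word-like tokens. */
final class WordFilter {
    /** Returns true if the token contains at least one letter or digit. */
    static boolean isWord(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Drops every token that consists only of punctuation or other symbols. */
    static List<String> keepWords(List<String> tokens) {
        return tokens.stream().filter(WordFilter::isWord).collect(Collectors.toList());
    }
}
```

For example, `WordFilter.keepWords(List.of("Hello", ",", "1.5", "!"))` keeps only `Hello` and `1.5`. In this sketch, a token such as `<p>` also passes the check because it contains a letter.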

> CoreNLP also seems to recognize XML. In the javadoc test, it therefore creates one token for each tag.

That also sounds like an improvement, I would say we keep that.

@tsaglam tsaglam linked an issue Aug 26, 2022 that may be closed by this pull request
3 tasks
@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change labels Aug 26, 2022
@tsaglam tsaglam added this to the v4.0.0 milestone Aug 26, 2022
@tsaglam tsaglam self-assigned this Aug 26, 2022
Contributor

@JanWittler JanWittler left a comment

Thanks for the improvement. Overall, the code looks good to me; I added some minor remarks.

As a remark for @jplag/maintainer: This PR increases the binary to ~70MB (see #586). We should keep an eye on this.

README.md Show resolved Hide resolved
Member

@tsaglam tsaglam left a comment

Just a few minor comments.

jplag.frontend.text/pom.xml Outdated Show resolved Hide resolved
Member

@tsaglam tsaglam left a comment

Looks good!

@SirYwell
Contributor Author

SirYwell commented Sep 6, 2022

If you're happy with the resulting tokens, I'd update the old test with the new numbers accordingly.

@tsaglam
Member

tsaglam commented Sep 6, 2022

> If you're happy with the resulting tokens, I'd update the old test with the new numbers accordingly.

@SirYwell Yes, I think the resulting tokens look good! You can go ahead!

@tsaglam tsaglam merged commit adca59d into jplag:master Sep 6, 2022
@sonarcloud

sonarcloud bot commented Sep 6, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

92.1% Coverage
0.0% Duplication

@SirYwell SirYwell deleted the feature/corenlp branch September 6, 2022 17:57
Labels
enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Text frontend unnecessarily overcomplicated
4 participants