-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace ANTLR with CoreNLP in text frontend #622
Conversation
@SirYwell just FYI I am out of the office until 5.8, I will have a detailed look at your work after that.
Whitespace should (obviously) not be a token, but I would say numbers could be a single token (e.g.
That also sounds like an improvement, I would say we keep that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the improvement. The code looks overall good to me, I added some minor remarks.
As a remark for @jplag/maintainer: This PR increases the binary to ~70MB (see #586). We should keep an eye on this.
jplag.frontend.text/src/main/java/de/jplag/text/TokenPosition.java
Outdated
Show resolved
Hide resolved
jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java
Outdated
Show resolved
Hide resolved
jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments.
jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java
Outdated
Show resolved
Hide resolved
jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java
Outdated
Show resolved
Hide resolved
jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
If you're happy with the resulting tokens, I'd update the old test with the new numbers accordingly. |
@SirYwell Yes, I think the resulting tokens look goo.! You can move ahead! |
Kudos, SonarCloud Quality Gate passed! |
Targets #556.
With this PR, text is now tokenized by CoreNLP.
This has some more or less obvious effects:
1.5
was tokenized as|1|5
before, now it's just|1.5
). As a very simple way to get a somewhat similar behavior, I added theisWord
method that checks if the token contains any alphanumeric symbols.There are likely more differences on other inputs.
A diff for the current output of the javadoc test can be found here: https://gist.github.com/SirYwell/93c22c41af2adfdd549e56564c3260f2/revisions (I simply pasted the original and edited with the current state)
I didn't change the existing test yet as I first would like to get some feedback on the current implementation.
(cc @tsaglam, I started working on this as discussed)