Replace ANTLR with CoreNLP in text frontend #622

SirYwell · 2022-08-26T16:04:19Z

Targets #556.

With this PR, text is now tokenized by CoreNLP.

This has some more or less obvious effects:

CoreNLP does not provide per-token line numbers. Therefore, we need to go through the file contents ourselves to keep track of line numbers and line break indexes to calculate columns. The behavior of supporting CR, LF, and CRLF stays.
Without further adjustments, CoreNLP recognizes almost all non-whitespace symbols as tokens. Also, the splitting behavior differs (e.g. 1.5 was tokenized as |1|5 before, now it's just |1.5). As a very simple way to get a somewhat similar behavior, I added the isWord method that checks if the token contains any alphanumeric symbols.
CoreNLP also seems to recognize XML. In the javadoc test, it therefore creates one token for each tag.
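
The line and column bookkeeping from the first point can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `LineIndex` and its methods are hypothetical names, and it assumes that each token exposes a character offset into the file contents. CR, LF, and CRLF each count as one line break.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not the PR's actual code): maps character offsets to 1-based line and column. */
final class LineIndex {
    // Offsets at which each line starts; the first line always starts at offset 0.
    private final List<Integer> lineStarts = new ArrayList<>();

    LineIndex(String content) {
        lineStarts.add(0);
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '\r') {
                // Treat CRLF as a single line break by skipping the LF.
                if (i + 1 < content.length() && content.charAt(i + 1) == '\n') {
                    i++;
                }
                lineStarts.add(i + 1);
            } else if (c == '\n') {
                lineStarts.add(i + 1);
            }
        }
    }

    /** Returns the 1-based line number of the given character offset. */
    int lineOf(int offset) {
        int line = 0;
        // Linear scan for clarity; a binary search would suit large files better.
        while (line + 1 < lineStarts.size() && lineStarts.get(line + 1) <= offset) {
            line++;
        }
        return line + 1;
    }

    /** Returns the 1-based column of the given character offset within its line. */
    int columnOf(int offset) {
        return offset - lineStarts.get(lineOf(offset) - 1) + 1;
    }
}
```

Given a token's start offset, `lineOf` and `columnOf` yield the position to store with the token.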

There are likely more differences on other inputs.

A diff for the current output of the javadoc test can be found here: https://gist.github.com/SirYwell/93c22c41af2adfdd549e56564c3260f2/revisions (I simply pasted the original and edited with the current state)

I didn't change the existing test yet as I first would like to get some feedback on the current implementation.

(cc @tsaglam, I started working on this as discussed)

tsaglam · 2022-08-26T17:40:01Z

@SirYwell just FYI I am out of the office until 5.8, I will have a detailed look at your work after that.

> Without further adjustments, CoreNLP recognizes almost all non-whitespace symbols as tokens. Also, the splitting behavior differs (e.g. 1.5 was tokenized as |1|5 before, now it's just |1.5). As a very simple way to get a somewhat similar behavior, I added the isWord method that checks if the token contains any alphanumeric symbols.

Whitespace should (obviously) not be a token, but I would say numbers could be a single token (e.g. |1.5), as this is an actual improvement.
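
For reference, the kind of filter being discussed can be sketched like this. It is only an assumption of what such a check might look like; the actual `isWord` in the PR may differ, and the class name `WordFilter` is hypothetical. A token counts as a word if it contains at least one letter or digit, so `1.5` and an XML tag such as `<p>` are kept, while pure punctuation is dropped.

```java
/** Hypothetical sketch of the word filter described in the PR; the real isWord may differ. */
final class WordFilter {
    /** Returns true if the token text contains at least one letter or digit. */
    static boolean isWord(String token) {
        // Iterate over code points so that characters outside the BMP are handled too.
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }
}
```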

> CoreNLP also seems to recognize XML. In the javadoc test, it therefore creates one token for each tag.

That also sounds like an improvement, I would say we keep that.

JanWittler

Thanks for the improvement. The code looks overall good to me, I added some minor remarks.

As a remark for @jplag/maintainer: This PR increases the binary to ~70MB (see #586). We should keep an eye on this.

README.md

jplag.frontend.text/src/main/java/de/jplag/text/TokenPosition.java

jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java

tsaglam

Just a few minor comments.

jplag.frontend.text/pom.xml

jplag.frontend.text/src/main/java/de/jplag/text/ParserAdapter.java

tsaglam

Looks good!

SirYwell · 2022-09-06T06:48:42Z

If you're happy with the resulting tokens, I'd update the old test with the new numbers accordingly.

tsaglam · 2022-09-06T07:17:56Z

> If you're happy with the resulting tokens, I'd update the old test with the new numbers accordingly.

@SirYwell Yes, I think the resulting tokens look good! You can move ahead!

sonarcloud · 2022-09-06T09:11:04Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

92.1% Coverage
0.0% Duplication

SirYwell added 4 commits August 26, 2022 17:10

replace ANTLR with CoreNLP

4d49bd1

second iteration

6a3001c

add some basic tests

9692767

change entry in README.md

17f4d3d

tsaglam linked an issue Aug 26, 2022 that may be closed by this pull request

Text frontend unnecessarily overcomplicated #556

Closed

3 tasks

tsaglam added the enhancement and minor labels Aug 26, 2022

tsaglam added this to the v4.0.0 milestone Aug 26, 2022

tsaglam self-assigned this Aug 26, 2022

JanWittler reviewed Aug 31, 2022

SirYwell added 3 commits September 2, 2022 15:57

apply spotless

0818aa8

address comments

2977ab3

change start token

a1d28c1

tsaglam requested changes Sep 4, 2022

address comments

4b72c8d

tsaglam approved these changes Sep 6, 2022

change expected sizes in test

a0fbfc8

tsaglam merged commit adca59d into jplag:master Sep 6, 2022

SirYwell deleted the feature/corenlp branch September 6, 2022 17:57
