error indexing a text document with languagetool-wikipedia index #364

milekpl · 2016-02-07T11:12:12Z

When indexing this sentence in a plain-text UTF-8 file:

Dwa dni później, 1 sierpnia, ministrowie finansów Wspólnoty Europejskiej nie mogąc już dłużej walczyć z rynkiem podjęli decyzję o rozszerzeniu z 2, 25 (w przypadku hiszpańskiej pesety i portugalskiego eskudo z 6).o 15 proc. granic wahań kursów w ramach ESW.

I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=0,lastStartOffset=253 for field 'field'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:641)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1142)
    at org.languagetool.dev.index.Indexer.add(Indexer.java:173)
    at org.languagetool.dev.index.Indexer.indexText(Indexer.java:136)
    at org.languagetool.dev.index.Indexer.run(Indexer.java:109)
    at org.languagetool.dev.index.Indexer.main(Indexer.java:73)
    at org.languagetool.dev.wikipedia.Main.main(Main.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
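
For readers unfamiliar with the message: it names three conditions on the offsets of consecutive tokens. The following is only a rough illustration of those conditions in plain Java, not Lucene's actual code; `OffsetRule` and `firstViolation` are made-up names, and each token is modelled as a `{startOffset, endOffset}` pair.

```java
// Hypothetical illustration of the rule quoted in the exception message.
// This is NOT Lucene code; the class and method names are invented here.
final class OffsetRule {

    /**
     * Returns the index of the first token whose offsets break the rule,
     * or -1 if every token is fine. Each token is {startOffset, endOffset}.
     */
    static int firstViolation(int[][] tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.length; i++) {
            int start = tokens[i][0];
            int end = tokens[i][1];
            // non-negative start, end >= start, and start must not go backwards
            if (start < 0 || end < start || start < lastStartOffset) {
                return i;
            }
            lastStartOffset = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A token starting at 253 followed by one reporting startOffset=0,
        // the same numbers as in the exception above.
        int[][] tokens = { {253, 256}, {0, 0} };
        System.out.println(firstViolation(tokens)); // prints 1
    }
}
```

With the values from the message (`startOffset=0` after `lastStartOffset=253`), it is the third condition that fails.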

danielnaber · 2016-02-07T17:08:14Z

That's strange. I copied that text to a plain text file and called org.languagetool.dev.index.Indexer, but I don't get an exception. Maybe you can send me the file?

danielnaber · 2016-02-07T17:58:10Z

I can reproduce the issue with the file you sent. I assume the error is in LanguageToolFilter. Updating to Lucene 5.4.1 doesn't help. The issue is probably that we get fed the input in chunks of 256 chars.
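
To make the chunking guess concrete, here is a toy model of how chunk-relative offsets could go backwards. It is only a sketch under assumptions: it is not LanguageToolFilter or Lucene code, the whitespace tokenizer is a stand-in, and `ChunkOffsets`, `startOffsets` and `goesBackwards` are hypothetical names. Whether this is the actual cause is not established in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the chunking guess; NOT LanguageTool or Lucene code.
final class ChunkOffsets {

    static final int CHUNK = 256; // chunk size mentioned in the comment above

    /**
     * Start offsets of whitespace-separated tokens, computed chunk by chunk.
     * If addChunkBase is false, offsets are relative to the current chunk,
     * which is the suspected mistake being illustrated.
     */
    static List<Integer> startOffsets(String text, boolean addChunkBase) {
        List<Integer> starts = new ArrayList<>();
        for (int base = 0; base < text.length(); base += CHUNK) {
            String chunk = text.substring(base, Math.min(text.length(), base + CHUNK));
            boolean inToken = false;
            for (int i = 0; i < chunk.length(); i++) {
                boolean ws = Character.isWhitespace(chunk.charAt(i));
                if (!ws && !inToken) {
                    starts.add(addChunkBase ? base + i : i);
                }
                inToken = !ws;
            }
        }
        return starts;
    }

    /** True if any start offset is smaller than the one before it. */
    static boolean goesBackwards(List<Integer> starts) {
        for (int i = 1; i < starts.size(); i++) {
            if (starts.get(i) < starts.get(i - 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 60; i++) {
            sb.append("word "); // 300 chars, so the text spans two chunks
        }
        String text = sb.toString();
        System.out.println(goesBackwards(startOffsets(text, false))); // prints true
        System.out.println(goesBackwards(startOffsets(text, true)));  // prints false
    }
}
```

In this model a text that fits into a single chunk never triggers the condition, whichever way the offsets are computed.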

milekpl · 2016-02-07T18:52:39Z

Removing some parts of the string stops the bug from appearing but I have no idea why. It seems really random.

danielnaber · 2016-02-07T18:55:54Z

Shorter than 265 chars: it works, longer: doesn't work.

milekpl · 2016-02-07T19:20:03Z

But with 260 spaces the error no longer occurs. Even 260 spaces with random characters at the end don't trigger it.

milekpl · 2016-02-07T19:21:10Z

… yet 260 hashtags does the trick.

danielnaber · 2016-02-08T09:42:35Z

Should be fixed now.

milekpl · 2016-02-13T10:07:49Z

Hm, I still get it.

danielnaber · 2016-02-13T10:25:28Z

You mean with exactly the same file, the one you sent me via email? I cannot reproduce with that.

milekpl · 2016-02-13T10:29:43Z

No, another one. I just sent you the sample by email (it is from the same corpus).

danielnaber · 2016-02-13T11:01:19Z

Seems my last fix only moved the problem from texts with more than 255 chars to texts with more than 2*255 chars. However, I don't understand what exactly the problem is and I won't be able to spend more time on fixing it for the time being, sorry.

milekpl · 2016-02-13T12:04:51Z

I guess then the easy workaround is to use the fold command. How does the indexer handle newlines? Does it split on one or two?
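
As a sketch of that workaround: the shell route would be something like `fold -s -w 250 input.txt`, and the same idea in Java might look as follows. The width of 250 and the names `Fold` and `fold` are assumptions made for illustration, and the code assumes the input is one long line. Whether the indexer treats a single newline as a boundary is exactly the open question above, so this may or may not avoid the exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics `fold -s`: break long text into pieces
// of at most `width` chars, preferring to break right after a space.
final class Fold {

    static List<String> fold(String text, int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be >= 1");
        }
        List<String> lines = new ArrayList<>();
        int pos = 0;
        while (text.length() - pos > width) {
            int breakAt = pos + width; // hard break if no space is found
            for (int i = pos + width - 1; i > pos; i--) {
                if (text.charAt(i) == ' ') {
                    breakAt = i + 1; // keep the space at the end of the line
                    break;
                }
            }
            lines.add(text.substring(pos, breakAt));
            pos = breakAt;
        }
        if (pos < text.length()) {
            lines.add(text.substring(pos));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Small width just to make the effect visible.
        for (String line : fold("aaa bbb ccc", 5)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Note that this counts Java `char` values, while `fold` may count bytes or columns depending on the implementation, so the two can break UTF-8 text at different places.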

danielnaber · 2016-02-13T14:05:11Z

I don't remember, you'll need to check the source.

milekpl · 2016-02-15T10:02:35Z

Seems to be fixed (roughly).

milekpl assigned danielnaber Feb 7, 2016

danielnaber added a commit that referenced this issue Feb 8, 2016

avoid IllegalArgumentException in Lucene on sentences with >255 chars (…

b563ec7

…#364)

danielnaber closed this as completed Feb 8, 2016

milekpl reopened this Feb 13, 2016

milekpl added a commit that referenced this issue Feb 15, 2016

[pl] add one more sentence breaking rule for dialogs in books; fix th…

fe17d20

…e bug in the text indexer (gihub issue #364)

milekpl closed this as completed Feb 15, 2016
