error indexing a text document with languagetool-wikipedia index #364

Closed
milekpl opened this issue Feb 7, 2016 · 14 comments

@milekpl
Member

milekpl commented Feb 7, 2016

When indexing this sentence in a plain-text UTF-8 file:

Dwa dni później, 1 sierpnia, ministrowie finansów Wspólnoty Europejskiej nie mogąc już dłużej walczyć z rynkiem podjęli decyzję o rozszerzeniu z 2, 25 (w przypadku hiszpańskiej pesety i portugalskiego eskudo z 6).o 15 proc. granic wahań kursów w ramach ESW.

I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=0,lastStartOffset=253 for field 'field'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:641)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1142)
    at org.languagetool.dev.index.Indexer.add(Indexer.java:173)
    at org.languagetool.dev.index.Indexer.indexText(Indexer.java:136)
    at org.languagetool.dev.index.Indexer.run(Indexer.java:109)
    at org.languagetool.dev.index.Indexer.main(Indexer.java:73)
    at org.languagetool.dev.wikipedia.Main.main(Main.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
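
For readers unfamiliar with the message: it names three conditions on token offsets (start is non-negative, end is not before start, and start does not go below the previous token's start). The sketch below restates those conditions in plain Java so a list of offsets can be checked outside Lucene. It is not Lucene's code; the class `OffsetCheck` and its method are hypothetical names, and the logic is inferred only from the wording of the exception.

```java
// Hypothetical helper, NOT Lucene code: it only restates the three
// conditions spelled out in the exception message above.
final class OffsetCheck {

    /**
     * Each row of offsets is {startOffset, endOffset} for one token, in emission order.
     * Returns the index of the first token that breaks a condition, or -1 if none does.
     */
    static int firstViolation(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            // start must be non-negative, end >= start, and start must not go backwards
            if (start < 0 || end < start || start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The last pair mimics the values in the report: startOffset=0 after lastStartOffset=253.
        int[][] offsets = { {0, 3}, {4, 10}, {253, 258}, {0, 0} };
        System.out.println(firstViolation(offsets));
    }
}
```

With the values from the report, the token with `startOffset=0` following one with `startOffset=253` is the one flagged, which matches the "offsets must not go backwards" part of the message.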

@danielnaber
Member

That's strange. I copied that text to a plain-text file and called org.languagetool.dev.index.Indexer, but I don't get an exception. Maybe you can send me the file?

@danielnaber
Member

I can reproduce the issue with the file you sent. I assume the error is in LanguageToolFilter. Updating to Lucene 5.4.1 doesn't help. The problem is probably related to the input being fed to us in chunks of 256 chars.
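
To illustrate the chunking guess, here is one hypothetical way chunked input can make offsets go backwards: if token positions are computed relative to each chunk rather than to the whole text, they restart near 0 at every chunk boundary. This is only a sketch of that general failure mode, not the actual LanguageToolFilter code, and nothing in this thread confirms it is the real cause. `ChunkOffsets`, `startOffsets` and `addChunkBase` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT taken from LanguageTool or Lucene.
final class ChunkOffsets {

    /**
     * Splits text into fixed-size chunks, finds whitespace-separated tokens in each
     * chunk and returns their start offsets. With addChunkBase == false the offsets
     * are chunk-relative (they restart at every chunk); with true they are global.
     * Tokens that straddle a chunk boundary are split in two; ignored here for brevity.
     */
    static List<Integer> startOffsets(String text, int chunkSize, boolean addChunkBase) {
        List<Integer> starts = new ArrayList<>();
        for (int base = 0; base < text.length(); base += chunkSize) {
            String chunk = text.substring(base, Math.min(text.length(), base + chunkSize));
            boolean inToken = false;
            for (int i = 0; i < chunk.length(); i++) {
                boolean ws = Character.isWhitespace(chunk.charAt(i));
                if (!ws && !inToken) {
                    starts.add(addChunkBase ? base + i : i);
                }
                inToken = !ws;
            }
        }
        return starts;
    }

    /** True if no value is smaller than the one before it. */
    static boolean isNonDecreasing(List<Integer> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) < values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "aa bb cc dd";
        System.out.println(startOffsets(text, 6, false)); // chunk-relative offsets
        System.out.println(startOffsets(text, 6, true));  // global offsets
    }
}
```

In this toy setup a text shorter than one chunk never shows the problem, which would be consistent with a length threshold, but that is an observation about the sketch and not a diagnosis.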

@milekpl
Member Author

milekpl commented Feb 7, 2016

Removing some parts of the string stops the bug from appearing but I have no idea why. It seems really random.

@danielnaber
Member

Shorter than 265 chars: it works, longer: doesn't work.

@milekpl
Member Author

milekpl commented Feb 7, 2016

But with 260 spaces the bug no longer occurs. Even 260 spaces with random characters at the end don't trigger it.

@milekpl
Member Author

milekpl commented Feb 7, 2016

… yet 260 hash signs (#) do the trick.

@danielnaber
Member

Should be fixed now.

@milekpl
Member Author

milekpl commented Feb 13, 2016

Hm, I still get it.

@milekpl milekpl reopened this Feb 13, 2016
@danielnaber
Member

You mean with exactly the same file, the one you sent me via email? I cannot reproduce with that.

@milekpl
Member Author

milekpl commented Feb 13, 2016

No, another one. I just sent you the sample by email (it is from the same corpus).

@danielnaber
Member

Seems my last fix only moved the problem from texts with more than 255 chars to texts with more than 2*255 chars. However, I don't understand what exactly the problem is and I won't be able to spend more time on fixing it for the time being, sorry.

@milekpl
Member Author

milekpl commented Feb 13, 2016

I guess then the easy workaround is to use the fold command. How does the indexer handle newlines? Does it split on one or two?
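
For anyone who wants the same effect without the command-line tool, a rough Java equivalent of folding at spaces could look like the sketch below. It assumes the workaround is simply to keep every line under the length where the error starts; a width such as 250 is my assumption, not a value confirmed in this thread. `Fold` is a hypothetical helper, it counts Java `char`s rather than bytes or display columns, and it only helps if the indexer treats lines as separate units, which is still an open question here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper approximating "fold at spaces"; not part of LanguageTool.
final class Fold {

    /**
     * Splits text into pieces of at most width chars. A piece ends right after the
     * last space that fits; if there is no space, the text is cut at width.
     * Concatenating the pieces gives back the original text.
     */
    static List<String> fold(String text, int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be >= 1");
        }
        List<String> lines = new ArrayList<>();
        int pos = 0;
        while (text.length() - pos > width) {
            int limit = pos + width; // exclusive end of the widest allowed piece
            int cut = limit;         // hard cut if no space is found
            for (int i = limit - 1; i > pos; i--) {
                if (text.charAt(i) == ' ') {
                    cut = i + 1;     // keep the space at the end of the piece
                    break;
                }
            }
            lines.add(text.substring(pos, cut));
            pos = cut;
        }
        lines.add(text.substring(pos));
        return lines;
    }

    public static void main(String[] args) {
        for (String line : fold("aaa bbb ccc", 5)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Each returned piece could then be written on its own line before running the indexer.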

@danielnaber
Member

I don't remember, you'll need to check the source.

milekpl added a commit that referenced this issue Feb 15, 2016
@milekpl
Member Author

milekpl commented Feb 15, 2016

Seems to be fixed (roughly).

@milekpl milekpl closed this as completed Feb 15, 2016