FSCrawler can't index .doc or .docx elements #899
Comments
Could you share the document? |
Sure, is there any way to share it privately to you? |
david (at) elastic (dot) co |
Thanks for your response. I reindexed everything in the meantime and I still get the same error, but the document seems to be indexed. Somehow, if I create a .pdf of a .docx, the .docx isn't shown when querying the content. Only the .pdf is shown. Shouldn't the score of the .pdf and the .docx documents be the same? Am I missing something? |
Maybe it's because only the metadata is indexed for the |
Ah I see! So that's why they aren't listed with the same score. Makes sense. |
You can search by the file name I guess. |
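A minimal sketch of what such a file-name search could look like as an Elasticsearch request body. The field name `file.filename` is an assumption about the mapping FSCrawler produces, and `filename_query` is a hypothetical helper, not part of FSCrawler; check your own index mapping before relying on it.

```python
import json


def filename_query(filename, field="file.filename"):
    """Build a match query on the file-name field.

    `field` defaults to "file.filename", which is an assumed field name;
    adjust it to whatever your index mapping actually contains.
    """
    return {"query": {"match": {field: filename}}}


if __name__ == "__main__":
    # Print the JSON body that could be sent to the index's _search endpoint.
    print(json.dumps(filename_query("test.docx"), indent=2))
```

Searching on the file name still finds a document whose text extraction failed, because it does not depend on the extracted content.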
Jep, that's it. I just queried both ( Think I'm gonna write a blogpost afterwards about this topic 😄 |
Docx files are normally indexed like pdf files. But if Tika raised an exception, only metadata might be extracted. I need to see your document to comment further.
Amazing! Please ping me on Twitter when done. I'm also |
Ok, I already sent my document to your email about an hour ago :) Maybe it's in spam? If you want, I can also send you the resulting JSON, but I don't think you will need it.
I will :) |
I did not get it. Did you send it to |
oh damn, i thought it was a typo and sent it to elastic.com 🤦♂ |
Ah, I just checked it. Is there any way I can help you fix the request? I'm looking forward to the new version! |
Hey @dadoonet, any news about the bug? Or an estimate of when the new release will be? :) |
I need to wait for Tika project to release its new version. I don't know when this will happen. |
Hmm damn, do you have any idea about a workaround? |
I guess that you would need to change some libs in the FSCrawler lib dir. |
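Before swapping jars by hand, it may help to see which POI jars the lib dir currently contains. The sketch below groups jar file names by version. It assumes the usual `artifact-version.jar` naming, and the helper names (`group_versions`, `list_jars`) and the `lib` path are hypothetical. That mixed POI versions are behind the warnings in this thread is an assumption, not something confirmed here.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumes jars are named "<artifact>-<version>.jar", with the version
# starting at the first hyphen that is followed by a digit.
_JAR = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def group_versions(jar_names, prefix="poi"):
    """Map each version string to the set of artifacts (starting with
    `prefix`) that carry it. Names that don't match the pattern are skipped."""
    versions = defaultdict(set)
    for name in jar_names:
        m = _JAR.match(name)
        if m and m["artifact"].startswith(prefix):
            versions[m["version"]].add(m["artifact"])
    return dict(versions)


def list_jars(lib_dir):
    """Return the jar file names found directly inside lib_dir."""
    return sorted(p.name for p in Path(lib_dir).glob("*.jar"))


if __name__ == "__main__":
    # "lib" is a placeholder for the FSCrawler lib dir on your machine.
    for version, artifacts in group_versions(list_jars("lib")).items():
        print(version, sorted(artifacts))
```

If more than one version shows up for the POI artifacts, that would be the first thing to align when trying the workaround.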
I'll give it a try. Thanks! |
I'm getting the same error for a few .docx files out of 10. Error: 16:54:29,143 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Clothing industry.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType; |
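When only some files out of a batch fail, it can be handy to pull the affected paths out of the log. Here is a minimal sketch that parses warning lines of the shape quoted in this thread. The regular expression is inferred from those two lines only, so it is an assumption about the log format, and `parse_extraction_warning` is a hypothetical helper, not part of FSCrawler.

```python
import re
from typing import NamedTuple, Optional


class ExtractionWarning(NamedTuple):
    time: str
    level: str
    logger: str
    limit: int   # the number shown in "Failed to extract [N] characters"
    path: str    # file that could not be parsed
    cause: str   # whatever follows "->", e.g. a class or method name


# Pattern inferred from the warning lines quoted in this thread.
_PATTERN = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"Failed to extract \[(?P<limit>\d+)\] characters of text for "
    r"\[(?P<path>.+?)\] -> (?P<cause>.+)$"
)


def parse_extraction_warning(line: str) -> Optional[ExtractionWarning]:
    """Return the parsed fields, or None if the line has a different shape."""
    m = _PATTERN.match(line.strip())
    if m is None:
        return None
    return ExtractionWarning(
        m["time"], m["level"], m["logger"], int(m["limit"]), m["path"], m["cause"]
    )
```

Running this over a whole log file and collecting the `path` and `cause` fields gives a quick list of which documents failed and whether they all failed for the same reason.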
No solution yet. As @dadoonet mentioned, it's a bug. |
Is this the latest release by Tika? https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.0.0-SNAPSHOT/ |
I don't think so. |
FSCrawler is basically a wrapper around Tika; maybe you misunderstood the concept. |
I voted some days ago for the release of Tika 1.24. So it should not be long now. |
I believe they released some changes on 11 March; not sure if they are for the same |
It has just been released today so I'm expecting to upgrade in the next 24h |
Describe the bug
Whenever I try to index .doc or .docx files I get a warning and the files don't get indexed.
07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor
It all works fine with .pdf documents, so I expected the same with Word documents.
Versions:
EDIT:
So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?