
FSCrawler can't index .doc or .docx elements #899

Closed
LaaKii opened this issue Feb 12, 2020 · 27 comments · Fixed by #925
Labels: bug For confirmed bugs

LaaKii commented Feb 12, 2020

Describe the bug

Whenever I try to index .doc or .docx files I get a warning and the files don't get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

Everything works fine with .pdf documents, so I expected the same for Word documents.

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

@dadoonet (Owner)

Could you share the document?


LaaKii commented Feb 12, 2020

Sure, is there any way to share it privately to you?

@dadoonet (Owner)

david (at) elastic (dot) co

@dadoonet dadoonet added the check_for_bug Needs to be reproduced label Feb 12, 2020

LaaKii commented Feb 12, 2020

Thanks for your response. I reindexed everything in the meantime and I still get the same error, but the document seems to be indexed. Somehow, if I create a .pdf from a .docx, the .docx won't be shown when querying the content; only the .pdf version is shown. Shouldn't the scores of the .pdf and .docx documents be the same? Am I missing something?

@dadoonet (Owner)

Maybe it's because only the metadata were indexed for the .docx file, not the content itself.
Could you share the JSON document that was generated for the .docx file?


LaaKii commented Feb 12, 2020

Ah I see! So that's why they aren't listed with the same score. Makes sense.
Uhm, how do I see the generated JSON in kibana? I indexed about 12000 documents.

@dadoonet (Owner)

You can search by the file name I guess.
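
For example, in Kibana's Dev Tools console (a sketch assuming FSCrawler's default mapping, where the filename is stored under `file.filename`; the index name `docs` is a placeholder for your FSCrawler job name):

```json
GET docs/_search
{
  "query": {
    "match": {
      "file.filename": "test.docx"
    }
  }
}
```

The `_source` of each hit is the JSON document FSCrawler generated; a successfully parsed file should have a `content` field alongside `meta` and `file`.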


LaaKii commented Feb 12, 2020

Yep, that's it. I queried both (the .docx and the generated .pdf) and analyzed the resulting JSON documents. The .docx has no content section, but the .pdf does. So only the metadata of .docx documents are indexed. Is there any way to adjust FSCrawler to also index the content of a Word document?

Think I'm gonna write a blogpost afterwards about this topic 😄

@dadoonet (Owner)

Is there any way to adjust FSCrawler to also index the content of a Word document?

Docx files are normally indexed like pdf files. But if Tika raises an exception, only the metadata might be extracted. I need to see your document to comment further.

Think I'm gonna write a blogpost afterwards about this topic 😄

Amazing! Please ping me on Twitter when done. I'm also dadoonet on Twitter.


LaaKii commented Feb 12, 2020

Docx files are normally indexed like pdf files. But if Tika raises an exception, only the metadata might be extracted. I need to see your document to comment further.

OK, I already sent my document to your email about an hour ago :) Maybe it's in spam? If you want, I can also send you the resulting JSON, but I don't think you'll need it.

Amazing! Please ping me on Twitter when done. I'm also dadoonet on Twitter.

I will :)

@dadoonet (Owner)

I did not get it. Did you send it to david @ elastic.co ?


LaaKii commented Feb 12, 2020

Oh damn, I thought it was a typo and sent it to elastic.com 🤦‍♂

@dadoonet (Owner)

Thank you for the file.
That's definitely a bug in FSCrawler, introduced by #855.

To fix it, I "just" need to pull in PR #865, but there's still "a blocker" in that one, as I have seen a regression. I need to revisit it at some point.

@dadoonet dadoonet added bug For confirmed bugs and removed check_for_bug Needs to be reproduced labels Feb 12, 2020
@dadoonet dadoonet linked a pull request Feb 12, 2020 that will close this issue
@dadoonet dadoonet added this to the 2.7 milestone Feb 12, 2020
@dadoonet dadoonet self-assigned this Feb 12, 2020

LaaKii commented Feb 13, 2020

Ah, I just checked it.
First of all, thanks for the effort, and for creating FSCrawler in the first place.

Is there any way I can help you fix this? I'm looking forward to the new version!


LaaKii commented Feb 19, 2020

Hey @dadoonet, any news about the bug? Or an estimate of when the new release will be? :)

@dadoonet (Owner)

I need to wait for the Tika project to release its new version. I don't know when that will happen.


LaaKii commented Feb 19, 2020

Mhh, damn. Do you have any idea for a workaround?

@dadoonet (Owner)

I guess you would need to swap some libs in the FSCrawler lib dir.
Or revert #855 and compile the project again.
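
If you want to try the revert-and-rebuild route, a rough sketch (assuming a standard Maven build of FSCrawler; `<commit-of-855>` is a placeholder for the merge commit of #855, which you would need to look up):

```shell
git clone https://github.com/dadoonet/fscrawler.git
cd fscrawler
git revert <commit-of-855>       # undo the change that introduced the regression
mvn clean package -DskipTests    # build the distribution locally
```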


LaaKii commented Feb 19, 2020

I'll give it a try. Thanks!

@AncKal94

I'm getting the same error for a few .docx files out of 10.
Is there any solution for this?

error:

16:54:29,143 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Clothing industry.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:29,700 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Compressed natural gas.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:30,540 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Emerging technologies.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:48,478 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Local self-government in India.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:48,975 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Reserve Bank of India.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:49,314 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\School counselor.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;


LaaKii commented Mar 17, 2020

No solution yet. As @dadoonet mentioned, it's a bug.
As a temporary workaround, you could store them as .pdf.
PDF works perfectly fine.
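
For batch conversion, LibreOffice's headless mode can do this from the command line (a sketch; the output directory and glob are examples):

```shell
# convert every .docx in the current directory to .pdf under ./pdf
soffice --headless --convert-to pdf --outdir pdf *.docx
```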

@AncKal94


LaaKii commented Mar 17, 2020

I don't think so.
What made you think that?


LaaKii commented Mar 17, 2020

FSCrawler is basically a wrapper around Tika; maybe you misunderstood the concept.

@dadoonet (Owner)

I voted some days ago for the release of Tika 1.24. So it should not be long now.

@AncKal94

I believe they released some changes on 11 March; I'm not sure if they address this issue.

@dadoonet (Owner)

It was just released today, so I'm expecting to upgrade in the next 24h.

@dadoonet dadoonet linked a pull request Mar 18, 2020 that will close this issue
@mergify mergify bot closed this as completed in #925 Mar 18, 2020