
FSCrawler can't index .doc or .docx elements #899

Closed
LaaKii opened this issue Feb 12, 2020 · 27 comments · Fixed by #925
Labels: bug For confirmed bugs

LaaKii commented Feb 12, 2020

Describe the bug

Whenever I try to index .doc or .docx files I get a warning and the files don't get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

Everything works fine with .pdf documents, so I expected the same for Word documents.

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

@dadoonet (Owner)

Could you share the document?


LaaKii commented Feb 12, 2020

Sure, is there any way to share it privately to you?

@dadoonet (Owner)

david (at) elastic (dot) co

@dadoonet dadoonet added the check_for_bug Needs to be reproduced label Feb 12, 2020

LaaKii commented Feb 12, 2020

Thanks for your response. I reindexed everything in the meantime and I still get the same error, but the document seems to be indexed. Somehow, if I create a .pdf from a .docx, the .docx won't be shown when querying the content; only the .pdf version is shown. Shouldn't the scores of the .pdf and .docx documents be the same? Am I missing something?

@dadoonet (Owner)

Maybe it's because only the metadata were indexed for the .docx file, not the content itself.
Could you share the JSON document that was generated for the .docx file?


LaaKii commented Feb 12, 2020

Ah I see! So that's why they aren't listed with the same score. Makes sense.
Uhm, how do I see the generated JSON in kibana? I indexed about 12000 documents.

@dadoonet (Owner)

You can search by the file name I guess.
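
For example, in Kibana's Dev Tools console (a sketch assuming FSCrawler's default mapping, where the filename is stored under `file.filename`; the index name `docs` is a placeholder for your FSCrawler job name):

```json
GET docs/_search
{
  "query": {
    "match": {
      "file.filename": "test.docx"
    }
  }
}
```

The `_source` of each hit is the JSON document FSCrawler generated; a successfully parsed file should have a `content` field alongside `meta` and `file`.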


LaaKii commented Feb 12, 2020

Yep, that's it. I queried both (the .docx and the generated .pdf) and analyzed the resulting JSON documents. The .docx has no content section, but the .pdf does. So only the metadata of .docx documents are indexed. Is there any way to adjust FSCrawler to also index the content of a Word document?

Think I'm gonna write a blogpost afterwards about this topic 😄

@dadoonet (Owner)

Is there any way to adjust FSCrawler to also index the content of a Word document?

Docx files are normally indexed like pdf files. But if Tika raises an exception, only the metadata might be extracted. I need to see your document to comment further.

Think I'm gonna write a blogpost afterwards about this topic 😄

Amazing! Please ping me on Twitter when done. I'm also dadoonet on Twitter.


LaaKii commented Feb 12, 2020

Docx files are normally indexed like pdf files. But if Tika raises an exception, only the metadata might be extracted. I need to see your document to comment further.

OK, I already sent my document to your email about an hour ago :) Maybe it's in spam? If you want, I can also send you the resulting JSON, but I don't think you'll need it.

Amazing! Please ping me on Twitter when done. I'm also dadoonet on Twitter.

I will :)

@dadoonet (Owner)

I did not get it. Did you send it to david @ elastic.co ?


LaaKii commented Feb 12, 2020

Oh damn, I thought it was a typo and sent it to elastic.com 🤦‍♂

@dadoonet (Owner)

Thank you for the file.
That's definitely a bug in FSCrawler, introduced by #855.

To fix it, I "just" need to pull in PR #865, but there's still "a blocker" in that one, as I have seen a regression. I need to revisit it at some point.

@dadoonet dadoonet added bug For confirmed bugs and removed check_for_bug Needs to be reproduced labels Feb 12, 2020
@dadoonet dadoonet linked a pull request Feb 12, 2020 that will close this issue
@dadoonet dadoonet added this to the 2.7 milestone Feb 12, 2020
@dadoonet dadoonet self-assigned this Feb 12, 2020

LaaKii commented Feb 13, 2020

Ah, I just checked it.
First of all, thanks for the effort, and for creating FSCrawler in the first place.

Is there any way I can help you fix this? I'm looking forward to the new version!


LaaKii commented Feb 19, 2020

Hey @dadoonet, any news about the bug? Or an estimate of when the new release will be? :)

@dadoonet (Owner)

I need to wait for the Tika project to release its new version. I don't know when that will happen.


LaaKii commented Feb 19, 2020

Mhh, damn. Do you have any idea for a workaround?

@dadoonet (Owner)

I guess you would need to swap some libs in the FSCrawler lib dir.
Or revert #855 and compile the project again.
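
If you want to try the revert-and-rebuild route, a rough sketch (assuming a standard Maven build of FSCrawler; `<commit-of-855>` is a placeholder for the merge commit of #855, which you would need to look up):

```shell
git clone https://github.com/dadoonet/fscrawler.git
cd fscrawler
git revert <commit-of-855>       # undo the change that introduced the regression
mvn clean package -DskipTests    # build the distribution locally
```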


LaaKii commented Feb 19, 2020

I'll give it a try. Thanks!

@AncKal94

I'm getting the same error for a few .docx files out of 10.
Is there any solution for this?

error:

16:54:29,143 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Clothing industry.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:29,700 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Compressed natural gas.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:30,540 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Emerging technologies.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:48,478 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Local self-government in India.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:48,975 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\Reserve Bank of India.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;
16:54:49,314 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\Elastic Stack\sample_files\School counselor.docx] -> org.apache.poi.hwmf.record.HwmfRecord.getRecordType()Lorg/apache/poi/hwmf/record/HwmfRecordType;


LaaKii commented Mar 17, 2020

No solution yet. As @dadoonet mentioned, it's a bug.
As a temporary workaround, you could store them as .pdf.
PDF works perfectly fine.
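
For batch conversion, LibreOffice's headless mode can do this from the command line (a sketch; the output directory and glob are examples):

```shell
# convert every .docx in the current directory to .pdf under ./pdf
soffice --headless --convert-to pdf --outdir pdf *.docx
```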

@AncKal94


LaaKii commented Mar 17, 2020

I don't think so.
What made you think that?


LaaKii commented Mar 17, 2020

FSCrawler is basically a wrapper around Tika; maybe you misunderstood the concept.

@dadoonet (Owner)

I voted some days ago for the release of Tika 1.24. So it should not be long now.

@AncKal94

I believe they released some changes on 11 March; I'm not sure if they address this issue.

@dadoonet (Owner)

It was just released today, so I'm expecting to upgrade in the next 24h.

@dadoonet dadoonet linked a pull request Mar 18, 2020 that will close this issue
@mergify mergify bot closed this as completed in #925 Mar 18, 2020