Japanese versions of Arabic numerals appear to be removed by textract #145

Closed · tansaku opened this issue Mar 13, 2018 · 4 comments

@tansaku commented Mar 13, 2018

Textract installed great on OS X 10.12.6 for me, and is working fine for extracting English text from .doc files.

However, I've noticed a problem with the Japanese versions of Arabic numerals. In both Japanese .doc and text files run through textract, the main Japanese text comes through fine, but where there were Arabic numerals (e.g. 2018) in fullwidth Japanese format, they are removed from the output.

２０１８年

becomes

年

Has anyone experienced anything similar?
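A minimal reproduction sketch (the file name is hypothetical; `fromFileWithPath` is textract's standard entry point):

```js
var textract = require('textract');

// japanese-sample.doc contains the line ２０１８年 (fullwidth digits + kanji)
textract.fromFileWithPath('japanese-sample.doc', function (error, text) {
  if (error) { return console.error(error); }
  console.log(text); // prints 年 - the fullwidth ２０１８ has been stripped
});
```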

@dbashford (Owner) commented

I can take a look at this. It is about time to cycle through the latest asks/bugs and get a new version out. =)

@DarrenCook commented

This happens on Linux, too, with antiword. It happens with all of .doc, .odt, and .pdf. (In all my tests, the test files were made with LibreOffice, loading from a UTF-8 plain text file.)

Double-width (fullwidth) parentheses are also being lost: （ and ） (U+FF08, U+FF09)
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms#In_Unicode

In each case, a run of one or more of these characters is being replaced with a single space.

However, ten 、 (U+FF64) and maru 。 (U+FF61) are coming through fine, and they are in that same Unicode block.
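A quick check (assuming, per the regex identified later in this thread, that the whitelist's CJK range tops out at U+D7FF) shows that all of these characters sit above that bound, so the surviving 、 and 。 must be matched by some other part of the whitelist:

```js
// All five characters live in the Halfwidth and Fullwidth Forms block
// (U+FF00-U+FFEF), well above the \u2C00-\uD7FF whitelist range.
['（', '）', '２', '。', '、'].forEach(function (ch) {
  var cp = ch.charCodeAt(0);
  console.log(ch, 'U+' + cp.toString(16).toUpperCase(), cp > 0xD7FF); // true for all
});
```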

@tansaku (Author) commented Mar 20, 2018

I'm noticing that antiword on OS X seems to handle Japanese zenkaku characters (numbers and parentheses) correctly - it would be cool to be able to override which extractor is used on which platform ...

@DarrenCook commented

The bug is in the regexes at the top of lib/extract.js:
https://github.com/dbashford/textract/blob/master/lib/extract.js#L13

If I change \u2C00-\uD7FF to \u2C00-\uFFFF (in both regexes), the zenkaku problem described in this issue goes away.
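A simplified sketch of the effect (the real whitelists in lib/extract.js contain many more ranges; only the \u2C00-\uD7FF bound matters here):

```js
// Simplified stand-ins for the two whitelist regexes: runs of characters
// outside the class are collapsed to a single space.
var BEFORE = /[^A-Za-z0-9\u2C00-\uD7FF \n\r.,;:!?()'"-]+/g;
var AFTER  = /[^A-Za-z0-9\u2C00-\uFFFF \n\r.,;:!?()'"-]+/g;

var input = '（２０１８年）';
console.log(input.replace(BEFORE, ' ').trim()); // '年' - fullwidth chars stripped
console.log(input.replace(AFTER, ' ').trim());  // '（２０１８年）' - preserved
```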

I get no new test failures. (I'll describe the test failures I do get in a separate GitHub issue, in a moment.)

That is a band-aid fix, though. I think the deeper problem is that the code uses a whitelist to approximate "remove nothing except whitespace", when it should use a blacklist that actually says "remove whitespace"?

Are there actually any more newline characters than the ones listed here: https://en.wikipedia.org/wiki/Newline#Unicode

(Just checking some of my own code that does something similar: I treat \u0085, \u2028 and \u2029 the same as \n, and I also convert \u2009, narrow space, to a normal ASCII space.)

Is the intention of that code also to remove control characters? See: https://en.wikipedia.org/wiki/Unicode_control_characters and https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set
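A rough sketch of what that blacklist approach might look like (the exact character sets are my assumption, built from the two Wikipedia lists above):

```js
// Newline-like characters (NEL, LS, PS) normalised to \n,
// per https://en.wikipedia.org/wiki/Newline#Unicode
var NEWLINES = /[\u0085\u2028\u2029]/g;
// C0 and C1 control characters (excluding \t, \n, \r) dropped outright
var CONTROLS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]+/g;
// Runs of horizontal whitespace (including NBSP, thin/narrow spaces,
// ideographic space) collapsed to one ASCII space
var SPACES = /[ \t\u00A0\u2000-\u200A\u202F\u3000]+/g;

function cleanText(text) {
  return text
    .replace(NEWLINES, '\n')
    .replace(CONTROLS, '')
    .replace(SPACES, ' ');
}
```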
