Japanese versions of Arabic numerals appear to be removed by textract #145

Closed · tansaku opened this issue Mar 13, 2018 · 4 comments

@tansaku commented Mar 13, 2018

Textract installed great on OS X 10.12.6 for me, and is working fine for extracting English text from .doc files.

However, I've noticed a problem with the Japanese versions of Arabic numerals. In both Japanese .doc and text files run through textract, the main Japanese text comes through fine, but where there were Arabic numerals (e.g. 2018) in fullwidth Japanese format, they are removed from the output.

２０１８年

becomes

年

Has anyone experienced anything similar?
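A minimal reproduction sketch (the file name is hypothetical; `fromFileWithPath` is textract's standard entry point):

```js
var textract = require('textract');

// japanese-sample.doc contains the line ２０１８年 (fullwidth digits + kanji)
textract.fromFileWithPath('japanese-sample.doc', function (error, text) {
  if (error) { return console.error(error); }
  console.log(text); // prints 年 - the fullwidth ２０１８ has been stripped
});
```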

@dbashford (Owner) commented

I can take a look at this. It is about time to cycle through the latest asks/bugs and get a new version out. =)

@DarrenCook commented

This happens on Linux, too, with antiword. It happens with all of .doc, .odt, and .pdf. (In all my tests, the test files were made with LibreOffice, loading from a UTF-8 plain text file.)

Double-width (fullwidth) parentheses are also being lost: （ and ） (U+FF08, U+FF09)
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms#In_Unicode

In each case, a run of one or more of these characters is being replaced with a single space.

However, ten 、 (U+FF64) and maru 。 (U+FF61) are coming through fine, and they are in that same Unicode block.
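A quick check (assuming, per the regex identified later in this thread, that the whitelist's CJK range tops out at U+D7FF) shows that all of these characters sit above that bound, so the surviving 、 and 。 must be matched by some other part of the whitelist:

```js
// All five characters live in the Halfwidth and Fullwidth Forms block
// (U+FF00-U+FFEF), well above the \u2C00-\uD7FF whitelist range.
['（', '）', '２', '。', '、'].forEach(function (ch) {
  var cp = ch.charCodeAt(0);
  console.log(ch, 'U+' + cp.toString(16).toUpperCase(), cp > 0xD7FF); // true for all
});
```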

@tansaku (Author) commented Mar 20, 2018

I'm noticing that antiword on OS X seems to handle Japanese zenkaku characters (numbers and parentheses) correctly - it would be cool to be able to override which extractor is used on which platform ...

@DarrenCook commented

The bug is in the regexes at the top of lib/extract.js:
https://github.com/dbashford/textract/blob/master/lib/extract.js#L13

If I change \u2C00-\uD7FF to \u2C00-\uFFFF (in both regexes), the zenkaku problem described in this issue goes away.
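A simplified sketch of the effect (the real whitelists in lib/extract.js contain many more ranges; only the \u2C00-\uD7FF bound matters here):

```js
// Simplified stand-ins for the two whitelist regexes: runs of characters
// outside the class are collapsed to a single space.
var BEFORE = /[^A-Za-z0-9\u2C00-\uD7FF \n\r.,;:!?()'"-]+/g;
var AFTER  = /[^A-Za-z0-9\u2C00-\uFFFF \n\r.,;:!?()'"-]+/g;

var input = '（２０１８年）';
console.log(input.replace(BEFORE, ' ').trim()); // '年' - fullwidth chars stripped
console.log(input.replace(AFTER, ' ').trim());  // '（２０１８年）' - preserved
```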

I get no new test failures. (I'll describe the test failures I do get in a separate GitHub issue, in a moment.)

That is a band-aid fix, though. I think the deeper problem is that the code uses a whitelist to approximate "remove nothing except whitespace", when it should use a blacklist that actually says "remove whitespace"?

Are there actually any more newline characters than the ones listed here: https://en.wikipedia.org/wiki/Newline#Unicode

(Just checking some of my own code that does something similar: I treat \u0085, \u2028 and \u2029 the same as \n, and I also convert \u2009, narrow space, to a normal ASCII space.)

Is the intention of that code also to remove control characters? See: https://en.wikipedia.org/wiki/Unicode_control_characters and https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set
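A rough sketch of what that blacklist approach might look like (the exact character sets are my assumption, built from the two Wikipedia lists above):

```js
// Newline-like characters (NEL, LS, PS) normalised to \n,
// per https://en.wikipedia.org/wiki/Newline#Unicode
var NEWLINES = /[\u0085\u2028\u2029]/g;
// C0 and C1 control characters (excluding \t, \n, \r) dropped outright
var CONTROLS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]+/g;
// Runs of horizontal whitespace (including NBSP, thin/narrow spaces,
// ideographic space) collapsed to one ASCII space
var SPACES = /[ \t\u00A0\u2000-\u200A\u202F\u3000]+/g;

function cleanText(text) {
  return text
    .replace(NEWLINES, '\n')
    .replace(CONTROLS, '')
    .replace(SPACES, ' ');
}
```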
