Problems with cyrillic symbols #65

Closed
TsvetkovAV opened this issue Nov 11, 2015 · 10 comments

Comments

@TsvetkovAV

When I run a JS file with Node.js that has the following content (for example, against a .doc file):
    var textract = require('textract');

    textract.fromFileWithPath('test.doc', function( error, text ) {
      if (error) throw error;
      console.log(text);
    });

With a .doc file, all Cyrillic symbols come out unreadable (but when I run catdoc directly, I can read them),
and with a .docx file all Cyrillic symbols are removed.

@dbashford
Owner

?

@TsvetkovAV
Author

Sorry, pressed 'Ctrl+Enter'

@dbashford
Owner

Can you provide a sample, or give me an idea of what I should test? I feel like this problem was solved a while ago, but there may have been a regression.

@TsvetkovAV
Author

I have now found that both file types have the same problem.
All Cyrillic characters are removed; only Latin characters, punctuation symbols and numbers are displayed.

@TsvetkovAV
Author

test.docx
For example

And result:
c:\JSTest\catdoc>node test
textract not ready, retrying in .5 seconds
INFO: 'pdftotext' does not appear to be installed, so textract will be unable to extract PDFs. http://www.foolabs.com/xpdf/
INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs.
, , - , - . . , , . 3.5. , , : ; , , , ; , ; , , , ( , , ); , : , , , , ; - , :
, , , , , ; : , , -, - , .., , ; ; .; ( , ; .; , ..); ; ; ; . . 4. , , , , , ,
, . 4.1. , . : BCG < >; /, . , , / (< >), - ; ; . 4.2. . 1 : < - >. 2 - < >; < >
; < >; < >. 3 - < >, , . . 5. - : , , ( , , ); < > < >, . 5.1. , , . , , , ( ),
, , , , , . , , , , . : , ; , , ; , ; , , ; , , ; , , , , ; , . , , . . 1. , . ,
. 2. , , : , ( , , , , ), , , . 3. . 4. . . . 5. : , , , , , .. . , , (, ) ().
5.2. , - . , (R) (W) ; W, , ( ); RM, RI - . . 3. , W. , . W, , R; C -, N, ( ); ,
W C , , P, -; R, , ; , R W, L, - L . , W R , , . , , R, W. 2011 DOCX

@TsvetkovAV
Author

If the file contains only Latin characters, textract works correctly, as expected.
Could you help me with this problem?
Here is what I found.
If I comment out the following lines:
    if ( options.preserveLineBreaks ) {
      // text = text.replace( WHITELIST_PRESERVE_LINEBREAKS, ' ' );
    } else {
      // text = text.replace( WHITELIST_STRIP_LINEBREAKS, ' ' );
    }
in your code in textract\lib\extract.js, then the text formatting (paragraphs, spaces) comes back, and the previously removed Cyrillic characters come back too, but as question marks ('?').
This is true for a .doc file with the same text as test.docx (attached in my previous comment). For the .docx file only the text formatting changes; the removed Cyrillic characters stay removed.
Thank you.
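
A small diagnostic can help tell the two symptoms described above apart: characters that were removed versus characters that were replaced with '?'. The following is only a sketch; `summarizeScripts` is a hypothetical helper, not part of textract, and it assumes Cyrillic text falls in the basic Unicode Cyrillic block (U+0400 to U+04FF).

```javascript
// Hypothetical helper for debugging extraction output; not part of textract.
// Counts Cyrillic letters, Latin letters, and literal question marks.
function summarizeScripts(text) {
  // String.prototype.match with a global regex returns null on no match.
  const count = (re) => (text.match(re) || []).length;
  return {
    cyrillic: count(/[\u0400-\u04FF]/g), // basic Cyrillic block only (assumption)
    latin: count(/[A-Za-z]/g),
    questionMarks: count(/\?/g),
  };
}

// Intact text: Cyrillic letters are present.
console.log(summarizeScripts('Привет, world'));
// Lossy decoding: Cyrillic count is zero and question marks appear instead.
console.log(summarizeScripts('??????, world'));
```

If the Cyrillic count is zero and the question-mark count is high, the characters were probably lost while decoding; if both are zero, something is more likely stripping them afterwards.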

@dbashford
Owner

This should be good to go. It was only happening for .docx and for .odtx, as I had an extra text-stripping regex that wasn't updated to include all the non-Latin characters.

Try 1.2.0.
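
To illustrate the kind of problem described here, the sketch below shows how a whitelist-style stripping regex that only knows Latin characters wipes out Cyrillic text, and how extending the character class keeps it. The two patterns and the `stripWith` helper are hypothetical and written for illustration only; they are not textract's actual `WHITELIST_*` regexes.

```javascript
// Hypothetical whitelist patterns for illustration; NOT textract's real regexes.
// Each matches every character that is NOT on the whitelist.
const LATIN_ONLY = /[^A-Za-z0-9 \t.,;:!?()<>\/-]/g;
const LATIN_AND_CYRILLIC = /[^A-Za-z0-9\u0400-\u04FF \t.,;:!?()<>\/-]/g;

// Replace non-whitelisted characters with a space, then collapse space runs.
function stripWith(pattern, text) {
  return text.replace(pattern, ' ').replace(/ {2,}/g, ' ').trim();
}

const sample = 'Отчёт BCG, раздел 4.1.';

// Only Latin letters, digits and punctuation survive.
console.log(stripWith(LATIN_ONLY, sample));
// The Cyrillic words are preserved.
console.log(stripWith(LATIN_AND_CYRILLIC, sample));
```

The first call leaves only Latin letters, digits and punctuation, which matches the symptom in the output pasted earlier in this thread.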

@Ethaan

Ethaan commented Jan 28, 2016

I'm having the exact same issue.

(screenshot: screen shot 2016-01-28 at 4 48 04 pm)

This is my config.

    // `base64` holds the base64-encoded contents of the .doc file
    var buffer = Buffer.from(base64, 'base64'),
        type = 'application/msword',
        config = {
          preserveLineBreaks: true
        };
    textract.fromBufferWithMime(type, buffer, config, function(error, text) {
      if (error) {
        console.log(error);
        return error;
      } else {
        return text;
      }
    });

NOTE: This only happens with .doc files. I'm also on a Mac.
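
As a side note on the snippet above: values returned from inside a Node-style callback generally do not reach the surrounding code, so the extracted text is easier to use if it is handed to a callback. Below is a minimal sketch under that assumption. `extractDocFromBase64` and its `textractLike` parameter are hypothetical names; the only textract call it relies on is `fromBufferWithMime(type, buffer, config, callback)` in the shape already shown in this comment.

```javascript
// Sketch only. `textractLike` is any object exposing fromBufferWithMime with
// the call shape used above; in real use you would pass require('textract').
function extractDocFromBase64(textractLike, base64, done) {
  const buffer = Buffer.from(base64, 'base64');
  const config = { preserveLineBreaks: true };

  textractLike.fromBufferWithMime('application/msword', buffer, config, function (error, text) {
    if (error) {
      // Forward the error to the caller instead of returning it.
      return done(error);
    }
    done(null, text);
  });
}

// Example call (commented out because it needs textract and a real document):
// extractDocFromBase64(require('textract'), base64, function (error, text) {
//   if (error) return console.log(error);
//   console.log(text);
// });
```

Passing the extractor in as a parameter is just a convenience for trying the wrapper with a stub; it does not change how textract itself is called.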

@dbashford
Owner

Can you give me a sample doc? (And maybe open a new issue with it to track?)

@Ethaan

Ethaan commented Jan 29, 2016

Sure.

Here is #71

Thanks
