readabilitytests problem with utf-8 characters #17

brendanwood · 2015-01-29T20:05:09Z

I ran into a problem applying the readability tests to a block of text that contains non-ASCII characters encoded as UTF-8 (fancy quotes).

Sample text: http://pastebin.com/eRKGMGYn

Test script: http://pastebin.com/aE2DaRvk

I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it seems to fail regardless of whether I pass a bytestring or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it takes unicode instead of bytes, and that seems to have fixed the problem for me.

My fork: https://github.com/priceonomics/nltk_contrib

Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.

kmike · 2015-01-29T20:57:42Z

Switching ReadabilityTool to unicode is the way to go, and your changes look good. See also: #11.
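For anyone hitting the same thing, here is a minimal sketch of the general idea: decode bytes to unicode once at the boundary, then work with text everywhere else. It is written for Python 3, and `to_text` is a hypothetical helper made up for illustration. It is not the actual change in the fork, and it does not call any nltk_contrib API.

```python
def to_text(data, encoding="utf-8"):
    """Return `data` as a unicode string, decoding it first if it is bytes.

    Hypothetical helper for illustration; the UTF-8 default is an assumption.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# A sentence with curly ("fancy") quotes, as it would arrive from a UTF-8 file.
raw = "\u201cHello,\u201d she said.".encode("utf-8")
text = to_text(raw)

# Each curly quote is three bytes in UTF-8 but a single code point,
# so byte-level and character-level counts disagree.
print(len(raw), len(text))
```

Any character-based counting gives different numbers on the byte string and on the decoded text, which is why normalizing to unicode before analysis avoids this class of problem.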

kmike · 2015-01-29T21:27:43Z

Thanks! I'll close this ticket, but leave #11 because it seems there are other unicode-related issues which could affect ReadabilityTool.

brendanwood mentioned this issue Jan 29, 2015

Modified readability.textanalyzer to use unicode internally. #18

Merged

kmike closed this as completed Jan 29, 2015
