readabilitytests problem with utf-8 characters #17

Closed
brendanwood opened this issue Jan 29, 2015 · 2 comments

@brendanwood
Contributor

I ran into a problem trying to apply the readability tests to a block of text with some UTF-8 characters (fancy quotes).

Sample text: http://pastebin.com/eRKGMGYn

Test script: http://pastebin.com/aE2DaRvk

I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it fails regardless of whether I pass a bytestring or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it takes unicode instead of bytes, which seems to have fixed the problem for me.

My fork: https://github.com/priceonomics/nltk_contrib

Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.
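
To illustrate why text with curly quotes trips up byte-oriented processing, here is a minimal sketch in Python 3 syntax. It is not the code from textanalyzer.py or from the fork; `to_text`, `count_words`, and the sample sentence are hypothetical names and data made up for the example. The idea is simply to decode to a unicode string once at the boundary and do all counting on that.

```python
import re

# Hypothetical sample with curly quotes (U+201C, U+201D) and a curly apostrophe (U+2019).
sample = "\u201cHello,\u201d she said. \u201cIt\u2019s fine.\u201d"
raw = sample.encode("utf-8")  # the same text as a UTF-8 bytestring


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return a unicode string, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def count_words(text):
    """Count words in a unicode string, keeping apostrophes inside a word."""
    return len(re.findall(r"\w+(?:['\u2019]\w+)*", text))


text = to_text(raw)

# Each curly quote takes three bytes in UTF-8, so byte length and character length differ.
print(len(raw), len(text))
print(count_words(text))
```

Anything that counts characters or splits words on the raw bytes sees each curly quote as three separate bytes, which is why working on the decoded string throughout is the safer design.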

@kmike
Member

kmike commented Jan 29, 2015

Switching ReadabilityTool to unicode is the way to go, and your changes look good. See also: #11.

@kmike
Member

kmike commented Jan 29, 2015

Thanks! I'll close this ticket, but leave #11 because it seems there are other unicode-related issues which could affect ReadabilityTool.

kmike closed this as completed Jan 29, 2015