
Conversation

@palkeo commented Jul 17, 2014

Hi,

I'm very sad this library is not ported to Python 3.
I have made a port that is quite different from @Ftzeng's: I removed all the encoding stuff, and the tests still seem to pass with Python 2.7 and Python 3. I use requests for downloading the web pages and detecting the correct encoding.
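For example, this is roughly how I use it in the port (a minimal sketch; the URL is a placeholder, and `Document` is the class this package already exposes):

```python
import requests
from readability.readability import Document

response = requests.get('http://example.com/some-article')  # placeholder URL
# requests decodes the body itself: response.text is unicode,
# based on the HTTP headers (with a fallback guess).
doc = Document(response.text)
print(doc.short_title())
print(doc.summary())  # cleaned-up HTML of the main article
```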

I'm fairly sure there is still work to do, as my tests were very shallow, but I would really like to have a port done… (and that encoding.py is quite bad, I think: use UTF-8 strings everywhere and there should be no problem).

What do you think? :)

@buriy (Owner) commented Jul 17, 2014

I'd like to keep supporting Python 2.5 and 2.6 too, so I'd rather have Python 3 support in a separate branch.
There are only 2 tests in the package, and they don't check any real-world scenarios with broken page encodings; see the ideas on this at #42.
Also, it's a little bit faster with UTF-8… it's libxml's internal format, you know.
Later this year I'm going to run large tests on thousands of article sources and millions of articles and do a complete rebuild of the library, improving its extraction success ratio from 95% to 99.5%, but that moment may be 1-3 months away.

@palkeo (Author) commented Jul 17, 2014

Python 2.5 and 2.6 are quite old; even Django and other big libraries don't support them anymore… But yeah, that's your choice…

But then, why use str? Make readability accept only unicode input, and use unicode everywhere inside readability; that's way simpler.
And for loading web pages, there is requests! It can automatically guess the encoding and everything, it's so simple, and it returns… unicode :)

OK, by the way, I'm really interested in this Python library. Before, I was using a custom solution, and I have discovered that this is really not easy…
Oh, and if it's of any interest to you, I have found that if you have, say, 2-3 article pages from the same website, you can find the article by searching for the parts of the HTML that are not common across those pages. That works quite well too (and could be mixed with other approaches…), as sketched below.
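Something like this (a rough sketch of the idea, not production code; the tag list and length threshold are arbitrary):

```python
import lxml.html

BLOCK_TAGS = {'p', 'div', 'td', 'li', 'pre'}

def text_blocks(html):
    """Normalized text of every sizeable block element in a page."""
    tree = lxml.html.fromstring(html)
    blocks = set()
    for el in tree.iter():
        if el.tag in BLOCK_TAGS:
            text = ' '.join(el.text_content().split())
            if len(text) > 40:  # arbitrary: skip short boilerplate strings
                blocks.add(text)
    return blocks

def uncommon_blocks(page_html, other_pages_html):
    """Blocks of one page that appear on no other page of the same site.
    Navigation, footers and sidebars repeat across pages; the article
    text does not, so it is what remains after the subtraction."""
    common = set()
    for other in other_pages_html:
        common |= text_blocks(other)
    return text_blocks(page_html) - common
```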

@buriy (Owner) commented Jul 17, 2014

> Python 2.5 and 2.6 are quite old; even Django and other big libraries don't support them anymore… But yeah, that's your choice…

Well, mostly I do updates for my users; until this year I rarely had a chance to use the package myself more than once a year, I mean for more than several sites at a time.
Django 1.6, which is the current stable release, still supports Python 2.6 (there is only a release candidate of Django 1.7 so far; 1.7 is not finished yet).
And on my company's projects I still have Django versions varying from 1.2 to 1.6 (Django 1.2 introduced multiple-database support). For university work you can of course use Python 3 or whatever, but on larger projects upgrading takes a lot of time and brings instability, and nobody will do it unless it brings some valuable benefit.
I probably won't support Python 2.5 anymore, but I would like to keep supporting Python 2.6.
If a person has Python 3, 99% of the time she also has Python 2 installed, so she can run the command-line version of the package with just a few simple commands. Have you tried it? A user could do that even without knowing Python.

> But then, why use str? Make readability accept only unicode input, and use unicode everywhere inside readability; that's way simpler.

Libxml, which lxml is built on, uses UTF-8 under the hood. You'll get an automatic conversion to UTF-8 anyway; it's really just a matter of whether you want that conversion to be implicit or explicit. With older lxml and Python 2 there were no implicit utf-8/unicode conversions, which is why I used explicit ones. Maybe things have changed a little.
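To make the difference concrete (a small sketch, not the package's actual code):

```python
import lxml.html

page = u'<html><body><p>caf\xe9</p></body></html>'

# Explicit: encode to utf-8 ourselves and tell the parser about it,
# which is what the package does today.
parser = lxml.html.HTMLParser(encoding='utf-8')
doc = lxml.html.document_fromstring(page.encode('utf-8'), parser=parser)

# Implicit: hand lxml the unicode string and let it convert to utf-8
# internally -- same outcome, the conversion just happens out of sight.
doc = lxml.html.document_fromstring(page)
```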

> And for loading web pages, there is requests! It can automatically guess the encoding and everything, it's so simple, and it returns… unicode :)

Except that in real life the requests package doesn't work for a lot of real pages.
Some pages declare the wrong encoding, and some misuse encoding characters.
Please take a look at the other issues in this package, including the closed ones.
And you can still feed the current version unicode if you want to deal with broken web pages yourself -- so the current solution gives good results both for you and for me.
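For instance, when the declared encoding is a lie, you end up second-guessing it yourself (a sketch; `apparent_encoding` is the charset that requests' bundled chardet detects from the raw bytes):

```python
import requests

response = requests.get('http://example.com/badly-declared-page')  # placeholder
# response.encoding comes from the HTTP headers and may be wrong;
# response.apparent_encoding is detected from the bytes themselves.
if response.apparent_encoding and response.encoding != response.apparent_encoding:
    response.encoding = response.apparent_encoding  # re-decodes response.text
html = response.text
```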

> Oh, and if it's of any interest to you, I have found that if you have, say, 2-3 article pages from the same website, you can find the article by searching for the parts of the HTML that are not common across those pages. That works quite well too (and could be mixed with other approaches…).

Yes, I know this approach, but above all I'm interested in scalable ones. If you parse only several sites and some pages from each, almost any tool is fine; but if you parse thousands of sites, you need a tool that won't break and won't need much customization for every specific site.
The current package is also very fragile, so I'm going to move it "to the next level of quality" in the coming months.
The biggest problem with the "not common parts" idea is that article sites often have:

1. related-articles blocks,
2. image blocks with descriptions,
3. ads in the article text.

So you still have to manually inspect the pages and choose which block holds the article text and how to remove the bad blocks. I would like this package, in the future, to find the correct block without any manual help.
And to build a test database, I'll use a lot of news sites with RSS feeds: I'll parse the snippets from the RSS and expand them into complete articles with the readability package.
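Roughly like this (a sketch; `feedparser` here stands in for whatever RSS parser I end up using, and the feed URL is a placeholder):

```python
import feedparser
import lxml.html
import requests
from readability.readability import Document

feed = feedparser.parse('http://example.com/news/rss')  # placeholder feed
for entry in feed.entries:
    html = requests.get(entry.link).text
    article_html = Document(html).summary()
    article_text = ' '.join(
        lxml.html.fromstring(article_html).text_content().split())
    # The RSS snippet should reappear inside the extracted article;
    # if it does not, the extraction failed and the page becomes a
    # hard test case for the rebuild.
    snippet = ' '.join(
        lxml.html.fromstring(entry.summary).text_content().split())
    print(entry.link, snippet[:100] in article_text)
```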

sypa added a commit to bericht/python-readability that referenced this pull request Oct 11, 2014
@buriy (Owner) commented Jul 26, 2015

Thanks a lot!

@buriy closed this Jul 26, 2015