
Scraper halts upon meeting a link with unicode characters #43

Open
thibault opened this issue Nov 22, 2018 · 2 comments

@thibault
Contributor

Hi,

I've successfully set up an OpenScraper instance. Unfortunately, the spider always stops scraping after 15 results.

After a bit of investigation, here is the problem that seems to bring the spider to a halt:

::: ERROR scrapy.core.scraper 181122 13:46:58 ::: scraper:158 -in- handle_spider_error() ::: 		Spider error processing <GET https://www.ademe.fr/actualites/appels-a-projets> (referer: None)
    Traceback (most recent call last):
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
        yield next(it)
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
        for x in result:
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/openscraper/OpenScraper/openscraper/scraper/masterspider.py", line 609, in parse
        log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link),follow_link) )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 45: ordinal not in range(128)

The problem was generated by lines 600 and 609.

After analysing the trace, it appears the problem arises when the spider tries to follow this link:

https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats

So it seems openscraper has a problem handling links that are not pure ASCII.
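
For reference, the failure can be reproduced outside Scrapy with a plain .format call. A minimal Python 2.7 sketch (the variable name is illustrative, the URL is the one above):

    # -*- coding: utf-8 -*-
    # Python 2.7: byte-string template + non-ASCII unicode argument
    follow_link = u'https://appelsaprojets.ademe.fr/aap/H2mobilit\xe92018-82#resultats'

    # str.format builds a byte string, so it encodes the unicode argument
    # with the default 'ascii' codec and fails on the accented character:
    " --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link)
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...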

@JulienParis
Collaborator

JulienParis commented Nov 22, 2018

I don't think it's the scraper per se; I would say it's the logging that causes the spider to stop with this error: the .format function can go berserk with accented characters when it's used in a log call (here in log_scrap)...

I remember I had the same issue before... so for a start I would replace:
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link),follow_link) )
with
log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)

Let us know if that works at those lines; if it does, you could fix it up with a PR.
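
Roughly what that change does, as a standalone sketch (assuming log_scrap is a standard logging logger, which is a guess about the codebase):

    # -*- coding: utf-8 -*-
    import logging

    logging.basicConfig(level=logging.INFO)
    log_scrap = logging.getLogger("openscraper")

    follow_link = u'https://appelsaprojets.ademe.fr/aap/H2mobilit\xe92018-82#resultats'

    # With lazy %-style arguments, logging interpolates via msg % args,
    # which promotes the message to unicode instead of encoding the
    # argument to ASCII at the call site, so no UnicodeEncodeError here:
    log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)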

@thibault
Contributor Author

Well, the problem is that you are passing unicode variables into a byte string without an explicit encoding. Since Python 2 tries to silently convert between the two types on the fly, it will be OK most of the time, but as soon as the string is not pure ASCII, an error will be raised.

There are several ways to fix this.

  1. You could encode the data every time you want to log it:

 log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link.encode('utf-8')))

  2. You could import unicode_literals in every file to make sure all string literals are unicode and not binary:

 from __future__ import unicode_literals

 log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

  3. You could prefix every string literal with "u" to make sure it is unicode and not binary:

 log_scrap.info(u" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

I will publish a PR with the third solution, which allowed me to scrape the entire ademe site without errors, but you might want to check the codebase for other places where unicode and binary strings are mixed. Porting the project to Python 3 could also help, since Python 3 does not silently cast between unicode and bytes.
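
To illustrate that last point, a small Python 3 sketch (illustrative only, not OpenScraper code):

    # Python 3: all string literals are text, so the original call just works:
    follow_link = 'https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats'
    print(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))

    # ...and mixing text with bytes fails loudly instead of being coerced:
    'link: ' + b'H2mobilit\xc3\xa9'   # TypeError: can only concatenate str (not "bytes") to str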
