Scraper halts upon meeting a link with unicode characters #43
Comments
I don't think it's the scraper per se. I would say it's the log that causes the spider to stop because of this error. I remember I had the same issue before, so for a start I would replace the log calls at those lines. Let us know if it's working, so you could fix it up with a PR.
Well, the problem is that you are passing unicode variables into a byte string without an explicit encoding. Since Python 2 silently tries to convert between the two types on the fly, this works most of the time, but as soon as the string is not pure ASCII, an error is raised. There are several ways to fix this.
1. Encode the unicode value explicitly:
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link.encode('utf-8')))
2. Make every string literal in the module unicode:
from __future__ import unicode_literals
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))
3. Make the format string itself a unicode literal:
log_scrap.info(u" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))
I will publish a PR with the third solution, which allowed me to scrape the entire ademe site without errors, but you might want to check the codebase for other places where unicode and byte strings are mixed. Porting the project to Python 3 could also help, since Python 3 does not silently convert between unicode and bytes.
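To make the failure mode concrete, here is a minimal sketch that runs on both Python 2 and Python 3. It is not openscraper code: the helper names are hypothetical, and `emulate_py2_implicit_conversion` only imitates the ASCII encoding step that Python 2 performs when a unicode value is formatted into a byte string template.

```python
# -*- coding: utf-8 -*-
# Stand-in for the link the spider follows (taken from the issue report).
follow_link = u"https://appelsaprojets.ademe.fr/aap/H2mobilit\u00e92018-82#resultats"


def emulate_py2_implicit_conversion(value):
    """Imitate Python 2 pushing unicode text into a byte string: encode as ASCII."""
    return value.encode("ascii")


def format_with_unicode_template(value):
    """Same idea as the third fix: a unicode template keeps everything as text."""
    return u" --> follow_link CLEAN ({}) : {} ".format(type(value).__name__, value)


try:
    emulate_py2_implicit_conversion(follow_link)
    implicit_conversion_failed = False
except UnicodeEncodeError:
    # This is the kind of error that stopped the spider.
    implicit_conversion_failed = True

# No implicit conversion happens here, so the accented character survives.
message = format_with_unicode_template(follow_link)
```

The design point is simply to keep text as unicode all the way to the logging call, and to encode only at an explicit boundary if bytes are really needed.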
Hi,
I've successfully set up an openscraper instance. Unfortunately, the spider always stops scraping after 15 results.
After a bit of investigation, here is what seems to be the problem that brings the spider to a halt:
The problem is triggered at line 600 and line 609.
After analysing the trace, the problem arises when the spider tries to follow this link:
https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats
So it seems openscraper has a problem handling links that are not pure ASCII.
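As a rough way to confirm which links are affected, the sketch below scans a link for its first non-ASCII character. The function name `first_non_ascii` is hypothetical and is not part of openscraper; it only illustrates how the offending character in the link above could be located.

```python
# -*- coding: utf-8 -*-
def first_non_ascii(text):
    """Return (index, character) of the first non-ASCII character, or None."""
    for index, char in enumerate(text):
        if ord(char) > 127:
            return index, char
    return None


# The link from the report, with the accented character written as an escape.
link = u"https://appelsaprojets.ademe.fr/aap/H2mobilit\u00e92018-82#resultats"
offender = first_non_ascii(link)

# A plain ASCII link, for comparison, yields None.
ascii_only = first_non_ascii(u"https://appelsaprojets.ademe.fr/aap/")
```

Running a check like this over the scraped links before logging them would show whether every halt coincides with an accented or otherwise non-ASCII URL.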