Segfault or encoding error when parsing a URL #59
> For clarity, you only need to extract all links from the front page.

^ Seems to be enough to reproduce on my system.

That it works fine when I drop dups is somewhat intriguing. Maybe the code is running out of memory or something. (Maybe there's a memory leak in there somewhere?) If you've more memory than I do and it doesn't choke on your system as a result, you can probably use
See #58 (comment) and #58 (comment)
Also repeating it here. To reproduce, run a broad crawl on this dataset and extract all links:
https://www.kaggle.com/cheedcheed/top1m
then call `urljoin()` and `urlsplit()` on each extracted link.