Installation instructions are wrong #58
Comments
This may be because that subfolder lies within a git submodule.
@nyov: not that I can recollect - certainly not if this wasn't in the installation instructions...
This did the trick before running git submodule init
git submodule update --init --recursive
But then it segfaults on one of the urls in my dataset. Oh well...
(The offending strings are buried in a file with millions of entries, so I'm afraid I can't locate them easily, but the utf8 encoding related error is hopefully a good enough hint as to what the issue is.)
Glad you managed to figure it out, I forgot about the "update init". You could throw in some logging to get the position in the file (dump part of the line's raw byte string to grep for, or something). (I haven't actually used scurl, or I might be able to help you with that error. But it looks obvious: wrong encoding on some of your text ➡ mojibake.)
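The logging idea above could look roughly like this: a minimal sketch that scans the input file in binary mode and reports the line number plus a raw byte prefix for every line that does not decode cleanly. It assumes one entry per line and UTF-8 as the expected encoding; `find_undecodable_lines` and its parameters are made-up names for illustration, not part of scurl or scrapy.

```python
from typing import Iterator, Tuple


def find_undecodable_lines(
    path: str, encoding: str = "utf-8", preview: int = 80
) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, raw_byte_prefix) for each line that fails to decode.

    Hypothetical helper: assumes the dataset has one entry per line.
    """
    # Read as bytes so a bad sequence cannot abort the scan itself.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                # Keep only a short prefix; enough to grep for later.
                yield lineno, raw[:preview]
```

Printing each yielded pair gives something to grep for in the original file without loading all the entries into memory at once.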
No no, that's totally a bug in the library, not the data. The library is supposed to join URLs from out there in the wild (this being part of scrapy), so it cannot possibly expect valid data, let alone segfault when it encounters anything wrong. To reproduce, run a broad crawl on this dataset and extract all links: https://www.kaggle.com/cheedcheed/top1m use
Thanks for the reports @ddebernardy and for the help @nyov, I created a separate issue to track the segfault/encoding issue.
I was following the install instructions from the README (macOS 10.14.5).
There was one warning about
... which I ignored. And then this failed:
The offending folder exists but is empty.
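An existing-but-empty folder is what an uninitialized git submodule looks like in a fresh clone. As a rough sketch, assuming the standard `.gitmodules` layout with one `path = ...` entry per submodule, the following lists the submodule directories that are still missing or empty. `empty_submodule_dirs` is a hypothetical helper name, not something shipped with this project.

```python
import re
from pathlib import Path
from typing import List


def empty_submodule_dirs(repo_root: str) -> List[str]:
    """Return submodule paths from .gitmodules that are missing or empty.

    Hypothetical helper for illustration; only reads the `path = ...` entries.
    """
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return []  # no submodules declared
    paths = re.findall(
        r"^\s*path\s*=\s*(.+?)\s*$", gitmodules.read_text(), flags=re.MULTILINE
    )
    empty = []
    for rel in paths:
        directory = root / rel
        # An uninitialized submodule shows up as an empty directory.
        if not directory.is_dir() or not any(directory.iterdir()):
            empty.append(rel)
    return empty
```

If this returns any paths, running `git submodule update --init --recursive` from the repository root should populate them.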