Strengthen research with google API #26
Comments
(it's also not verifiable)
Wouldn't it be better to limit ourselves to more accurate and verifiable sources? I would rather write a quick crawler that downloads a website, extracts the key information we want, and discards the HTML. That would save a lot of space and could be done easily. The uncompressed October dataset is 5.9GB, while all the CSVs I've generated for webdevdata-reports total 567MB:
~/webmob-reports% du -sh webdevdata-latest/
5.9G webdevdata-latest/
~/webmob-reports% du -sh csv_out
567M csv_out
~/webmob-reports% wc -l csv_out/*
1933179 csv_out/all_tags.csv
527432 csv_out/link_tags.csv
275799 csv_out/link_tags_stylesheet.csv
125825 csv_out/link_tags_stylesheet_media.csv
326287 csv_out/meta_tags.csv
1816 csv_out/meta_tags_application_names.csv
15926 csv_out/meta_tags_viewport.csv
641462 csv_out/script_tags.csv
3847726 total
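As a rough illustration of the crawler idea above (download a page, keep only the tags of interest, throw the HTML away), here is a minimal sketch using only the Python standard library. The function names, the choice of `meta`, `link` and `script` tags, and the CSV column layout are assumptions made for this example; they are not taken from webdevdata-reports.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser

# Assumption: these are the tags worth keeping, loosely mirroring the CSV names above.
KEPT_TAGS = {"meta", "link", "script"}


class TagExtractor(HTMLParser):
    """Collect (tag, attribute, value) rows for the tags in KEPT_TAGS."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in KEPT_TAGS:
            for name, value in attrs:
                # Valueless attributes (e.g. <script async>) arrive as None.
                self.rows.append((tag, name, value or ""))


def fetch(url):
    """Download one page and return its text; the caller never stores it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rows(site, html):
    """Turn one page into small rows so the HTML itself can be discarded."""
    parser = TagExtractor()
    parser.feed(html)
    parser.close()
    return [(site, tag, name, value) for tag, name, value in parser.rows]


def rows_to_csv(rows):
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["site", "tag", "attribute", "value"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Something like `rows_to_csv(extract_rows("example.com", fetch("http://example.com/")))` would then yield the CSV text for one home page. A real crawler would also need error handling, redirects and politeness delays, which this sketch leaves out.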
On Thursday, November 28, 2013 at 11:44 PM, Ernesto Jiménez wrote:
I don’t think we should limit ourselves. We are able to provide verifiable results, which is great, but a secondary source that shows use at “web scale” certainly helps strengthen our argument. It gives an indication of the reach of a given feature beyond our dataset (even if unverifiable). Having said that, I strongly agree that we should not use it as a primary source, as we don’t know what each search from Google actually means (we could look that up).
That could be quite an efficient way of doing this. If we know exactly what we are looking for, then we could broaden our search, especially if we could split the task amongst a cluster of computers. Then we could easily search the top 1,000,000 sites if each machine downloaded 100,000 home pages in a very targeted way.
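To make the split concrete: 1,000,000 home pages at 100,000 per machine works out to ten machines. A minimal sketch of dividing a ranked site list into per-machine chunks might look like this; the `shard` helper is a hypothetical name for this example, not part of any existing tooling.

```python
def shard(sites, machines):
    """Split a ranked site list into contiguous, near-equal chunks, one per machine."""
    if machines <= 0:
        raise ValueError("machines must be a positive integer")
    base, extra = divmod(len(sites), machines)
    chunks = []
    start = 0
    for index in range(machines):
        # The first `extra` machines take one additional site each.
        size = base + (1 if index < extra else 0)
        chunks.append(sites[start:start + size])
        start += size
    return chunks
```

Contiguous chunks keep each machine on one band of the ranking, which makes it easy to rerun a single failed range.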
We can probably draw on the following to strengthen the findings. It's not as accurate, but it covers a much, much larger data set.
http://git.macropus.org/meta-tag-usage/