Normalize mentions about the october data set #22

Open
ernesto-jimenez opened this issue Nov 28, 2013 · 1 comment
Comments

@ernesto-jimenez
Member

I think the best option would probably be to add an appendix about the October data set describing the situation and the dataset's caveats (e.g., it was collected without a mobile UA, so some websites might have done UA sniffing and served a desktop site without mobile-specific meta tags).

Then we link to that appendix instead of the gist we link to right now.

@marcoscaceres
Contributor

Yes! absolutely.

We need to say whether the sample is probabilistic or non-probabilistic (it's non-probabilistic, because we don't know how many web pages there are on the Web). Hence, we cannot generalize from it. However, the sample size, n=78k, is more than adequate for an exploratory analysis (cf. [1]).

Selection bias: the pages were selected by Alexa's ranking algorithm, so we need to understand how they end up with this list and whether it's representative of "the world" (i.e., are all countries represented in the set, etc.). There may be language bias. We don't need to investigate this, just acknowledge it.

We know some of the data may be bad if processed with grep. I think that's about it, or at least good enough to start.
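
To illustrate the grep caveat, here is a rough sketch (not the script actually used on the data set) comparing a line-based, grep-style match against a real HTML parse when looking for a viewport meta tag. The function names, the regex, and the sample markup are all made up for illustration; the only assumption is that the original processing matched patterns line by line.

```python
import re
from html.parser import HTMLParser


class ViewportFinder(HTMLParser):
    """Sets `found` when a <meta name="viewport"> start tag is parsed."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attr_map = {name: (value or "") for name, value in attrs}
            if attr_map.get("name", "").lower() == "viewport":
                self.found = True


def has_viewport_parsed(html):
    """Hypothetical helper: detect the tag by parsing the markup."""
    finder = ViewportFinder()
    finder.feed(html)
    finder.close()
    return finder.found


def has_viewport_grep(html):
    """Hypothetical helper: grep-like, one line at a time, fixed attribute order."""
    pattern = re.compile(r'<meta name="viewport"', re.IGNORECASE)
    return any(pattern.search(line) for line in html.splitlines())


# Tag split across lines, attributes reordered, single quotes: grep misses it.
MULTILINE = """<html><head>
<meta
  content="width=device-width"
  name='viewport'>
</head><body></body></html>"""

# Tag only present inside a comment: grep reports it anyway.
COMMENTED = """<html><head>
<!-- <meta name="viewport" content="width=device-width"> -->
</head><body></body></html>"""

for label, page in (("multiline", MULTILINE), ("commented", COMMENTED)):
    print(label, has_viewport_grep(page), has_viewport_parsed(page))
```

The two samples disagree in opposite directions, which is the kind of noise the appendix should acknowledge even if we don't quantify it.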

[1] Reference: Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley & Sons.
