regionalization metric #177
@jread-usgs @aappling-usgs I think I set up the correct comparison data to get these ranks. I just don't really know what to do for the actual index. I'm using cached session data from the process step:

# for now, using the US population dataset from 1977 (built-in state.x77)
# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[, 'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop / us_total * 100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)
# get traffic data based on users by state
# this is data from one app
app_traffic <- readRDS('cache/process/lastest_year.rds')
app_traffic_grp <- dplyr::group_by(app_traffic, region)
app_traffic_sum <- dplyr::summarize(app_traffic_grp, traffic = sum(users))

# merge app traffic w/ population
app_traffic_sum <- dplyr::rename(app_traffic_sum, state = region)
traffic_by_state <- dplyr::left_join(us_pop, app_traffic_sum, by = 'state')

# calculate percentage of traffic for each state
# (na.rm so states with no recorded traffic don't make the total NA)
traffic_total <- sum(traffic_by_state$traffic, na.rm = TRUE)
traffic_by_state <- dplyr::mutate(traffic_by_state, traffic_pct = traffic / traffic_total * 100)

# now calculate the regionality index....
corr_result <- cor.test(x = traffic_by_state$pop_pct,
                        y = traffic_by_state$traffic_pct,
                        method = "spearman")

Am I trying to compare the right things?
I think so...
agreed with jread - looks good to me so far.
Possible to get the app names on the bar plot @lindsaycarr?
The dataset I have cached only has viewIDs, and no corresponding app names. I'll run make and see if it changes. If not, I might be able to work something out -- not sure how quickly
Working something out quickly with David and Walker, so a new plot w/ names is coming soon
Accompanying script: [link not preserved]
looks like we need Pearson correlation instead of Spearman
need to grab population data that is not from 1977 before we start drawing too many conclusions from this
with 2012 population data: [plot not preserved]
@lindsaycarr can you try it w/ Pearson's R?
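For reference, this would presumably just be the earlier cor.test() call with the method argument swapped:

# same comparison as before, Pearson instead of Spearman
corr_result <- cor.test(x = traffic_by_state$pop_pct,
                        y = traffic_by_state$traffic_pct,
                        method = "pearson")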
Some seem reasonable, but a lot are skewed because all the developers sit here in WI.
I am not sure of the best stat for this, but I think the linear relationship that Pearson's tests for is more appropriate than the ranking that Spearman's looks for (i.e., "is the order of states the same?"). What this isn't doing yet is properly dealing w/ zeros. For example, the NPS mercury viewer has a pretty "national" score, but really only has a small number of visits from 4 states.
agreed w/ jread that spearman's may look prettier, but pearson's makes more sense - i'm more interested in differences between the proportions of population and proportions of traffic than i am in any differences in how states rank relative to one another by population vs traffic. alternatively, what about a metric constructed from the differences between traffic & population proportions (or quotients, i.e., traffic per capita)? what about root mean squared difference between traffic and population proportions? then the 0s would be -1, which is a big number, and big RMSDs would indicate a lack of evenness (numbers close to 0 indicate greater evenness).
Thinking about this again...what we really want is a metric that gives us some measure of how far off the "real" percentages are from falling on a 1:1 line with the "expected" (or population-based) numbers. So I think the slope here is important. Neither of the two metrics we have tried so far does this. Ok, just saw alison weighed in too
yeah, agreed on the slope. i think RMSD fixes that part. unsure whether it introduces any other problems...would like to see how it looks.
revisiting my previous comment: i guess i'm thinking normalized differences, like this:
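[original snippet not preserved; a minimal reconstruction of the relative-difference RMSD being described, reusing traffic_by_state from the first comment:]

# relative (normalized) differences: zero traffic gives -1, matching the earlier note
rel_diff <- (traffic_by_state$traffic_pct - traffic_by_state$pop_pct) / traffic_by_state$pop_pct
rmsd_rel <- sqrt(mean(rel_diff^2, na.rm = TRUE))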
with the caveat that if pop_pct is 0.18% (wyoming), then traffic_pct of 1% becomes a huge number (5), and if there were only 100 visitors in a time period, traffic_pct could pretty much randomly flip between 1% and 2%. so small states will contribute noise. the alternative is absolute differences, which will pretty much ignore small states altogether.
that's an RMSE calculation, isn't it? ☝️ or do those two mean the same thing...
yes; i just avoided E because they're not really errors in my mind. the clarification was RMS[E/D] of the relative differences rather than the absolute differences. absolute differences would look like this:
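[original snippet not preserved; reconstructed the same way as the relative version above:]

# absolute differences: drops the pop_pct denominator, so small states barely register
abs_diff <- traffic_by_state$traffic_pct - traffic_by_state$pop_pct
rmsd_abs <- sqrt(mean(abs_diff^2, na.rm = TRUE))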
and i realized i hadn't specifically suggested relative differences in my previous comment.
I think we'll need to boil this down to a scale that is going to be simple to pick up quickly - like a scale from 1-10 or 0-1, from local to national. Tricky to get an RMSD into that kind of scale, but maybe there is a simple way to do that.
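One hypothetical way to do that (not something anyone settled on in this thread) would be a bounded transform of the RMSD, e.g.:

# maps RMSD in [0, Inf) onto (0, 10]: 10 = perfectly even, approaches 0 as RMSD grows
score <- 10 / (1 + rmsd_rel)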
what about R^2 for traffic_pct ~ pop_pct? it penalizes relationships with slope != 1, and usually ranges between 0 and 1. caveat is that it can be negative for this type of dataset (actual range is -Inf to 1). we could just truncate at a lower limit of 0 for plotting purposes... example code for the above plots:

pop_pct <- seq(1, 100, by = 2)
traffic_pct <- pmax(0, rnorm(n = 50, mean = seq(0, 100, length.out = 50), sd = 3))
plot(traffic_pct ~ pop_pct, ylim = c(0, 180), xlim = c(0, 180)); abline(a = 0, b = 1)
# R^2 computed against the 1:1 line rather than a fitted regression line
met <- 1 - sum((traffic_pct - pop_pct)^2) / sum((traffic_pct - mean(traffic_pct))^2)
title(paste0('R^2 = ', met))
That allows an offset though, right? I thought this was the way to pin a 1:1 slope and an intercept of 0: lm(y ~ 0 + offset(x)), but it doesn't have any freedom to it, so you can't calculate an R2
i thought we wanted to penalize slopes different from 1, which the above code does. i did the calculation from the formula rather than fitting an lm; i'm not sure how you'd fit an lm so that it penalizes non-1 slopes
seems to always give r.squared = 0:
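[output snippet not preserved; a reconstruction of what was presumably being shown - with no coefficients to estimate, summary.lm() reports an R-squared of 0:]

fit <- lm(traffic_pct ~ 0 + offset(pop_pct))
summary(fit)$r.squared
## [1] 0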
Gotcha. I understand yours now. I was thinking that was a snippet that went into an lm call.
I didn't see your snippet at the bottom the first time I looked.
i might have added it in an edit - that's a bad habit of mine =)
OK, I like this. @lindsaycarr and I are working on getting the data.frame set to do the math.
OK, so here is what I've got using R^2 for the most recent year of analytics data. The states listed are the ones that deviated the most from "expected", and the numbers in parentheses are the amounts they deviated. All negative R^2s were treated as 0s. The script can be found in my branch: https://github.com/lindsaycarr/internal-analytics/blob/regionality/scripts/process/regionality_test.R
Looks good! Alison, what do you think?
Yeah, I like it! You could probably round the numbers more if you want - looks like you seldom need 1 digit after the decimal point, let alone 2. maybe just go for 2 sig figs?
Good idea. @jread-usgs any desire to make this 0:10 instead of 0:1?
PS - i couldn't remember how to do sig figs, so looked it up. in case it's not on the tip of your fingers already:

signif(c(.1234, 1.234, 12.34, 123.4), digits = 2)
## [1]   0.12   1.20  12.00 120.00
Yes, I think 0-10 as a metric would be great. Would it be possible to mock up figure 1 with this as the second column (only two columns total)?
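Putting the thread's pieces together (truncate negative R^2 at 0, scale by 10, round to 2 sig figs), the per-app score would presumably be something like:

regionality <- signif(pmax(0, met) * 10, digits = 2)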
@jread-usgs I just read this again, but I'm actually not sure what you are talking about. Add this to Fig 1 of the internal-analytics home page? So the yearly analytics bars as column one and this as column two?
Correct @lindsaycarr
what have you tried? does this help? https://stackoverflow.com/a/21585521/3203184
So, I didn't touch the plotting code; just changed the data. The plotting code has:
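[quoted lines not preserved; based on the follow-up, it was presumably a facet_grid() call along these lines, where the facet variable and argument values are guesses:]

ggplot2::facet_grid(viewID ~ ., scales = "free_y", space = "free_y")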
which I think would do this appropriately. I've tried all options for scales and space (free, fixed, free_y, free_x).
and if you delete the regionalization rows, everything is nice again?
facet_grid is designed to use the same scale in x in 1 column, or the same y scale in 1 row...so you have to scale it yourself if you want to bypass that. If you are using the processed data from trend_all (?), then you should be able to use "scaled_value", "scaled_newUser", and "text_placement".
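If those pre-scaled columns weren't available, "scale it yourself" could look roughly like this (plot_data, viewID, and value are illustrative names, not from the repo):

library(dplyr)
plot_data <- plot_data %>%
  dplyr::group_by(viewID) %>%
  dplyr::mutate(scaled_value = value / max(value, na.rm = TRUE))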
Ha - just figured it out. Needed it to ignore scaled_newUser.
Wait, that wasn't exactly it. I don't really understand why, but having both of these lines makes it not scale freely, but removing either of them allows it to.
[snippet not preserved]
/shrug !
@lindsayplatt what is the status of this work? Is the metric settled and ready for prime time?
preliminary implementation of the "regionality" scale for each app at the year scale. It doesn't need to be included in the app now and you could just make one bar chart figure for it. A "spearman rank correlation" will be a good way to get the index.
@jread-usgs in #internal-analytics Slack channel