Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regionalization metric #177

Open
lindsayplatt opened this issue Jul 17, 2017 · 48 comments
Open

regionalization metric #177

lindsayplatt opened this issue Jul 17, 2017 · 48 comments
Assignees

Comments

@lindsayplatt
Copy link
Contributor

preliminary implementation of the "regionality" scale for each app at the year scale. It doesn't need to be included in the app now and you could just make one bar chart figure for it. A "spearman rank correlation" will be a good way to get the index.

@jread-usgs in #internal-analytics Slack channel

I think we should calculate a "regionalization metric" (or some other name) that gets at the local vs national scale of the particular web page/app. I was thinking of a scale from 0 to 10, where 0 is local and 10 is national. To do this, we could (this is just an idea, but would make sense to think of different wants to do it) calculate the percentage of user traffic expected for each state based on population, and compare the actual percentages for each time period. A perfect match (e.g., 12.1% of traffic from CA, 1.8% from WI, 8.6% from TX, etc...) would be a 10 ("national"), which is very unlikely, and a lesser match (such as 80% of traffic from WI) would be closer to 0, or on the "local" end of the spectrum. So each app for each time period would have a score, and we could also have a cross-app score for comparison.

where would/could this be? It could be a horizontal bar below the map w/ fill between 0 and the score, and then a vertical slash that is taller than the bar thickness representing the mean/median of all apps for the same time period. If it was a useful way to distill this concept into a number like this, figure one could be a single column for the time period viewed, then column two could be same 0-10 bar, and a smaller column three could be the name of the state that is the greatest positive outlier from the expected distribution of visitors. Figure one would then also be reactive to the time period selected (instead of having all three).

Today's meeting discussed the need to tell better stories about what we do, and come up with better ways to measure them. I think this is a way can get a quick metric on the reach of many of the products that would contain these "stories". We have the gross numbers for visits clearly presented, but I think the "where" is just as important as how many to have access to as a quick view.

@aappling-usgs reaction:

cool idea! some of the more internally facing apps have high concentrations of users in california, colorado, and virginia - agreed that most important would be regionalization metric w.r.t. all users, but after that i'd be interested to see how enterprise apps are serving USGS users, which we could do by dividing by fraction of USGS that's in each state. could the all-users map toggle between being total per state and total per capita? (horizontal bar is fine too)

also, sorta wondering if there's a biodiversity / spatial autocorrelation metric for this

@lindsayplatt lindsayplatt self-assigned this Jul 17, 2017
@lindsayplatt
Copy link
Contributor Author

@jread-usgs @aappling-usgs I think I setup the correct comparison data to get these ranks. I just don't really know what to do for the actual index. I'm using cached session data from the process step, lastest_year.rds.

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
app_traffic <- readRDS('cache/process/lastest_year.rds')
app_traffic_grp <- dplyr::group_by(app_traffic, region)
app_traffic_sum <- dplyr::summarize(app_traffic_grp, traffic = sum(users))

# merge app traffic w/ population
app_traffic_sum <- dplyr::rename(app_traffic_sum, state = region)
traffic_by_state <- dplyr::left_join(us_pop, app_traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_by_state$traffic)
traffic_by_state <- dplyr::mutate(traffic_by_state, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index....
corr_result <- cor.test(x = traffic_by_state$pop_pct,
                        y = traffic_by_state$traffic_pct,
                        method = "spearman")

Am I trying to compare the right things?

@jordansread
Copy link

I think so...
traffic_by_state$traffic_pct is the actual percent of traffic for each state over the time period (yearly here, but would apply to month or week as well) and traffic_by_state$pop_pct is the percent of total us pop in that state?
can you calc the metric for a could of apps and share? Or do a barchart/lollipop with a point for each app and then we can see if it passes the smell test.

@aappling-usgs
Copy link
Member

agreed with jread - looks good to me so far.

@lindsayplatt
Copy link
Contributor Author

traffic_by_state$traffic_pct = percent of total users in each state
traffic_by_state$pop_pct = percent of total population in each state

image

image

@jordansread
Copy link

Possible to get the app names on the bar plot @lindsaycarr ?

@lindsayplatt
Copy link
Contributor Author

The dataset I have cached only has viewIDs, and no corresponding app names. I'll run make and see if it changes. If not, I might be able to work something out -- not sure how quickly

@lindsayplatt
Copy link
Contributor Author

Working something quick out with David and Walker, so new plot w/ names coming soon

@lindsayplatt
Copy link
Contributor Author

lindsayplatt commented Jul 17, 2017

image

Accompanying script:

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_grp <- dplyr::group_by(traffic, viewID, region)
traffic_sum <- dplyr::summarize(traffic_grp, traffic = sum(users))
traffic_sum <- dplyr::ungroup(traffic_sum)

# merge app traffic w/ population
traffic_sum <- dplyr::rename(traffic_sum, state = region)
traffic_pop <- dplyr::left_join(us_pop, traffic_sum)

# throw in app names
gaList <- yaml::yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- dplyr::left_join(traffic_pop, gaTable)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- dplyr::mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- dplyr::filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

@jordansread
Copy link

looks like we need Pearson correlation instead of Spearman

@lindsayplatt
Copy link
Contributor Author

need to grab population data that is not from 1977 before we start making too many conclusions based on this

@lindsayplatt
Copy link
Contributor Author

with 2012 population data:

image

library(choroplethr)
library(dplyr)
library(yaml)

# get population and percentage of total US population per state
# 2012 population data

data(df_pop_state)
us_pop <- df_pop_state
us_pop <- rename(us_pop, pop = value)
us_total <- sum(us_pop$pop)
us_pop <- mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_sum <- group_by(traffic, viewID, region) %>%
  summarize(traffic = sum(users)) %>%
  ungroup() %>%
  mutate(region = tolower(region))

# merge app traffic w/ population
traffic_pop <- left_join(us_pop, traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# throw in app names
gaList <- yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- left_join(traffic_pop, gaTable)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

@jordansread
Copy link

@lindsaycarr can you try it w/ Pearson's R?

@lindsayplatt
Copy link
Contributor Author

lindsayplatt commented Jul 17, 2017

Hmm we've got some negative guys in here. Added the top contributing state next to each bar.

Pearson's r
image

@lindsayplatt
Copy link
Contributor Author

lindsayplatt commented Jul 17, 2017

Some seem reasonable, but a lot are skewed because all the developers sit here in WI.

@lindsayplatt
Copy link
Contributor Author

spearman w/ top states:

image

@jordansread
Copy link

I am not sure the best stat for this, but I think the linear relationship that Pearsons test is more appropriate than the ranking test Spearmans looks for (such as "are the order of states the same?").
I also think lumping all < 0.1 into a score of 0.1 for the time being would be fine.

What this isn't doing yet is properly dealing w/ zeros. For example, the NPS mercury viewer has a pretty "national" score, but really only has a small number of visits from 4 states.

@aappling-usgs
Copy link
Member

aappling-usgs commented Jul 17, 2017

agreed w/ jread that spearman's may look prettier, but pearson's makes more sense - i'm more interested in differences between the proportions of population and proportions of traffic than i am in any differences in how states rank relative to one another by population vs traffic.

alternatively, what about a metric constructed from the differences between traffic & population proportions (or quotients, i.e., traffic per capita)? what about root mean squared difference between traffic and population proportions? then the 0s would be -1, which is a big number, and big RMSDs would indicate a lack of evenness (numbers close to 0 indicate greater evenness).

@jordansread
Copy link

Thinking about this again...what we really want is a metric that gives us some measure of how far off the "real" percentages are from falling on a 1:1 with the "expected" (or population-based) numbers. So I think the slope here is important. Neither of the two metrics we have tried so far do this.

Ok, just saw alison weighed in too

@aappling-usgs
Copy link
Member

yeah, agreed on the slope. i think RMSD fixes that part. unsure whether it introduces any other problems...would like to see how it looks.

@aappling-usgs
Copy link
Member

revisiting my previous comment: i guess i'm thinking normalized differences, like this:

sqrt(mean((traffic_pct - pop_pct) / pop_pct )^2))

with the caveat that if pop_pct is 0.18% (wyoming), then traffic_pct of 1% becomes a huge number (5), and if there were only 100 visitors in a time period, traffic_pct could pretty much randomly flip between 1% and 2%. so small states will contribute noise. the alternative is absolute differences, which will pretty much ignore small states altogether.

@jordansread
Copy link

jordansread commented Jul 17, 2017

that's a RMSE calculation, isn't it? ☝️ or do those two mean the same thing...

@aappling-usgs
Copy link
Member

yes; i just avoided E because they're not really errors in my mind.

the clarification was RMS[E/D] of the relative differences rather than the absolute differences. absolute differences would look like this:

sqrt(mean((traffic_pct - pop_pct))^2))

and i realized i hadn't specifically suggested relative differences in my previous comment.

@jordansread
Copy link

I think we'll need to boil this down to a scale that is going to be simple to pick up quickly- like a scale from 1-10 or 0-1. From local to national. Tricky to get a RMSD into that kind of scale but maybe there is a simple way to do that.

@aappling-usgs
Copy link
Member

aappling-usgs commented Jul 17, 2017

what about R^2 for traffic_pct ~ pop_pct? it penalizes relationships with slope != 1, usually ranges between 0 and 1. caveat is that it can be negative for this type of dataset (actual range is -Inf to 1). we could just truncate at a lower limit of 0 for plotting purposes...

image

image

image

image

example code for the above plots:

pop_pct <- seq(1,100,by=2)
traffic_pct <- pmax(0, rnorm(n=50, mean=seq(0,100,length.out=50), sd=3))
plot(traffic_pct ~ pop_pct, ylim=c(0,180), xlim=c(0,180)); abline(a=0,b=1)
met <- 1-sum((traffic_pct - pop_pct)^2)/sum((traffic_pct - mean(traffic_pct))^2)
title(paste0('R^2 = ', met))

@jordansread
Copy link

jordansread commented Jul 17, 2017

That allows an offset though, right? I thought this was the way to pin 1:1 and intercept of 0:

lm(y ~ 0 + offset(x))

but it doesn't have any freedom to it, so you can't calculate an R2

@aappling-usgs
Copy link
Member

i thought we wanted to penalize slopes different from 1, which the above code does. i did the calculation from formula rather than fitting an lm; i'm not sure how you'd fit an lm so it penalized non-1 slopes

@aappling-usgs
Copy link
Member

summary(lm(traffic_pct ~ 0 + offset(pop_pct)))

seems to always give r.squared = 0:

List of 9
 $ call         : language lm(formula = traffic_pct ~ 0 + offset(pop_pct))
 $ terms        :Classes 'terms', 'formula'  language traffic_pct ~ 0 + offset(pop_pct)
  .. ..- attr(*, "variables")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "offset")= int 2
  .. ..- attr(*, "factors")= int(0) 
  .. ..- attr(*, "term.labels")= chr(0) 
  .. ..- attr(*, "order")= int(0) 
  .. ..- attr(*, "intercept")= int 0
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "traffic_pct" "offset(pop_pct)"
 $ aliased      : logi(0) 
 $ residuals    : Named num [1:50] 42.8 36.8 40.8 -7 15.3 ...
  ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
 $ df           : int [1:3] 0 50 0
 $ coefficients : logi[0 , 1:4] 
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ sigma        : num 38.4
 $ adj.r.squared: num 0
 $ r.squared    : num 0
 - attr(*, "class")= chr "summary.lm"

@jordansread
Copy link

Gotcha. I understand your's now. I was thinking that was a snippet that went into an lm()

@jordansread
Copy link

I didn't see your snippet at the bottom the first time I looked.

@aappling-usgs
Copy link
Member

i might have added it in an edit - that's a bad habit of mine =)

@jordansread
Copy link

OK, I like this. @lindsaycarr and I are working on getting the data.frame set to do the math.

@lindsayplatt
Copy link
Contributor Author

lindsayplatt commented Jul 18, 2017

OK, so here is what I've got using R^2 for the most recent year of analytics data. The states listed are the ones that deviated the most from "expected" and the numbers in parentheses are the amounts they deviated.

image

All negative R^2s were treated as 0s

The script can be found in my branch: https://github.com/lindsaycarr/internal-analytics/blob/regionality/scripts/process/regionality_test.R

@jordansread
Copy link

Looks good! Alison, what do you think?

@aappling-usgs
Copy link
Member

Yeah, I like it! You could probably round the numbers more if you want - looks like you seldom need 1 digit after the decimal point, let alone 2. maybe just go for 2 sig figs?

@lindsayplatt
Copy link
Contributor Author

Good idea. @jread-usgs any desire to make this 0:10 instead of 0:1?

@aappling-usgs
Copy link
Member

PS - i couldn't remember how to do sig figs, so looked it up. in case it's not on the tip of your fingers already:

signif(c(.1234, 1.234, 12.34, 123.4), digits=2)
## [1]   0.12   1.20  12.00 120.00

@jordansread
Copy link

Yes, I think 0-10 as a metric would be great.

Would it be possible to mock up figure1 with this as the second column (only two columns total)?

@lindsayplatt
Copy link
Contributor Author

@jread-usgs I just read this again, but I'm actually not sure what you are talking about. Add this to Fig 1 of the internal-analytics home page? So the yearly analytics bars as column one and this as column two?

@jordansread
Copy link

Correct @lindsaycarr

@lindsayplatt
Copy link
Contributor Author

struggling to get the facets for yearly session totals to scale freely

image

@aappling-usgs
Copy link
Member

what have you tried? does this help? https://stackoverflow.com/a/21585521/3203184

@lindsayplatt
Copy link
Contributor Author

So, I didn't touch the plotting code; just changed the data. The plotting code has:

facet_grid(bin ~ type, scales = "free",
               space = "free_y", drop = TRUE)

which I think would do this appropriately. I've tried all options for scales and space (free, fixed, free_y, free_x)

@aappling-usgs
Copy link
Member

and if you delete the regionalization rows, everything is nice again?

@ldecicco-USGS
Copy link
Contributor

facet_grid is designed to use the same scale in x in 1 column, or the same y scale in 1 row...so you have to scale it yourself if you want to bypass that.

If you are using the processed data from trend_all (?), then you should be able to use "scaled_value", "scaled_newUser", and "text_placement".

@lindsayplatt
Copy link
Contributor Author

Ha - just figured it out. Needed it to ignore scaled_newUser

@lindsayplatt
Copy link
Contributor Author

Wait, that wasn't exactly it. I don't really understand why, but having both of these lines makes it not scale freely, but removing either of them allows it to.

geom_segment(aes(xend = longName), yend=0, size = 0.65, color = bar_line_col) +
geom_segment(aes(xend = longName, y = scaled_value),
                 yend=0, col=bar_line_col, size=1.15)

Result when removing first line:
image

@aappling-usgs
Copy link
Member

/shrug !

@wdwatkins
Copy link
Collaborator

@lindsayplatt what is the status of this work? Is the metric settled and ready for prime time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants