regionalization metric #177

lindsayplatt · 2017-07-17T14:21:53Z

preliminary implementation of the "regionality" scale for each app at the year scale. It doesn't need to be included in the app now and you could just make one bar chart figure for it. A "spearman rank correlation" will be a good way to get the index.

@jread-usgs in #internal-analytics Slack channel

I think we should calculate a "regionalization metric" (or some other name) that gets at the local vs national scale of the particular web page/app. I was thinking of a scale from 0 to 10, where 0 is local and 10 is national. To do this, we could (this is just an idea, but would make sense to think of different wants to do it) calculate the percentage of user traffic expected for each state based on population, and compare the actual percentages for each time period. A perfect match (e.g., 12.1% of traffic from CA, 1.8% from WI, 8.6% from TX, etc...) would be a 10 ("national"), which is very unlikely, and a lesser match (such as 80% of traffic from WI) would be closer to 0, or on the "local" end of the spectrum. So each app for each time period would have a score, and we could also have a cross-app score for comparison.

where would/could this be? It could be a horizontal bar below the map w/ fill between 0 and the score, and then a vertical slash that is taller than the bar thickness representing the mean/median of all apps for the same time period. If it was a useful way to distill this concept into a number like this, figure one could be a single column for the time period viewed, then column two could be same 0-10 bar, and a smaller column three could be the name of the state that is the greatest positive outlier from the expected distribution of visitors. Figure one would then also be reactive to the time period selected (instead of having all three).

Today's meeting discussed the need to tell better stories about what we do, and come up with better ways to measure them. I think this is a way can get a quick metric on the reach of many of the products that would contain these "stories". We have the gross numbers for visits clearly presented, but I think the "where" is just as important as how many to have access to as a quick view.

@aappling-usgs reaction:

cool idea! some of the more internally facing apps have high concentrations of users in california, colorado, and virginia - agreed that most important would be regionalization metric w.r.t. all users, but after that i'd be interested to see how enterprise apps are serving USGS users, which we could do by dividing by fraction of USGS that's in each state. could the all-users map toggle between being total per state and total per capita? (horizontal bar is fine too)

also, sorta wondering if there's a biodiversity / spatial autocorrelation metric for this

lindsayplatt · 2017-07-17T15:29:45Z

@jread-usgs @aappling-usgs I think I setup the correct comparison data to get these ranks. I just don't really know what to do for the actual index. I'm using cached session data from the process step, lastest_year.rds.

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
app_traffic <- readRDS('cache/process/lastest_year.rds')
app_traffic_grp <- dplyr::group_by(app_traffic, region)
app_traffic_sum <- dplyr::summarize(app_traffic_grp, traffic = sum(users))

# merge app traffic w/ population
app_traffic_sum <- dplyr::rename(app_traffic_sum, state = region)
traffic_by_state <- dplyr::left_join(us_pop, app_traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_by_state$traffic)
traffic_by_state <- dplyr::mutate(traffic_by_state, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index....
corr_result <- cor.test(x = traffic_by_state$pop_pct,
                        y = traffic_by_state$traffic_pct,
                        method = "spearman")

Am I trying to compare the right things?

jordansread · 2017-07-17T16:13:05Z

I think so...
traffic_by_state$traffic_pct is the actual percent of traffic for each state over the time period (yearly here, but would apply to month or week as well) and traffic_by_state$pop_pct is the percent of total us pop in that state?
can you calc the metric for a could of apps and share? Or do a barchart/lollipop with a point for each app and then we can see if it passes the smell test.

aappling-usgs · 2017-07-17T16:32:49Z

agreed with jread - looks good to me so far.

lindsayplatt · 2017-07-17T17:03:44Z

traffic_by_state$traffic_pct = percent of total users in each state
traffic_by_state$pop_pct = percent of total population in each state

jordansread · 2017-07-17T17:06:13Z

Possible to get the app names on the bar plot @lindsaycarr ?

lindsayplatt · 2017-07-17T17:08:01Z

The dataset I have cached only has viewIDs, and no corresponding app names. I'll run make and see if it changes. If not, I might be able to work something out -- not sure how quickly

lindsayplatt · 2017-07-17T17:12:00Z

Working something quick out with David and Walker, so new plot w/ names coming soon

lindsayplatt · 2017-07-17T17:24:21Z

Accompanying script:

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_grp <- dplyr::group_by(traffic, viewID, region)
traffic_sum <- dplyr::summarize(traffic_grp, traffic = sum(users))
traffic_sum <- dplyr::ungroup(traffic_sum)

# merge app traffic w/ population
traffic_sum <- dplyr::rename(traffic_sum, state = region)
traffic_pop <- dplyr::left_join(us_pop, traffic_sum)

# throw in app names
gaList <- yaml::yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- dplyr::left_join(traffic_pop, gaTable)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- dplyr::mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- dplyr::filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

jordansread · 2017-07-17T17:26:26Z

looks like we need Pearson correlation instead of Spearman

lindsayplatt · 2017-07-17T17:27:56Z

need to grab population data that is not from 1977 before we start making too many conclusions based on this

lindsayplatt · 2017-07-17T17:55:53Z

with 2012 population data:

library(choroplethr)
library(dplyr)
library(yaml)

# get population and percentage of total US population per state
# 2012 population data

data(df_pop_state)
us_pop <- df_pop_state
us_pop <- rename(us_pop, pop = value)
us_total <- sum(us_pop$pop)
us_pop <- mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_sum <- group_by(traffic, viewID, region) %>%
  summarize(traffic = sum(users)) %>%
  ungroup() %>%
  mutate(region = tolower(region))

# merge app traffic w/ population
traffic_pop <- left_join(us_pop, traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# throw in app names
gaList <- yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- left_join(traffic_pop, gaTable)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

jordansread · 2017-07-17T18:00:14Z

@lindsaycarr can you try it w/ Pearson's R?

lindsayplatt · 2017-07-17T18:20:26Z

Hmm we've got some negative guys in here. Added the top contributing state next to each bar.

Pearson's r

lindsayplatt · 2017-07-17T18:21:22Z

Some seem reasonable, but a lot are skewed because all the developers sit here in WI.

lindsayplatt · 2017-07-17T18:24:21Z

spearman w/ top states:

jordansread · 2017-07-17T18:31:58Z

I am not sure the best stat for this, but I think the linear relationship that Pearsons test is more appropriate than the ranking test Spearmans looks for (such as "are the order of states the same?").
I also think lumping all < 0.1 into a score of 0.1 for the time being would be fine.

What this isn't doing yet is properly dealing w/ zeros. For example, the NPS mercury viewer has a pretty "national" score, but really only has a small number of visits from 4 states.

aappling-usgs · 2017-07-17T19:04:00Z

agreed w/ jread that spearman's may look prettier, but pearson's makes more sense - i'm more interested in differences between the proportions of population and proportions of traffic than i am in any differences in how states rank relative to one another by population vs traffic.

alternatively, what about a metric constructed from the differences between traffic & population proportions (or quotients, i.e., traffic per capita)? what about root mean squared difference between traffic and population proportions? then the 0s would be -1, which is a big number, and big RMSDs would indicate a lack of evenness (numbers close to 0 indicate greater evenness).

jordansread · 2017-07-17T19:05:34Z

Thinking about this again...what we really want is a metric that gives us some measure of how far off the "real" percentages are from falling on a 1:1 with the "expected" (or population-based) numbers. So I think the slope here is important. Neither of the two metrics we have tried so far do this.

Ok, just saw alison weighed in too

aappling-usgs · 2017-07-17T19:06:47Z

yeah, agreed on the slope. i think RMSD fixes that part. unsure whether it introduces any other problems...would like to see how it looks.

aappling-usgs · 2017-07-17T19:53:31Z

revisiting my previous comment: i guess i'm thinking normalized differences, like this:

sqrt(mean((traffic_pct - pop_pct) / pop_pct )^2))

with the caveat that if pop_pct is 0.18% (wyoming), then traffic_pct of 1% becomes a huge number (5), and if there were only 100 visitors in a time period, traffic_pct could pretty much randomly flip between 1% and 2%. so small states will contribute noise. the alternative is absolute differences, which will pretty much ignore small states altogether.

jordansread · 2017-07-17T19:59:01Z

that's a RMSE calculation, isn't it? ☝️ or do those two mean the same thing...

aappling-usgs · 2017-07-17T20:01:22Z

yes; i just avoided E because they're not really errors in my mind.

the clarification was RMS[E/D] of the relative differences rather than the absolute differences. absolute differences would look like this:

sqrt(mean((traffic_pct - pop_pct))^2))

and i realized i hadn't specifically suggested relative differences in my previous comment.

jordansread · 2017-07-17T20:05:33Z

I think we'll need to boil this down to a scale that is going to be simple to pick up quickly- like a scale from 1-10 or 0-1. From local to national. Tricky to get a RMSD into that kind of scale but maybe there is a simple way to do that.

aappling-usgs · 2017-07-17T20:41:09Z

what about R^2 for traffic_pct ~ pop_pct? it penalizes relationships with slope != 1, usually ranges between 0 and 1. caveat is that it can be negative for this type of dataset (actual range is -Inf to 1). we could just truncate at a lower limit of 0 for plotting purposes...

example code for the above plots:

pop_pct <- seq(1,100,by=2)
traffic_pct <- pmax(0, rnorm(n=50, mean=seq(0,100,length.out=50), sd=3))
plot(traffic_pct ~ pop_pct, ylim=c(0,180), xlim=c(0,180)); abline(a=0,b=1)
met <- 1-sum((traffic_pct - pop_pct)^2)/sum((traffic_pct - mean(traffic_pct))^2)
title(paste0('R^2 = ', met))

jordansread · 2017-07-17T20:48:58Z

That allows an offset though, right? I thought this was the way to pin 1:1 and intercept of 0:

lm(y ~ 0 + offset(x))

but it doesn't have any freedom to it, so you can't calculate an R2

aappling-usgs · 2017-07-17T20:54:35Z

i thought we wanted to penalize slopes different from 1, which the above code does. i did the calculation from formula rather than fitting an lm; i'm not sure how you'd fit an lm so it penalized non-1 slopes

aappling-usgs · 2017-07-17T20:56:43Z

summary(lm(traffic_pct ~ 0 + offset(pop_pct)))

seems to always give r.squared = 0:

List of 9
 $ call         : language lm(formula = traffic_pct ~ 0 + offset(pop_pct))
 $ terms        :Classes 'terms', 'formula'  language traffic_pct ~ 0 + offset(pop_pct)
  .. ..- attr(*, "variables")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "offset")= int 2
  .. ..- attr(*, "factors")= int(0) 
  .. ..- attr(*, "term.labels")= chr(0) 
  .. ..- attr(*, "order")= int(0) 
  .. ..- attr(*, "intercept")= int 0
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "traffic_pct" "offset(pop_pct)"
 $ aliased      : logi(0) 
 $ residuals    : Named num [1:50] 42.8 36.8 40.8 -7 15.3 ...
  ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
 $ df           : int [1:3] 0 50 0
 $ coefficients : logi[0 , 1:4] 
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ sigma        : num 38.4
 $ adj.r.squared: num 0
 $ r.squared    : num 0
 - attr(*, "class")= chr "summary.lm"

jordansread · 2017-07-17T20:57:52Z

Gotcha. I understand your's now. I was thinking that was a snippet that went into an lm()

jordansread · 2017-07-17T20:58:10Z

I didn't see your snippet at the bottom the first time I looked.

aappling-usgs · 2017-07-17T20:58:31Z

i might have added it in an edit - that's a bad habit of mine =)

jordansread · 2017-07-17T20:59:13Z

OK, I like this. @lindsaycarr and I are working on getting the data.frame set to do the math.

lindsayplatt · 2017-07-18T18:14:14Z

OK, so here is what I've got using R^2 for the most recent year of analytics data. The states listed are the ones that deviated the most from "expected" and the numbers in parentheses are the amounts they deviated.

All negative R^2s were treated as 0s

The script can be found in my branch: https://github.com/lindsaycarr/internal-analytics/blob/regionality/scripts/process/regionality_test.R

jordansread · 2017-07-18T18:50:19Z

Looks good! Alison, what do you think?

aappling-usgs · 2017-07-18T18:54:05Z

Yeah, I like it! You could probably round the numbers more if you want - looks like you seldom need 1 digit after the decimal point, let alone 2. maybe just go for 2 sig figs?

lindsayplatt · 2017-07-18T18:56:50Z

Good idea. @jread-usgs any desire to make this 0:10 instead of 0:1?

aappling-usgs · 2017-07-18T18:57:50Z

PS - i couldn't remember how to do sig figs, so looked it up. in case it's not on the tip of your fingers already:

signif(c(.1234, 1.234, 12.34, 123.4), digits=2)
## [1]   0.12   1.20  12.00 120.00

jordansread · 2017-07-24T16:49:29Z

Yes, I think 0-10 as a metric would be great.

Would it be possible to mock up figure1 with this as the second column (only two columns total)?

lindsayplatt · 2017-07-25T13:01:34Z

@jread-usgs I just read this again, but I'm actually not sure what you are talking about. Add this to Fig 1 of the internal-analytics home page? So the yearly analytics bars as column one and this as column two?

jordansread · 2017-07-25T13:53:31Z

Correct @lindsaycarr

lindsayplatt · 2017-07-25T19:08:28Z

struggling to get the facets for yearly session totals to scale freely

aappling-usgs · 2017-07-25T19:15:01Z

what have you tried? does this help? https://stackoverflow.com/a/21585521/3203184

lindsayplatt · 2017-07-25T19:29:19Z

So, I didn't touch the plotting code; just changed the data. The plotting code has:

facet_grid(bin ~ type, scales = "free",
               space = "free_y", drop = TRUE)

which I think would do this appropriately. I've tried all options for scales and space (free, fixed, free_y, free_x)

aappling-usgs · 2017-07-25T19:32:33Z

and if you delete the regionalization rows, everything is nice again?

ldecicco-USGS · 2017-07-25T19:33:52Z

facet_grid is designed to use the same scale in x in 1 column, or the same y scale in 1 row...so you have to scale it yourself if you want to bypass that.

If you are using the processed data from trend_all (?), then you should be able to use "scaled_value", "scaled_newUser", and "text_placement".

lindsayplatt · 2017-07-25T19:37:02Z

Ha - just figured it out. Needed it to ignore scaled_newUser

lindsayplatt · 2017-07-25T19:40:11Z

Wait, that wasn't exactly it. I don't really understand why, but having both of these lines makes it not scale freely, but removing either of them allows it to.

geom_segment(aes(xend = longName), yend=0, size = 0.65, color = bar_line_col) +
geom_segment(aes(xend = longName, y = scaled_value),
                 yend=0, col=bar_line_col, size=1.15)

Result when removing first line:

aappling-usgs · 2017-07-25T19:42:39Z

/shrug !

wdwatkins · 2019-04-25T22:41:58Z

@lindsayplatt what is the status of this work? Is the metric settled and ready for prime time?

lindsayplatt self-assigned this Jul 17, 2017

lindsayplatt mentioned this issue May 22, 2019

Confirm that regionality metric calculation is correct usgs-makerspace/makerspace-sandbox#13

Closed

regionalization metric #177

regionalization metric #177

Comments

lindsayplatt commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

jordansread commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

jordansread commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017 • edited Loading

jordansread commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017

jordansread commented Jul 17, 2017

lindsayplatt commented Jul 17, 2017 • edited Loading

lindsayplatt commented Jul 17, 2017 • edited Loading

lindsayplatt commented Jul 17, 2017

jordansread commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017 • edited Loading

jordansread commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017

jordansread commented Jul 17, 2017 • edited Loading

aappling-usgs commented Jul 17, 2017

jordansread commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017 • edited Loading

jordansread commented Jul 17, 2017 • edited Loading

aappling-usgs commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017

jordansread commented Jul 17, 2017

jordansread commented Jul 17, 2017

aappling-usgs commented Jul 17, 2017

jordansread commented Jul 17, 2017

lindsayplatt commented Jul 18, 2017 • edited Loading

jordansread commented Jul 18, 2017

aappling-usgs commented Jul 18, 2017

lindsayplatt commented Jul 18, 2017

aappling-usgs commented Jul 18, 2017

jordansread commented Jul 24, 2017

lindsayplatt commented Jul 25, 2017

jordansread commented Jul 25, 2017

lindsayplatt commented Jul 25, 2017

aappling-usgs commented Jul 25, 2017

lindsayplatt commented Jul 25, 2017

aappling-usgs commented Jul 25, 2017

ldecicco-USGS commented Jul 25, 2017

lindsayplatt commented Jul 25, 2017

lindsayplatt commented Jul 25, 2017

aappling-usgs commented Jul 25, 2017

wdwatkins commented Apr 25, 2019

lindsayplatt commented Jul 17, 2017 •

edited

Loading

lindsayplatt commented Jul 17, 2017 •

edited

Loading

lindsayplatt commented Jul 17, 2017 •

edited

Loading

aappling-usgs commented Jul 17, 2017 •

edited

Loading

jordansread commented Jul 17, 2017 •

edited

Loading

aappling-usgs commented Jul 17, 2017 •

edited

Loading

jordansread commented Jul 17, 2017 •

edited

Loading

lindsayplatt commented Jul 18, 2017 •

edited

Loading