formalize preprocessing + incorporate into rest of workflow #1

Merged
34 commits
33d401e
create separate yaml and workflow for preprocessing steps
Feb 13, 2018
ce1d650
remove preprocessing-related things from viz.yaml
Feb 13, 2018
3c5d6c6
adding aws credentials
Feb 13, 2018
f965be3
move bucket creation script
Feb 14, 2018
4d90509
add script to share code for running the preprocessing steps
Feb 14, 2018
78be601
update aws credential profile reference
Feb 14, 2018
d2d9d1b
format + correct filename
Feb 14, 2018
f8cb41e
save instead of get for s3 download
Feb 14, 2018
88f5ba0
convert to remake file
Feb 14, 2018
6eb45f6
rework into remake format
Feb 14, 2018
121c34a
still reformatting
Feb 14, 2018
b202b10
add full target
Feb 14, 2018
c073b43
rename
Feb 14, 2018
95ecbc6
cleanup unused stuff
Feb 14, 2018
cc71fc2
shuffle/rename preprocessing scripts
Feb 14, 2018
8267603
remake for preprocessing working
Feb 14, 2018
ebbc33e
detail info for preprocessing workflow
Feb 14, 2018
b143cac
work in preprocessed county boundaries zip from s3 to viz.yaml
Feb 14, 2018
3bf9191
rename preprocessing file
Feb 14, 2018
074a606
get data pushed to target (boundary json not working yet)
Feb 14, 2018
3fb8e79
loop & pass args into shell script
Feb 15, 2018
675bdc9
instructions for topojson install
Feb 15, 2018
444c785
use fips csv in shell script
Feb 15, 2018
afae3cf
add publisher for moving multiple json files into target/data
Feb 15, 2018
28d56ee
delete unused things from previous attempt at publishing multiple jso…
Feb 15, 2018
c8669dd
move preprocessing script + add remake pkg details
Feb 15, 2018
f77e5bb
change description for a copied function
Feb 15, 2018
2c9295a
reword description of execute_preprocessing.R
Feb 15, 2018
62bf58d
add as.viz
Feb 15, 2018
36391cf
move install comment to correct spot
Feb 15, 2018
eaf9d16
add r package to function call
Feb 15, 2018
3eecee5
add message to simplify shell to follow progress
Feb 15, 2018
f5fd084
remove code that limits to just AZ and AL
Feb 15, 2018
f6dd4af
clean up yaml + add estimates for time and storage for each step
Feb 15, 2018
49 changes: 49 additions & 0 deletions preprocess.yaml
@@ -0,0 +1,49 @@
target_default: preprocess

packages:
  - vizlab

sources:
  - scripts/preprocess/fetch_s3_object.R
  - scripts/preprocess/clean_county_boundaries.R
  - scripts/preprocess/save_state_fips.R
  - scripts/preprocess/execute_shell_script.R
  - scripts/preprocess/push_s3_object.R

targets:

  # --- fetch --- #

  # takes about 12 minutes
  cache/IPUMS_NHGIS_counties.zip:
    command: fetch_s3_object(target_name, I("IPUMS_NHGIS_counties.zip"), I("viz-water-use-15"))

  # --- process --- #

  # takes about 45 minutes & about 5.5 GB of disk space for all FIPS
  cache/county_boundaries_geojson.zip:
    command: clean_county_boundaries(target_name, "cache/IPUMS_NHGIS_counties.zip")

  # takes about 30 seconds
  cache/state_fips.csv:
    command: save_state_fips(target_name, "cache/county_boundaries_geojson.zip", I("states.json"))

  # takes about 10 minutes & about 3 GB of disk space for all FIPS
  cache/county_boundaries_topojson.zip:
    command: execute_shell_script(target_name, "cache/county_boundaries_geojson.zip",
      I("scripts/preprocess/topo_county_boundaries.sh"),
      "cache/state_fips.csv")

  # --- publish --- #

  # takes about 12 minutes
  s3boundariesfile:
    command: push_s3_object(I("county_boundaries_topojson.zip"),
      "cache/county_boundaries_topojson.zip",
      I("viz-water-use-15"))

  # --- final --- #

  preprocess:
    depends:
      - s3boundariesfile
15 changes: 15 additions & 0 deletions scripts/fetch/s3_object.R
@@ -0,0 +1,15 @@

fetchTimestamp.s3_object <- vizlab::alwaysCurrent
Owner commented:
this is fine for now but reminds me that we ought to get an s3 fetcher implemented in vizlab proper, so we can use s3 timestamps/hashes to keep everybody's files up to date. USGS-VIZLAB/vizlab#333
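A rough sketch of what such a timestamp-based fetcher could look like (assumptions, not part of this PR: that aws.s3::head_object() returns the Last-Modified response header as an attribute, and that returning a POSIXct time satisfies vizlab's fetchTimestamp contract):

# sketch only: use the S3 object's Last-Modified header as the fetch timestamp
# instead of vizlab::alwaysCurrent
fetchTimestamp.s3_object <- function(viz){
  args <- viz[["fetch_args"]]
  resp <- aws.s3::head_object(object = args[["object_name"]],
                              bucket = args[["bucket_name"]])
  # assumes response headers come back as attributes; parse the HTTP date
  as.POSIXct(strptime(attr(resp, "last-modified"),
                      format = "%a, %d %b %Y %H:%M:%S", tz = "GMT"))
}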


fetch.s3_object <- function(viz){

  args <- viz[["fetch_args"]]

  aws.signature::use_credentials(profile='default', file=aws.signature::default_credentials_file())

  # download object from an s3 bucket
  object_fn <- aws.s3::save_object(object = args[["object_name"]],
                                   bucket = args[["bucket_name"]],
                                   file = viz[["location"]])
  return(object_fn)
}
scripts/preprocess/clean_county_boundaries.R
@@ -1,9 +1,13 @@
 #' Cleans data for historical county polygons.
-process.county_boundaries <- function(viz){
-  deps <- readDepends(viz)
+clean_county_boundaries <- function(location, shp_zip_fn){
 
+  library(sf)
+  library(dplyr)
+  library(geojsonio)
+  library(jsonlite)
+
   # unzip the shapefiles, which are zip files within a zip file
-  map_zip <- deps$county_boundaries_zip
+  map_zip <- shp_zip_fn
   map_dir <- file.path(tempdir(), 'county_boundaries')
   unzip(map_zip, exdir=map_dir)
   map_shp_zips <- dir(dir(map_dir, full.names=TRUE), full.names=TRUE)
@@ -38,7 +42,7 @@ process.county_boundaries <- function(viz){
   counties <- consolidate_county_info(all_shps_simple)
 
   # split the country-wide shapefiles into state-wide shapefiles
-  split_shps <- lapply(setNames(nm=states$state_FIPS[c(1,4)]), function(state_fips) {
+  split_shps <- lapply(setNames(nm=states$state_FIPS), function(state_fips) {
     message('splitting out shapefiles for state ', state_fips)
 
     # subset to just one state
@@ -73,7 +77,7 @@ process.county_boundaries <- function(viz){
   # save to one big zip file
   oldwd <- setwd(geojsondir)
   on.exit(setwd(oldwd))
-  zipfile <- file.path(oldwd, viz[['location']])
+  zipfile <- file.path(oldwd, location)
   if(file.exists(zipfile)) file.remove(zipfile)
   filestozip <- dir()
   zip(zipfile, files=filestozip)
47 changes: 47 additions & 0 deletions scripts/preprocess/execute_preprocessing.R
@@ -0,0 +1,47 @@
# This file describes how to run `preprocess.yaml`, which is the yaml that
# orchestrates the steps required for preprocessing the county boundary and
# state/county fips data. This should not need to be executed by every
# contributor because the results are stored in the S3 bucket. Most should
# just worry about the viz.yaml.

# This workflow assumes that you have the required R packages and appropriate
# credentials (with the profile as "default") stored in:
aws.signature::default_credentials_file()

# required for topo_county_boundaries.sh
# install node.js https://nodejs.org/en/, then run
# npm install -g topojson

# required R packages:
#
# aws.s3:
# repo: CRAN
# version: 0.3.3
# aws.signature:
# repo: CRAN
# version: 0.3.5
# dplyr:
# repo: CRAN
# version: 0.7.4
# geojsonio:
# repo: CRAN
# version: 0.5.0
# jsonlite:
# repo: CRAN
# version: 1.5
# remake:
# repo: github
# version: 0.3.0
# name: richfitz/remake
# sf:
# repo: CRAN
# version: 0.6.0

# run the full preprocessing workflow
# this will take ~30 minutes; the longest step is fetching the data from s3
remake::make(target_names = "preprocess",
remake_file = "preprocess.yaml")

# run an individual target:
remake::make(target_names = "cache/county_boundaries_topojson.zip",
remake_file = "preprocess.yaml")
6 changes: 6 additions & 0 deletions scripts/preprocess/execute_shell_script.R
@@ -0,0 +1,6 @@
execute_shell_script <- function(location, zipfilepath, shell_script_fn, statecsvpath){

  cmd <- paste("bash", shell_script_fn, zipfilepath, statecsvpath, location)
  system(cmd)

}
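For the cache/county_boundaries_topojson.zip target in preprocess.yaml, the assembled command is effectively "bash scripts/preprocess/topo_county_boundaries.sh cache/county_boundaries_geojson.zip cache/state_fips.csv cache/county_boundaries_topojson.zip" (argument order: input geojson zip, state fips csv, output zip). A variant that fails loudly on a non-zero exit status could look like this sketch (not part of the PR):

# sketch: pass arguments separately and stop if the shell script fails
execute_shell_script <- function(location, zipfilepath, shell_script_fn, statecsvpath){
  status <- system2("bash", args = c(shell_script_fn, zipfilepath, statecsvpath, location))
  if (status != 0) stop(shell_script_fn, " exited with status ", status)
}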
10 changes: 10 additions & 0 deletions scripts/preprocess/fetch_s3_object.R
@@ -0,0 +1,10 @@
fetch_s3_object <- function(location, obj_name, bucket_name){

  aws.signature::use_credentials(profile='default', file=aws.signature::default_credentials_file())

  # download object from an s3 bucket
  object_fn <- aws.s3::save_object(object = obj_name,
                                   bucket = bucket_name,
                                   file = location)
  return(object_fn)
}
10 changes: 10 additions & 0 deletions scripts/preprocess/push_s3_object.R
@@ -0,0 +1,10 @@
push_s3_object <- function(s3_fn, existing_fn, bucket_name) {

  aws.signature::use_credentials(profile='default', file=aws.signature::default_credentials_file())

  s3_push <- aws.s3::put_object(file = existing_fn,
                                object = s3_fn,
                                bucket = bucket_name)

  return(s3_push)
}
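To spot-check that the upload landed, something along these lines could be run afterwards (a suggestion, not part of this PR; assumes the same 'default' credentials profile is loaded):

# returns TRUE (with the response headers as attributes) if the object exists in the bucket
aws.s3::head_object("county_boundaries_topojson.zip", bucket = "viz-water-use-15")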
13 changes: 13 additions & 0 deletions scripts/preprocess/save_state_fips.R
@@ -0,0 +1,13 @@
save_state_fips <- function(location, zipfilepath, jsonfilepath) {

  # get states.json into cache/
  dir_name <- dirname(zipfilepath)
  unzip(zipfilepath, files = jsonfilepath, exdir = dir_name)

  # read json and create vector of just fips
  states_info <- jsonlite::fromJSON(file.path(dir_name, jsonfilepath))
  fips <- states_info[["state_FIPS"]]

  # write.csv ignores col.names, so use write.table to drop the header row
  write.table(fips, location, sep = ",", col.names = FALSE, row.names = FALSE, quote = FALSE)
}
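Since the csv is written without a header row, downstream R code would read it back with something like the following (illustrative only, not part of this PR):

# header = FALSE because save_state_fips() drops col.names;
# colClasses keeps any leading zeros in the FIPS codes
fips <- read.csv("cache/state_fips.csv", header = FALSE, colClasses = "character")[[1]]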
45 changes: 45 additions & 0 deletions scripts/preprocess/topo_county_boundaries.sh
@@ -0,0 +1,45 @@
#!/bin/bash

# create a temp directory
TMP=$(mktemp -d)

# unzip the geojson
unzip $1 -d $TMP

# pick out the geojson files (exclude counties.js and states.js)
GJ=$(dir $TMP/*.geojson)

# list state fips for now
while read fip
Owner: neat! good job getting this bash loop figured out.
Author: the real mvp: @wdwatkins
Owner: 💯
do

  fipfixed=$(echo "$fip" | tr -d '\r')
  path="$TMP"/"$fipfixed"

  # convert to topojson
  geo2topo \
    state=$path.geojson \
    -o $path.json

  # simplify
  toposimplify -s 1e-4 -f \
    $path.json \
    -o $path-simple.json

  # quantize (store as integers, scale later)
  topoquantize 1e5 \
    $path-simple.json \
    -o $path-quantized.json

  echo "Finished $fipfixed"

done < $2

echo All done

# zip back up for storage in cache/
WD=$(pwd)
cd "$TMP"
zip "$WD/$3" ./*quantized.json states.json counties.json
cd "$WD"

41 changes: 0 additions & 41 deletions scripts/process/topo_county_boundaries.sh

This file was deleted.

40 changes: 40 additions & 0 deletions scripts/publish/multiple_json_files.R
@@ -0,0 +1,40 @@
publish.multiple_json_files <- function(viz) {
  deps <- readDepends(viz)
  args <- viz[["publish_args"]]
  file_pattern <- args[["pattern"]]

  # unzip if it's a zip file
  if(grepl(".zip", deps[["files_location"]])) {
    # unzip and cache in a folder before publishing
    extract_boundary_files(deps[["files_location"]], file_pattern, viz[["location"]])
    paths_to_use <- list.files(viz[["location"]], full.names = TRUE)

  } else {
    # paths are just the files in the passed-in location if they aren't zipped
    paths_to_use <- list.files(deps[["files_location"]], full.names = TRUE)
  }

  for(fp in paths_to_use) {

    # create viz-like item to use in publish
    viz_json <- vizlab::as.viz(list(location = fp, mimetype = "application/json"))

    # use publisher to follow typical json publishing steps to get file to target
    vizlab::publish(viz_json)
  }

}

#' Extract files from a zipfile
#'
#' @param zipfile the name of the .zip file
#' @param pattern pattern matched against filenames with grep to decide
#'   which files to extract
#' @param exdir where to extract the zipfiles
extract_boundary_files <- function(zipfile, pattern, exdir) {

  allfiles <- unzip(zipfile=zipfile, list=TRUE)[["Name"]]
  boundaryfiles <- allfiles[grep(pattern, allfiles)]

  unzip(zipfile=zipfile, files=boundaryfiles, exdir=exdir, overwrite=TRUE)
}
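A hypothetical standalone call to the helper, handy for eyeballing what the publisher would extract (the paths and the "quantized" pattern are illustrative, based on what topo_county_boundaries.sh zips up; not part of the PR):

extract_boundary_files(zipfile = "cache/county_boundaries_topojson.zip",
                       pattern = "quantized",
                       exdir   = "target/data")
list.files("target/data")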
14 changes: 14 additions & 0 deletions scripts/s3_bucket_setup.R
@@ -0,0 +1,14 @@
# This file only needs to be run one time, by one person, for the whole project.
# Including so it's easier to create buckets again in the future.

library(aws.signature)
message('check that credentials for dev profile at ', aws.signature::default_credentials_file(), " match those in get_dssecret('dev-owi-s3-access')")
aws.signature::use_credentials(profile='dev', file=aws.signature::default_credentials_file())

library(aws.s3)
bucketlist() # to see which buckets are already there
new_bucket_name <- 'viz-water-use-15' # convention: 'viz-' followed by the github repo name for the vizzy
put_bucket(new_bucket_name, region='us-west-2', acl='private') # gives error if bucket already exists

# this command posted the data (took 1.5 hrs)
put_object(file='data/nhgis0002_shape.zip', object='IPUMS_NHGIS_counties.zip', bucket='viz-water-use-15')