The purpose of this pipeline is to merge raw ground truth data, in raster or vector format, with satellite imagery to produce a set of standardized image tiles for training a machine learning model.
This particular implementation assumes a desired output of paired (data, mask) tilesets.
### Inputs
Name | Data Type | Description |
---|---|---|
Ground Truth | Spatial raster (`.tif`, `.jp2`, etc.) or vector (GeoJSON) file | Contains information about ground state (in this case, snow presence or absence) as measured by another method (e.g. ASO/SnowEX lidar). Can be either a binary GeoJSON vector or a raw or binary raster. |
Date | string or datetime | Date of ground truth data acquisition. Used to determine which imagery to acquire. |
Date Range | integer | Number of days around the ground truth acquisition date to search for imagery. |
### Outputs
Name | Data Type | Description |
---|---|---|
Image Tiles | Cloud storage bucket (S3) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing tiles with the same data architecture as the original imagery. |
Mask Tiles | Cloud storage bucket (S3, GCS) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing binary masks for training. |
The output Image Tiles are cropped to the extent of the ground truth data, and the sets of Image Tiles and Mask Tiles are identical: every image tile has a corresponding mask tile and vice versa.
The primary steps in this data transformation are: 1) ground truth pre-processing, 2) image acquisition and storage, 3) image pre-processing, and 4) image and mask tiling.
### Ground Truth Pre-Processing

input parameter | description |
---|---|
`--gt_file` | Ground truth data file (see Inputs above) |
`--threshold` (optional) | Threshold for real-valued raster input |
`--dst_crs` (optional) | EPSG code to reproject the input into; defaults to the original CRS |
`output_dir` (required) | Directory for output |
Output: this stage of the pipeline writes the binary raster produced by this processing step to `<output_dir>/<gt_file>_binary.tif` for use by later steps. It also produces a GeoJSON file containing the spatial extent of the ground truth for use by the image acquisition step.
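Here's a rough sketch of the thresholding path, assuming `rasterio` and `shapely` are available. All file names and the threshold value are illustrative, not the module's actual defaults:

```python
import json

import rasterio
from shapely.geometry import box, mapping

THRESHOLD = 0.5  # illustrative stand-in for --threshold

# Hypothetical paths; the real stage derives these from --gt_file
# and output_dir.
with rasterio.open("gt_file.tif") as src:
    data = src.read(1)
    profile = src.profile
    bounds = src.bounds

# Threshold the real-valued raster into a binary (0/1) mask.
binary = (data >= THRESHOLD).astype("uint8")

profile.update(dtype="uint8", count=1)
with rasterio.open("output_dir/gt_file_binary.tif", "w", **profile) as dst:
    dst.write(binary, 1)

# Write the ground truth extent as GeoJSON for the image acquisition step.
footprint = {
    "type": "Feature",
    "geometry": mapping(box(*bounds)),
    "properties": {},
}
with open("output_dir/gt_file_footprint.geojson", "w") as f:
    json.dump(footprint, f)
```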
### Tiling

input parameter | description |
---|---|
`--zoom` | Zoom level for output tiles |
`--indexes` | Raster band indices to include in tiles |
`--quant` (optional) | Value to divide bands by, if the input data is quantized |
`--aws_profile` (optional) | AWS profile name for `s3://` destinations |
`--skip_blanks` (optional) | Skip blank tiles |
`--cover` (optional) | CSV file listing the tiles to produce (default: all) |
`files` | File or files to tile |
`output_dir` | Destination for the tile directory; can be `s3://` |
Output: A directory at zoom level `<zoom>` containing GeoTIFF files representing the original input imagery.
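For reference, a `--cover` CSV could be generated from the ground truth footprint with `mercantile`, along these lines. This sketch assumes the footprint is a single polygon in EPSG:4326, and the column order is a guess:

```python
import csv
import json

import mercantile

ZOOM = 12  # illustrative stand-in for --zoom

# Footprint written by the ground truth pre-processing step.
with open("output_dir/gt_file_footprint.geojson") as f:
    geometry = json.load(f)["geometry"]

# Bounding box of the (assumed simple) polygon's exterior ring.
lngs = [lng for lng, lat in geometry["coordinates"][0]]
lats = [lat for lng, lat in geometry["coordinates"][0]]
west, south, east, north = min(lngs), min(lats), max(lngs), max(lats)

# Enumerate every XYZ tile touching the footprint at the target zoom.
with open("cover.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for tile in mercantile.tiles(west, south, east, north, [ZOOM]):
        writer.writerow([tile.x, tile.y, tile.z])
```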
All details below are out of date but are kept for reference.
### Image Acquisition

input parameter | description |
---|---|
`--gt_date` | Date of ground truth acquisition |
`--date_range` | Number of days around `gt_date` to search the imagery catalog |
`--max_images` (optional) | Constrains the number of images downloaded |
Using the `gt_date` and `date_range` parameters, we compute a date window over which to search the imagery catalog. We also use the GeoJSON output from step 1 to geographically constrain the imagery search. Eventually this process will be imagery-agnostic, but the current implementation uses the Planet Labs API.
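Here's a rough sketch of how the date window and the footprint could be combined into a search filter. The filter structure follows the Planet Data API filter convention; the dates, paths, and time-of-day bounds are illustrative:

```python
import datetime as dt
import json

GT_DATE = dt.date(2020, 3, 15)  # illustrative stand-in for --gt_date
DATE_RANGE = 5                  # illustrative stand-in for --date_range

start = GT_DATE - dt.timedelta(days=DATE_RANGE)
end = GT_DATE + dt.timedelta(days=DATE_RANGE)

# Footprint produced by the ground truth pre-processing step.
with open("output_dir/gt_file_footprint.geojson") as f:
    footprint = json.load(f)["geometry"]

# Planet Data API style filter: imagery acquired within the window
# AND intersecting the ground truth footprint.
search_filter = {
    "type": "AndFilter",
    "config": [
        {
            "type": "DateRangeFilter",
            "field_name": "acquired",
            "config": {
                "gte": start.isoformat() + "T00:00:00Z",
                "lte": end.isoformat() + "T23:59:59Z",
            },
        },
        {
            "type": "GeometryFilter",
            "field_name": "geometry",
            "config": footprint,
        },
    ],
}
```

A filter like this would then be POSTed to the Data API's quick-search endpoint; candidate selection (cloud cover, sorting, `--max_images`) would happen on the returned results.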
Open Questions:
- How do we select what images to download? Cloud cover? Sort by date?
- Do we select images that overlap spatially but not in time? (e.g. if the same meter of Earth is covered by several images on different days, do we just select the one closest to `gt_date`?)
Output: A cloud storage bucket containing GeoTIFF files representing raw 4-band images cropped to the extent of the ground truth dataset.
### Image Pre-Processing

Not totally sure what goes in here yet, but we will likely want to do something to the imagery before it gets tiled (perhaps a TOA correction or some such thing). This section leaves room for that.
### Tiling

input parameter | description |
---|---|
Four steps here:
- Tile the binary raster data mask into a cloud storage bucket.
- Tile all images into a cloud storage bucket.
- Make sure every image tile has a paired ground truth tile and that there are no orphan tiles (sketched below).
- Come up with some sort of standardized directory structure (maybe best to stick with XYZ/OSM tiles here and reorganize later for training?).
Output: A cloud storage bucket containing `/images` and `/masks` directories with some sort of standardized directory structure.
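Here's a minimal sketch of the orphan-tile check, assuming the tiles have been synced locally into `images/` and `masks/` directories with a `{z}/{x}/{y}.tif` layout:

```python
from pathlib import Path

def tile_keys(root):
    """Collect a (z, x, y) key for every tile under a root directory."""
    root = Path(root)
    return {
        tuple(p.relative_to(root).with_suffix("").parts)
        for p in root.rglob("*.tif")
    }

image_keys = tile_keys("images")
mask_keys = tile_keys("masks")

orphan_images = image_keys - mask_keys  # image tiles with no mask
orphan_masks = mask_keys - image_keys   # mask tiles with no image
assert not orphan_images and not orphan_masks, (orphan_images, orphan_masks)
```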
The major steps in this pipeline will be implemented as containerized Python modules and linked together with the Kubeflow pipeline system. Some of the components may contain Cloud Dataflow (i.e. Apache Beam) workflow elements. This document outlines the containers that will be connected to perform the intermediate operations.
Consumes ground truth data as above (`--gt_raw`) and outputs `{gt_raw}_gt_binary_raster` and `{gt_raw}_gt_footprint` into a directory (`--output_dir`). The binary raster is created either by:

- rasterizing a polygon,
- thresholding a real-valued raster via the `--threshold` arg, or
- doing nothing (returning the input binary raster).
`{gt_raw}_gt_binary_raster.tif` and `{gt_raw}_gt_footprint.geojson` are placed into `/gt_processed`, either in a cloud storage bucket or a local folder (Kubeflow global pipeline variable `output_dir`).
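The thresholding path is sketched earlier in this document; for the polygon path, here's a rough sketch using `rasterio.features.rasterize`. The grid size, bounds, and CRS are stand-ins, and the real module would derive them from the input vector and `--dst_crs`:

```python
import json

import rasterio
from rasterio.features import rasterize
from rasterio.transform import from_bounds

# Hypothetical target grid; in practice derived from the vector bounds
# and a chosen resolution.
WIDTH, HEIGHT = 1024, 1024
BOUNDS = (-108.3, 37.8, -107.7, 38.2)  # illustrative west, south, east, north
transform = from_bounds(*BOUNDS, WIDTH, HEIGHT)

# Hypothetical vector ground truth (--gt_raw as GeoJSON).
with open("gt_raw.geojson") as f:
    features = json.load(f)["features"]

# Burn polygons in as 1; everything else stays 0.
binary = rasterize(
    [(feat["geometry"], 1) for feat in features],
    out_shape=(HEIGHT, WIDTH),
    transform=transform,
    fill=0,
    dtype="uint8",
)

profile = {
    "driver": "GTiff",
    "width": WIDTH,
    "height": HEIGHT,
    "count": 1,
    "dtype": "uint8",
    "crs": "EPSG:4326",
    "transform": transform,
}
with rasterio.open("gt_processed/gt_raw_gt_binary_raster.tif", "w", **profile) as dst:
    dst.write(binary, 1)
```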
How in particular do we pass around the variables / inputs / outputs?
Consumes `{gt_raw}_gt_footprint.geojson` (`--footprint`) along with `--date` and `--date_range` arguments, and queries an image search API to identify download candidates. Selects candidates (several options are potentially available for this: `--max_images`, `--max_cloud`, etc.) and uses the Planet Clips API to download imagery within the bounds of the data footprint. Imagery with ID = `ID` is unzipped and placed into `/images/{ID}` within local storage or a cloud storage bucket (`--output_dir`).
Still in progress here: not entirely sure whether it's best to keep each image tiled in its own directory or to try to merge all images together. Keeping image tiles in their own directories seems to allow more downstream flexibility.
We'll use this container for two steps in the pipeline: first to tile the binary mask raster, and again to run a distributed tiling operation on the images in `/images/{ID}`.
As a result, this container will contain two related but distinct Python functions. The first will tile a single image, and the second will be a Cloud Dataflow operation to tile a whole directory of images. The pipeline will run these two operations separately, but both derive from this `tile` container.
Single image tiler: will take in `--image` and perhaps `--zoom_level` and produce an XYZ/OSM tile structure from the image. Except: these tiles will likely remain TIFF files, so we can use multiple bands in training, rather than the typical PNG format used for OSM tiles.
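Here's a rough sketch of what the single image tiler could look like, using `rasterio`'s `WarpedVRT` together with `mercantile`. Paths, zoom, and tile size are illustrative, and edge cases (nodata, band selection via `--indexes`, blank-tile skipping) are omitted:

```python
import os

import mercantile
import rasterio
from rasterio.enums import Resampling
from rasterio.transform import from_bounds
from rasterio.vrt import WarpedVRT
from rasterio.warp import transform_bounds

ZOOM = 12        # illustrative stand-in for --zoom_level
TILE_SIZE = 256

with rasterio.open("image.tif") as src:  # hypothetical input image
    # Image bounds in lng/lat, to enumerate the covering XYZ tiles.
    west, south, east, north = transform_bounds(src.crs, "EPSG:4326", *src.bounds)
    for tile in mercantile.tiles(west, south, east, north, [ZOOM]):
        tb = mercantile.xy_bounds(tile)  # tile bounds in web mercator meters
        tile_transform = from_bounds(
            tb.left, tb.bottom, tb.right, tb.top, TILE_SIZE, TILE_SIZE
        )
        # Warp just this tile's footprint out of the source image.
        with WarpedVRT(
            src,
            crs="EPSG:3857",
            transform=tile_transform,
            width=TILE_SIZE,
            height=TILE_SIZE,
            resampling=Resampling.bilinear,
        ) as vrt:
            data = vrt.read()
        # Keep tiles as multi-band GeoTIFFs rather than the usual PNG.
        path = f"tiles/{tile.z}/{tile.x}/{tile.y}.tif"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        profile = {
            "driver": "GTiff",
            "width": TILE_SIZE,
            "height": TILE_SIZE,
            "count": data.shape[0],
            "dtype": data.dtype.name,
            "crs": "EPSG:3857",
            "transform": tile_transform,
        }
        with rasterio.open(path, "w", **profile) as dst:
            dst.write(data)
```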
Multiple image tiler: TBD; still not quite sure how to structure the Beam dataflow here.
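One plausible shape for it is simply fanning the single image tiler out over every acquired image. In this sketch, `tile_image` is a hypothetical import of the single-image function above, and runner options are assumed to come from the pipeline environment:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

from tiler import tile_image  # hypothetical: the single image tiler above

def run(image_paths, output_dir):
    """Fan the single image tiler out over every acquired image."""
    options = PipelineOptions()  # runner/project flags supplied externally
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ListImages" >> beam.Create(image_paths)
            | "TileImage" >> beam.Map(tile_image, output_dir=output_dir)
        )
```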
This module is responsible for creating a train-validation split of the images.
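As a minimal sketch of one way this could work, splitting at the tile level by (z, x, y) key so each (image, mask) pair stays together:

```python
import random

def train_val_split(tile_keys, val_fraction=0.2, seed=42):
    """Randomly partition tile keys into train and validation sets.

    Splitting by tile key keeps each (image, mask) pair together.
    """
    keys = sorted(tile_keys)   # deterministic base order
    rng = random.Random(seed)  # reproducible shuffle
    rng.shuffle(keys)
    n_val = int(len(keys) * val_fraction)
    return keys[n_val:], keys[:n_val]

# Hypothetical usage with (z, x, y) keys from the tiling step.
train_keys, val_keys = train_val_split({(12, 653, 1580), (12, 653, 1581)})
```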