
Conversation

@weiji14 weiji14 commented Aug 29, 2019

Split off the slow REMA gapfilling task from the bulky data_prep.selective_tile script, adhering to DRY principles (no more nesting a with inside a with statement when reading from a raster...). The lighter function will give us more room to play with parallel code, potentially making our tiling code perform faster and squeezing out more precision without worrying about speed!

TODO:

  • Deprecate selective_tile_old and ability to gapfill using another raster (690c365)
  • Refactor selective_tiling to be faster and even more precise
  • Write the one off script to gapfill 100m REMA with 200m filler version
  • Create new training tiles once more, consider resampling MEaSUREs Ice Velocity back to 500m

Setting fire to all that code to gapfill a raster with another raster as it's very messy, and we only need it for REMA now since we are no longer gapfilling MEaSUREs with #165 merged in. Still keeping the option to gapfill with a single floating point number, but we're removing the selective_tile_old function that has sat alongside selective_tile since aac21fb in #156 of v0.9.2. Temporarily using a bilinear resampled 200m REMA in deepbedmap.ipynb. Will follow up with code to produce a gapfilled 100m resolution REMA geotiff!
@weiji14 weiji14 added bug 🪲 Something isn't working enhancement ✨ New feature or request data 🗃️ Pull requests that update input datasets labels Aug 29, 2019
@weiji14 weiji14 self-assigned this Aug 29, 2019
@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/weiji14/deepbedmap/pull/167

You'll be able to see notebook diffs and discuss changes. Powered by ReviewNB.

With great precision comes great need for optimization. We now force our data_prep.selective_tile function to bilinearly interpolate the Z value at every exact XY point, instead of 'slicing', which could still leave sub-pixel discrepancies. This extends the work done in 7fd3345. The out_shape option is replaced with a 'resolution' setting, and the geotiff/netcdf files are now loaded in as a dask array by default.
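To illustrate the difference (a minimal standalone sketch, not the actual selective_tile code): plain slicing snaps to the nearest whole pixel, while bilinear interpolation blends the four surrounding cells so the value lands exactly on the requested XY point.

```python
def bilinear(grid, x, y):
    """Bilinear interpolation on a 2D grid with unit spacing.

    grid[row][col] holds Z values; (x, y) are fractional column/row
    coordinates. A plain slice like grid[int(y)][int(x)] would snap
    to the nearest whole pixel, carrying a sub-pixel discrepancy.
    """
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0][x0] + fx * grid[y0][x0 + 1]
    bottom = (1 - fx) * grid[y0 + 1][x0] + fx * grid[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom

# Halfway between four cells: the blend of all of them.
z = bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)  # 1.5
```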

Yes, this slows things down by a fair bit, but we've wrapped the computationally heavy interpolation and masking methods using dask.delayed, so that tasks can be gathered up and processed in one go (if I can get the right scheduler settings...). Technically the code should be able to run nicely in parallel now, but GeoTIFFs aren't exactly built for parallel reads and big ol' REMA is a RAM-hungry beast so we're stopping short here of using dask.distributed.
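The dask.delayed pattern described above, in a toy form (cut_tile here is a hypothetical stand-in for the heavy interpolation/masking work, operating on nested lists instead of a real raster): tasks are built lazily, then gathered and computed in one go so the scheduler can parallelise them.

```python
import dask


@dask.delayed
def cut_tile(array, row, col, size):
    # Placeholder for the computationally heavy interpolation/masking;
    # here we just slice a size x size window out of a nested-list "raster".
    return [r[col:col + size] for r in array[row:row + size]]


raster = [[c + 10 * r for c in range(6)] for r in range(6)]

# Build the task graph lazily, then process every tile in one go,
# letting dask's scheduler decide how to run the tasks in parallel.
tasks = [cut_tile(raster, r, c, 2) for r, c in [(0, 0), (2, 2), (4, 4)]]
tiles = dask.compute(*tasks)
```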
Write script for gapfilling the 100m REMA Ice Surface Elevation DEM with the seamless 200m version (bilinear interpolated), i.e. a more proper version of cd72a30 of #64. Based on the old selective_tiling code's gapfill_raster section that we deprecated in 690c365. I've experimented with alternative methods such as making a virtual (.vrt) GeoTIFF, mosaicking using pure GDAL and rasterio's merge tool, even considered GMT's grdblend, but nothing really merges the two together (with REMA_100 as highest priority, then REMA_200) properly the way I want it, in a reasonable-ish amount of time. Might be good to actually output this homemade 100m gapfilled REMA to a Cloud-Optimized GeoTIFF, NetCDF or Zarr, but we'll stick to good ol' GeoTIFF for now, even if it is 9.9GB.
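The merge priority described above (REMA_100 first, then the bilinear-resampled REMA_200 as filler) boils down to a masked overwrite. A hypothetical numpy sketch, assuming both rasters have already been resampled onto the same 100m grid:

```python
import numpy as np


def gapfill(primary, filler, nodata=np.nan):
    """Fill holes in `primary` with values from `filler` (same shape/grid).

    Hypothetical stand-in for the one-off REMA gapfilling script:
    primary = 100m REMA with voids, filler = 200m REMA bilinearly
    resampled up to the 100m grid.
    """
    mask = np.isnan(primary) if np.isnan(nodata) else primary == nodata
    return np.where(mask, filler, primary)


rema100 = np.array([[1.0, np.nan], [np.nan, 4.0]])
rema200_up = np.array([[9.0, 2.0], [3.0, 9.0]])
filled = gapfill(rema100, rema200_up)  # [[1., 2.], [3., 4.]]
```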
@ghost

ghost commented Aug 30, 2019

DeepCode Report (#1e7f69)

DeepCode analyzed this pull request.
There is 1 new info report.

Keeping things DRY again, moving the save_array_to_grid function from deepbedmap.ipynb to data_prep.ipynb since we're using it to write the REMA_100_dem_filled GeoTIFF. Made it into a more reusable function, though there's a lot of nasty hardcoded default settings specific to DeepBedMap (who uses a -2000m value for NaN?!) and far too many options that probably should be turned into kwargs. It now takes in an ndim=3 array in CHW format instead of NCHW as before, has optional saving to NetCDF, and defaults to creating a BigTIFF!
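To illustrate the interface change (array contents hypothetical): save_array_to_grid now expects a single CHW image rather than a batched NCHW stack, so callers drop the batch axis first, and the hardcoded -2000m nodata sentinel needs masking before any analysis.

```python
import numpy as np

# A batch of one single-band 4x4 image, as the old NCHW interface took it.
batch_nchw = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)

# New interface: pass one image in CHW (channels, height, width) format.
image_chw = batch_nchw[0]  # shape (1, 4, 4)

# DeepBedMap's hardcoded nodata sentinel; mask it out as NaN.
NODATA = -2000.0
image_chw[0, 0, 0] = NODATA
masked = np.where(image_chw == NODATA, np.nan, image_chw)
```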

It was really just the GeoTIFF compression I wanted, because Quilt wasn't accepting the >9GB REMA_100m_dem_filled.tif file. Using LZW compression, the filesize is now down to about 4.7GB, and we are trading off size for speed here, i.e. reading from this compressed REMA GeoTIFF can be significantly slower than the original uncompressed version. Would preferably have used ZSTD compression (see https://gis.stackexchange.com/questions/1104/should-gdal-be-set-to-produce-geotiff-files-with-compression-which-algorithm-sh/333578#333578), if only the rasterio wheels actually had it (see rasterio/rasterio-wheels#23). I was also doing some detective work on why rasterio has GDAL 2.4.1 whereas our conda version is 2.4.2, might need to set a GDAL_DRIVER_PATH?
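For reference, the kind of creation profile involved (a sketch; the exact keys used in the notebook may differ) — with rasterio these dict entries map onto GDAL's GeoTIFF creation options:

```python
# GeoTIFF creation options, as would be passed via rasterio's profile.
# COMPRESS=LZW roughly halves REMA_100m_dem_filled.tif (~9.9GB -> ~4.7GB)
# at the cost of slower reads; ZSTD would be preferable if the rasterio
# wheels shipped with it. BIGTIFF=YES is needed for files over 4GB.
profile = {
    "driver": "GTiff",
    "compress": "lzw",  # would be "zstd" if the wheels supported it
    "bigtiff": "yes",
    "tiled": True,
}
```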
Bringing back the 1km padding to resolve lousy predictions at the border areas, and resample the MEaSUREs Ice Velocity dataset from 450m to 500m resolution again, sigh. Basically reverting 7cd28af, but with a bit of a twist. Also using the new REMA_100m_dem_filled.tif we've recently created. Quilt packages reuploaded, now with hash e9494ecd4b4fe1bc1f8b28d4d73a287d093c83dc76eb173822b7e90753d92f27.
Sorry, didn't upload the correct BEDMAP2 (X_data) tiles in my haste yesterday. Properly reuploading those padded tiles with shape (3379,1,11,11) instead of (3379,1,9,9), patching 50ddb5e. Correct quilt data hash to use now is d092c5c00e9b6ceaac3a3bf431dd070f0a2809ef4f552e55faa9d01c1d5dd270.
Towards even better pixel registration at the expense of more computing cost! Patches #150 properly. The xyz_to_grid function has been giving us gridline node registered grids since forever, because that's `gmt surface`'s default setting. Now converting it to pixel node registration to align better with rasterio and other geospatial packages. Also using simple slicing for our high resolution tiles to avoid interpolating to NaNs at the edges! Total tile counts have increased from 3379 to 4028, O.O, mostly from the Basler grid around the Siple Coast region. New training tiles updated on Quilt with hash af86cf135ffe5ed9f78fc65231e3aa0bfc90f45e33b6bccda9f08f392c090113.

Though `surface` does have an `-r` setting to set pixel node registration directly, I looked at it and it gave strange diagonal strips for 2007tx.nc, so nope, we'll use `grdsample` instead, the only fault being we lose data lineage/provenance when checking the grid using `gmt grdinfo`. Hopefully the half pixel ghost doesn't haunt us ever again. Also note to self, to write up wrappers for `gmt blockmedian` and `gmt grdsample` to reduce boilerplate code in xyz_to_grid function.

Reason for doing this was that the more precise selective_tile script (since e7936db) was so exact that it realized our bounding boxes were outside of the (gridline registered) groundtruth grids and gave us NULL values! This problem didn't surface until I was trying to train the neural network and was strangely getting NaN values in the discriminator loss after just one epoch, no matter what hyperparameters I used.
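The half-pixel offset at the heart of this bug can be shown with plain coordinates: for the same origin, spacing, and node count, gridline-registered nodes sit on cell corners while pixel-registered nodes sit on cell centres, half a pixel apart (a toy sketch, numbers hypothetical):

```python
def node_coords(x0, dx, n, pixel_registered):
    """X coordinates of n grid nodes from origin x0 with spacing dx."""
    shift = 0.5 * dx if pixel_registered else 0.0
    return [x0 + shift + i * dx for i in range(n)]


gridline = node_coords(0.0, 250.0, 4, pixel_registered=False)  # [0, 250, 500, 750]
pixel = node_coords(0.0, 250.0, 4, pixel_registered=True)      # [125, 375, 625, 875]
# A bounding box snapped to pixel centres sits half a pixel outside a
# gridline-registered grid's nodes, which is how the NULLs crept in.
```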
@weiji14 weiji14 changed the title Make tiling more efficient using parallelization Slicing and interpolating tiles as accurately as possible Sep 2, 2019
In order to get those border predictions better, we are very carefully reverting some aspects of 75b7493. Specifically, we're going for valid padding in our ESRGAN model's input block convolutions and directly using the MEaSUREs Ice Velocity we've resampled back to 500m in 50ddb5e instead of resampling on the fly. Keeping the 4km kernel/filters though! The amazing part is that the parameter counts all stay the same, +1 for efficiency of convolutions! Also made small adjustments to the hardcoded Topographic Loss Mean Absolute Error function in d599ee8. Test dataset reuploaded, so again we have a new quilt hash to use - e11988479975a091dd52e44b142370c37a03409f41cb6fec54fd7382ee1f99bc.
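Why the parameter counts all stay the same despite switching to valid padding (a back-of-the-envelope check, layer sizes hypothetical): a convolution's weights depend only on kernel size and channel counts, never on padding, which only changes the output's spatial size.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights + biases of a 2D conv layer; padding does not appear."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def conv2d_out_size(in_size, kernel_size, padding):
    """Output spatial size of a stride-1 2D convolution."""
    return in_size + 2 * padding - kernel_size + 1


# A 3x3 conv from 1 to 64 channels has the same parameter count whether
# padding is 'same' or 'valid'; only the output size differs, which is
# why the input tiles were re-padded to shape (3379, 1, 11, 11).
params = conv2d_params(1, 64, 3)        # 640
valid_out = conv2d_out_size(11, 3, 0)   # 9
same_out = conv2d_out_size(11, 3, 1)    # 11
```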
@weiji14 weiji14 force-pushed the refactor/selective_tile branch from 59ea059 to 1e7f699 on September 3, 2019 02:29
@weiji14 weiji14 marked this pull request as ready for review September 3, 2019 02:39
weiji14 added a commit that referenced this pull request Sep 3, 2019
Closes #167 Slicing and interpolating tiles as accurately as possible.
@weiji14 weiji14 merged commit 1e7f699 into enh/revise_deepbedmap Sep 3, 2019
@weiji14 weiji14 deleted the refactor/selective_tile branch September 3, 2019 02:47
@weiji14 weiji14 mentioned this pull request Sep 14, 2019