Here, we document the steps for acquiring and pre-processing raw ERA5 data for cloud optimization. This directory includes configuration files that describe the data and make it straightforward to acquire.
All data can be ingested from Copernicus with google-weather-tools, specifically weather-dl (see weather-tools.readthedocs.io).
Pre-requisites:
- Install the weather tools, version 0.3.1 or later:

  pip install "google-weather-tools>=0.3.1"
- Acquire one or more licenses from Copernicus.
  Recommended: the download configs let you specify multiple API keys in a single data request via "parameter subsections" (see the sketch after this list). We highly recommend that institutions pool licenses together for faster downloads.
- Set up a cloud project with sufficient permissions to use cloud storage (such as GCS) and a Beam runner (such as Dataflow).
  Note: other cloud systems should work too, such as S3 and Elastic MapReduce; however, these are untested. If you experience an error here, please let us know by filing an issue.
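For orientation, here is a minimal sketch of a config's parameters section with two API keys pooled via parameter subsections. The key names follow the weather-dl documentation; the subsection names, bucket path, and placeholder values are illustrative, so keep whatever is already in the config you edit and only swap in your own values:

  [parameters]
  # Destination for the downloaded files (illustrative path; use your own bucket).
  target_path=gs://<your-gcs-bucket>/era5/raw/

  # One subsection per Copernicus license; weather-dl spreads requests across them.
  [parameters.license_a]
  api_url=https://cds.climate.copernicus.eu/api/v2
  api_key=<uid>:<api-key>

  [parameters.license_b]
  api_url=https://cds.climate.copernicus.eu/api/v2
  api_key=<uid>:<api-key>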
Steps:
- Update the parameters section of the desired config file (e.g. raw/era5_ml_dv.cfg) with the appropriate information:
  - First, update the target_path to point to the right cloud bucket.
  - Then, add one or more CDS API keys, as described in the weather-dl documentation (weather-tools.readthedocs.io).
- (Optional, but recommended) Preview the download with a dry run:

  weather-dl raw/era5_ml_dv.cfg --dry-run
- Once the config looks sound, execute the download on your preferred Beam runner, for example, the Apache Spark runner. We ingested data with GCP's Dataflow runner, like so:

  export PROJECT=<your-project-id>
  export BUCKET=<your-gcs-bucket>
  export REGION=us-central1

  weather-dl raw/era5_ml_dv.cfg \
    --runner DataflowRunner \
    --project $PROJECT \
    --region $REGION \
    --temp_location "gs://$BUCKET/tmp/" \
    --disk_size_gb 75 \
    --job_name era5-ml-dv

  If you'd like to download the data locally, you can run the following, though this isn't recommended (the data is large!):

  weather-dl raw/era5_ml_dv.cfg --local-run
  Check out the weather-dl docs for more information.
- Repeat for the rest of the config files (see the sketch below).
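If you want to script this last step, a loop along the following lines could work. This is only a sketch: it assumes all of the download configs live under raw/ and that deriving each Dataflow job name from the config file name is acceptable.

  export PROJECT=<your-project-id>
  export BUCKET=<your-gcs-bucket>
  export REGION=us-central1

  # Launch one download job per config file (adjust paths and flags as needed).
  for config in raw/*.cfg; do
    job_name="$(basename "$config" .cfg | tr '_' '-')"
    weather-dl "$config" \
      --runner DataflowRunner \
      --project $PROJECT \
      --region $REGION \
      --temp_location "gs://$BUCKET/tmp/" \
      --disk_size_gb 75 \
      --job_name "$job_name"
  done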
Grib is an idiosyncratic format. For example, a single grib file can contain multiple level types, standard table versions, or grids. This often makes grib files difficult to open. The system we've employed to convert data to Zarr, Pangeo Forge Recipes, is not (yet) able to handle this complexity. Thus, to prepare the raw data for conversion, we need to perform one additional processing step: splitting grib files by variable. This can be done with google-weather-tools, specifically weather-sp (see weather-tools.readthedocs.io). The only datasets we needed to split by variable are soil and pcp, since they mix levels and table versions. These steps will prepare the data for conversion by scripts in the src/ directory.
Pre-requisites:
- Install the weather tools, version 0.3.0 or later:

  pip install "google-weather-tools>=0.3.0"
- Acquire read access to the datasets (e.g. those downloaded via era5_sfc_soil.cfg) from some cloud storage bucket (a quick spot check is sketched after this list).
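Before launching a large split job, one quick way to confirm read access is to list a few of the raw files. This is only a sanity check; it assumes the Cloud SDK's gsutil is installed and that the data sits at the paths used in the commands below:

  # List a handful of the raw soil grib files (recursive wildcard; may take a moment).
  gsutil ls "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**_hres_soil.grb2" | head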
Steps:
- Preview the data split by running the following command. Make sure to change the file paths if the data locations differ.

  export DATASET=soil

  weather-sp --input-pattern "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_$DATASET.grb2" \
    --output-template "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/{1}/{0}.grb2_{typeOfLevel}_{shortName}.grib" \
    --dry-run
- Execute the data split on your preferred Beam runner. For example, here are the arguments to run the splitter on Dataflow:

  export DATASET=soil
  export PROJECT=<your-project>
  export BUCKET=<your-bucket>
  export REGION=us-central1

  weather-sp --input-pattern "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**/*_hres_$DATASET.grb2" \
    --output-template "gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/{1}/{0}.grb2_{typeOfLevel}_{shortName}.grib" \
    --runner DataflowRunner \
    --project $PROJECT \
    --region $REGION \
    --temp_location gs://$BUCKET/tmp \
    --disk_size_gb 100 \
    --job_name split-soil-data
- Repeat this process, except change the dataset to pcp:

  export DATASET=pcp
This script validates data files in the "gcp-public-data-arco-era5" Google Cloud Storage bucket. It checks that the required files are present for specific years and reports any extra files found in the bucket.
Pre-requisites:
- Python 3.x installed
- Required Python packages installed (e.g., fsspec, pandas)
Configuration:
You can modify the configuration in the script:
- BUCKET: The name of the Google Cloud Storage bucket containing the data.
- START_YEAR and END_YEAR: Define the range of years to validate.
Steps:
- Run the script with Python:

  python raw/gcs_data_consistency_checker.py