Reformat weather datasets into zarr.
To integrate a new dataset to be reformatted, see the dataset integration guide.
We use:

- `uv` to manage dependencies and Python environments
- `ruff` for linting and formatting
- `mypy` for type checking
- `pytest` for testing
- `pre-commit` to automatically lint and format as you `git commit`
- Install `uv`
- Run `uv run pre-commit install` to set up the git hooks
- If you use VSCode, you may want to install the extensions (ruff, mypy) it will recommend when you open this folder
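To confirm the hooks are set up, you can run them once against every file (a standard `pre-commit` invocation, nothing project-specific):

```bash
# Run all configured pre-commit hooks across the whole repo, without committing.
uv run pre-commit run --all-files
```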
Run a reformatter locally:

```bash
uv run main --help
uv run main <DATASET_ID> update-template
uv run main <DATASET_ID> backfill-local <INIT_TIME_END>
```
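For example, a local run might look like the following (the dataset ID and end time are illustrative; `uv run main --help` lists the datasets your checkout actually supports):

```bash
# Hypothetical dataset ID and init time end -- substitute real values.
uv run main noaa-gefs-forecast update-template
uv run main noaa-gefs-forecast backfill-local 2024-01-01T00:00
```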
- Add a dependency: `uv add <package> [--dev]`. Use `--dev` to add a development-only dependency (example invocations follow this list).
- Lint: `uv run ruff check`
- Type check: `uv run mypy`
- Format: `uv run ruff format`
- Tests:
  - Run tests in parallel on all available cores: `uv run pytest`
  - Run tests serially: `uv run pytest -n 0`
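A few illustrative invocations (the package names and test path are hypothetical):

```bash
# Add a runtime dependency and a development-only dependency (hypothetical packages).
uv add xarray
uv add --dev types-requests

# Run one test file serially while debugging (hypothetical path).
uv run pytest -n 0 tests/example_test.py
```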
To reformat a large archive we parallelize work across multiple cloud servers.
We use:

- `docker` to package the code and dependencies
- `kubernetes` indexed jobs to run work in parallel
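For background, each pod in a Kubernetes indexed job is assigned a distinct completion index via the `JOB_COMPLETION_INDEX` environment variable, which is what lets many identical pods split the backfill between them. A minimal sketch of reading it (not this repo's actual entrypoint):

```bash
# Kubernetes sets JOB_COMPLETION_INDEX (0, 1, 2, ...) in each pod of an
# indexed job; a worker can use it to select its shard of the work.
echo "processing shard ${JOB_COMPLETION_INDEX:?not running in an indexed job}"
```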
- Install `docker` and `kubectl`. Make sure `docker` can be found at /usr/bin/docker and `kubectl` at /usr/bin/kubectl.
- Set up a docker image repository and export the `DOCKER_REPOSITORY` environment variable in your local shell, e.g. `export DOCKER_REPOSITORY=us-central1-docker.pkg.dev/<project-id>/reformatters/main`
- Set up a kubernetes cluster and configure kubectl to point to your cluster, e.g. `gcloud container clusters get-credentials <cluster-name> --region <region> --project <project>`
- Create a kubectl secret containing your Source Coop S3 credentials: `kubectl create secret generic source-coop-storage-options-key --from-literal=contents='{"key": "...", "secret": "..."}'`
- Kick off the backfill: `DYNAMICAL_ENV=prod uv run main <DATASET_ID> backfill-kubernetes <INIT_TIME_END> <JOBS_PER_POD> <MAX_PARALLELISM>` (see the example after this list)
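For example, a launch might look like this (every argument value below is illustrative):

```bash
# Hypothetical dataset ID, end time, jobs per pod, and max parallelism.
DYNAMICAL_ENV=prod uv run main noaa-gefs-forecast backfill-kubernetes 2024-01-01T00:00 10 50
```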