split out docs into development and production instructions
ohnorobo committed May 3, 2021
1 parent 7718e3f commit ed9e0af
Showing 3 changed files with 93 additions and 88 deletions.
89 changes: 1 addition & 88 deletions README.md
@@ -15,91 +15,4 @@ pipeline works.

![System Diagram](system-diagram.svg)

## Running as an Automated Pipeline

There are two main top-level pieces:

`python -m mirror.data_transfer`

This sets up a daily data transfer job to copy scan files from the Censored
Planet cloud bucket to an internal bucket.

`python -m schedule_pipeline`

This does some additional daily data processing and schedules a daily
incremental Apache Beam pipeline over the data. It expects to be run via a
Docker container on a GCE machine.

`./deploy.sh prod`

This will deploy the main pipeline loop to a GCE machine.

## Running Manually

Individual pieces of the pipeline can be run manually.

### Mirroring Data

These scripts pull in a large amount of data from external data sources and
mirror it in the correct locations in Google Cloud buckets.

`python -m mirror.untar_files.sync_files`

Decompresses any Censored Planet scan files which have been transferred into
the project but are still compressed. This can also be used as a backfill tool
to decompress missing files.

`python -m mirror.routeviews.sync_routeviews`

Transfers in the latest missing CAIDA routeview files. (This can only mirror
data from the last 30 days.)

`python -m mirror.routeviews.bulk_download`

Transfers in all CAIDA routeview files from a given date. This is used for
backfilling data older than 30 days.

In all cases, to fix missing or incorrect data, simply delete the incorrect
data from the Google Cloud bucket and re-run the appropriate script to
re-mirror it.

`python -m mirror.internal.sync`

Downloads the most recent version of Censored Planet resources locally.

### Processing Data

`python -m pipeline.run_beam_tables --env=prod --full`

Runs the full Apache Beam pipeline. This will re-process all data and rebuild
existing base tables.

`python -m table.run_queries`

Runs queries to recreate any tables derived from the base tables.

## Testing

All tests and lint checks will be run on pull requests using
[GitHub Actions](https://github.com/Jigsaw-Code/censoredplanet-analysis/actions).

To run all tests, run

`python -m unittest`

There are a few end-to-end tests which aren't run by the unittest framework
because they require cloud resource access. To run these tests manually, use
the command

`python -m unittest pipeline.manual_e2e_test.PipelineManualE2eTest`

To typecheck all files, install `mypy` and run

`mypy **/*.py --namespace-packages --explicit-package-bases`

To format all files, install `yapf` and run

`yapf --in-place --recursive .`

To get all lint errors, install `pylint` and run

`python -m pylint **/*.py --rcfile=setup.cfg`
To contribute to this pipeline, see the [development documentation](docs/development.md).
72 changes: 72 additions & 0 deletions docs/development.md
@@ -0,0 +1,72 @@
# Development

## Testing Changes

All tests and lint checks will be run on pull requests using
[GitHub Actions](https://github.com/Jigsaw-Code/censoredplanet-analysis/actions).

To run all tests, run

`python -m unittest`
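
To run a single test module instead of the whole suite, pass its dotted path
to unittest. A quick sketch (the module name here is only an illustration;
substitute any test module in the repo):

```sh
# Hypothetical module name -- replace with a real test module.
python -m unittest pipeline.run_beam_tables_test
```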

There are a few end-to-end tests which aren't run by the unittest framework
because they require cloud resource access. To run these tests manually, use
the command

`python -m unittest pipeline.manual_e2e_test.PipelineManualE2eTest`

To typecheck all files, install `mypy` and run

`mypy **/*.py --namespace-packages --explicit-package-bases`

To format all files, install `yapf` and run

`yapf --in-place --recursive .`

To get all lint errors, install `pylint` and run

`python -m pylint **/*.py --rcfile=setup.cfg`

## Running Manually

Individual pieces of the pipeline can be run manually.

### Mirroring Data

These scripts pull in a large amount of data from external data sources and
mirror it in the correct locations in Google Cloud buckets.

`python -m mirror.untar_files.sync_files`

Decompresses any Censored Planet scan files which have been transferred into
the project but are still compressed. This can also be used as a backfill tool
to decompress missing files.

`python -m mirror.routeviews.sync_routeviews`

Transfers in the latest missing CAIDA routeview files. (This can only mirror
data from the last 30 days.)

`python -m mirror.routeviews.bulk_download`

Transfers in all CAIDA routeview files from a given date. This is used for
backfilling data older than 30 days.

In all cases, to fix missing or incorrect data, simply delete the incorrect
data from the Google Cloud bucket and re-run the appropriate script to
re-mirror it.
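
For example, a re-mirroring session might look like the sketch below. Both the
bucket and the file path are hypothetical; substitute the real object you need
to fix:

```sh
# Hypothetical bucket/object path -- replace with the actual bad data.
gsutil rm gs://example-internal-bucket/scans/2021-04-01/results.json

# Re-run the matching mirror script to restore the deleted data.
python -m mirror.untar_files.sync_files
```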

`python -m mirror.internal.sync`

Downloads the most recent version of Censored Planet resources locally.

### Processing Data

`python -m pipeline.run_beam_tables --env=prod --full`

Runs the full Apache Beam pipeline. This will re-process all data and rebuild
existing base tables.
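
The daily scheduled job presumably runs the same pipeline incrementally. A
hedged sketch of a manual incremental run, assuming that simply omitting
`--full` selects incremental mode:

```sh
# Assumption: without --full the pipeline only processes new data
# instead of rebuilding all existing base tables.
python -m pipeline.run_beam_tables --env=prod
```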

`python -m table.run_queries`

Runs queries to recreate any tables derived from the base tables.

20 changes: 20 additions & 0 deletions docs/production.md
@@ -0,0 +1,20 @@
# Production

## Running an Automated Pipeline

There are two main top-level pieces:

`python -m mirror.data_transfer`

This sets up a daily data transfer job to copy scan files from the Censored
Planet cloud bucket to an internal bucket.
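
As a rough illustration, the job is conceptually a daily, recursive
bucket-to-bucket copy like the one below. Both bucket names are hypothetical:

```sh
# Hypothetical bucket names -- the real job copies from the Censored
# Planet bucket into a project-internal bucket once a day.
gsutil -m rsync -r gs://censoredplanet-source-bucket gs://example-internal-bucket
```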

`python -m schedule_pipeline`

This does some additional daily data processing and schedules a daily
incremental Apache Beam pipeline over the data. It expects to be run via a
Docker container on a GCE machine.
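
A minimal sketch of running it in a container, assuming a Dockerfile at the
repo root (the image tag is illustrative):

```sh
# Build the pipeline image from the repo's Dockerfile (tag is hypothetical).
docker build -t schedule-pipeline .

# Run the scheduler loop detached, as it would run on a GCE machine.
docker run -d schedule-pipeline
```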

`./deploy.sh prod`

This will deploy the main pipeline loop to a GCE machine.
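
The details live in `deploy.sh`, but a deploy of this shape often boils down
to updating the container image on the target VM, roughly like the command
below (instance, project, and image names are all hypothetical):

```sh
# Hypothetical names -- see deploy.sh for the real instance and image.
gcloud compute instances update-container pipeline-instance \
  --container-image gcr.io/example-project/schedule-pipeline:latest
```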
