A data pipeline that uses the ELT (Extract, Load, Transform) process to fetch, load, and transform San Francisco Police Department incident data, with the aim of generating insights.
New incidents occur daily, so I am trying to gather insights that will help us take better measures. Specifically, I am seeking answers to the following questions:
- What is the frequency of incidents over time?
- Which incidents happen most frequently?
- Which areas are crime hotspots in San Francisco?
- On which day and month do most incidents happen?
- What are the trends of incidents in districts over the years?
- How many active and closed cases are there?
- Infrastructure: Terraform
- Cloud: Google Cloud
- Data lake: Google Cloud Storage
- Data warehouse: BigQuery
- Orchestration: Prefect
- Data transformation: DBT
- Data visualization: Google Looker Studio
Here is a brief explanation of the pipeline:
- Using Google Cloud infrastructure
- Setting it up using Terraform
- Extracting data from the API servers
- Ingesting it into a Google Cloud Storage bucket
- Loading it into BigQuery, with basic checks
- Using DBT to transform the data
- Orchestrating the whole process with Prefect (a minimal sketch of this flow follows the list)
- Finally, using Looker Studio for analytics
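To make the orchestration concrete, here is a minimal sketch of that flow. It is not the repository's actual code: the function names, the Socrata endpoint, and the bucket and table identifiers are assumptions used only to illustrate the extract → GCS → BigQuery → dbt pattern that Prefect orchestrates.

```python
# Minimal sketch of the orchestration pattern, not the repo's actual code.
# The API endpoint, bucket, table ID, and dbt project path are assumptions.
import json
import subprocess

import requests
from google.cloud import bigquery, storage
from prefect import flow, task

API_URL = "https://data.sfgov.org/resource/wg3w-h783.json"  # assumed incidents endpoint
BUCKET = "incident-data-lake"                                # assumed GCS bucket name
TABLE_ID = "my-project.incidents.raw_incidents"              # assumed BigQuery table


@task(retries=2)
def extract(limit: int = 50_000) -> list[dict]:
    """Fetch raw incident records from the open-data API."""
    resp = requests.get(API_URL, params={"$limit": limit}, timeout=60)
    resp.raise_for_status()
    return resp.json()


@task
def load_to_gcs(records: list[dict], blob_name: str) -> str:
    """Stage the records in the data lake as newline-delimited JSON."""
    data = "\n".join(json.dumps(r) for r in records)
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(data)
    return f"gs://{BUCKET}/{blob_name}"


@task
def load_to_bigquery(uri: str) -> None:
    """Load the staged file into BigQuery with basic schema autodetection."""
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    bigquery.Client().load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()


@task
def transform_with_dbt() -> None:
    """Run the dbt models that build the analytics tables (assumed dbt/ directory)."""
    subprocess.run(["dbt", "run"], check=True, cwd="dbt")


@flow
def parent_etl_flow() -> None:
    records = extract()
    uri = load_to_gcs(records, "raw/incidents.json")
    load_to_bigquery(uri)
    transform_with_dbt()


if __name__ == "__main__":
    parent_etl_flow()
```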
| Column | Description |
|---|---|
incident_datetime | The date and time when the incident occurred |
incident_date | The date when the incident occurred |
incident_time | The time when the incident occurred |
incident_year | The year of the incident |
incident_month | The month of the incident (as a number) |
incident_month_name | The name of the month when the incident occurred |
incident_day | The day of the month when the incident occurred |
incident_day_of_week | The day of the week when the incident occurred |
report_datetime | The date and time when the incident was reported |
incident_id | A unique identifier for the incident |
incident_number | A unique identifier assigned to the incident by the reporting system |
report_type_code | A code representing the type of report made |
report_type_description | A description of the type of report made |
incident_code | A code representing the type of incident |
incident_category | The general category of the incident |
incident_subcategory | A more specific category of the incident |
incident_description | A brief description of the incident |
resolution | The outcome or resolution of the incident (e.g. arrest made, no further action) |
police_district | The police district in which the incident occurred |
filed_online | A string indicating whether the incident was reported online |
intersection | The intersection where the incident occurred |
analysis_neighborhood | The neighborhood in which the incident occurred, based on an analysis by the reporting system |
supervisor_district | The district number of the supervisor representing the area where the incident occurred |
supervisor_district_name | The name of the supervisor district where the incident occurred |
supervisor_name | The name of the supervisor representing the area where the incident occurred |
latitude | The latitude coordinate of the location where the incident occurred |
longitude | The longitude coordinate of the location where the incident occurred |
geo_location | A string containing both latitude and longitude coordinates of the location where the incident occurred |
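As an example of how these columns feed the questions above, the sketch below counts incidents per month in BigQuery. Only the column names come from the data dictionary; the `incidents.fact_incidents` table reference is an assumption, since the actual dataset and table names depend on the dbt models.

```python
# Hedged example, not the project's reporting code: the table reference below
# is an assumption; only the column names come from the data dictionary above.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  incident_year,
  incident_month,
  incident_month_name,
  COUNT(DISTINCT incident_id) AS incident_count
FROM `incidents.fact_incidents`
GROUP BY incident_year, incident_month, incident_month_name
ORDER BY incident_year, incident_month
"""

for row in client.query(QUERY).result():
    print(row.incident_year, row.incident_month_name, row.incident_count)
```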
Here is the lineage graph:
- Make sure you have `docker`, `make`, and `git` installed on your PC.
- Clone the repo: `git clone https://github.com/rkscodes/incident_intelligence.git` and `cd incident_intelligence`.
- Create a virtual env with Python 3.9 and install Prefect: `conda create -n incidence python=3.9 prefect`
- Activate your env: `conda activate incidence`
- Refer to Setup Terraform (below) to set up the cloud infra.
- Update the `config.json` file with the same details as in `variables.tf` (a sketch of the kind of values expected follows this list).
- Run the Prefect server: `prefect server start`
- Register blocks and deployments using `make prefect_setup` (a sketch of the kind of blocks this registers also follows this list).
- Set the Prefect API URL: `prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api`
- Run the Docker deployment `infra-docker-storage-docker` using the Prefect UI, or: `prefect deployment run parent-etl-flow/infra-docker-storage-docker`
- To run other deployments, install all dependencies using `make env_setup` (tested on Mac, optional).
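For reference, here is a minimal sketch of how a flow might read `config.json`. It is an illustration only: it assumes the file mirrors the `variables.tf` variables mentioned in this README (`project`, `region`, `zone`, `data_lake_bucket`); the repo's real keys may differ.

```python
# Hedged illustration only, not the repo's actual loader: it assumes config.json
# mirrors the variables.tf variables mentioned in this README; real keys may differ.
import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())

# Fail fast if a value referenced by the flows is missing.
for key in ("project", "region", "zone", "data_lake_bucket"):
    if key not in config:
        raise KeyError(f"config.json is missing '{key}' (see variables.tf)")
    print(f"{key} = {config[key]}")
```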
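And here is a hedged sketch of the kind of Prefect blocks `make prefect_setup` could register: a GCP credentials block and a GCS bucket block via the `prefect-gcp` collection. The block names, key-file path, and bucket name are assumptions; the repo's Makefile defines what is actually registered.

```python
# Hedged sketch: block names, key-file path, and bucket name are assumptions;
# check the Makefile / block scripts for what is really registered.
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Credentials block pointing at the service-account key used by the flows.
creds = GcpCredentials(service_account_file="path/to/service-account.json")
creds.save("gcp-creds", overwrite=True)

# Bucket block for the data lake created by Terraform (data_lake_bucket).
GcsBucket(bucket="incident-data-lake", gcp_credentials=creds).save(
    "incident-bucket", overwrite=True
)
```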
Setup Terraform:
- Install Terraform on your PC.
- Clone this repo (if not already): `git clone https://github.com/rkscodes/incident_intelligence.git` and `cd incident_intelligence/terraform`.
- Initialise Terraform with: `terraform init`
- Refer to Setup GCP and follow all steps.
- Update the `project` var in `variables.tf` with your GCP project ID.
- Authenticate the service account first: `export GOOGLE_APPLICATION_CREDENTIALS={{path_to_application_credential}}` then `gcloud auth activate-service-account --key-file ${GOOGLE_APPLICATION_CREDENTIALS}`
- Generate an SSH key using: `ssh-keygen -t rsa -f ~/.ssh/<name> -C <username> -b 2048`
- Update the `gce_ssh_user` and `gce_ssh_pub_key_file` variables in `variables.tf` with the generated public key's username and file path.
- Optionally, modify `region`, `zone`, and `data_lake_bucket` if required.
- Make sure you have exported: `export GOOGLE_APPLICATION_CREDENTIALS={{path_to_application_credential}}`
- Run `terraform validate`, `terraform plan`, and `terraform apply`.
- Your infra should be up and running.
- One caveat: if you use `terraform destroy`, the VPC network is sometimes not destroyed, so destroy `my-network` manually in the Google Cloud UI.
- Write unit tests
I would like to express my sincere gratitude to the team behind the Data Engineering Zoomcamp course for providing me with the opportunity to enhance my skills in this field. The course has been an enriching and insightful experience, and I have gained a deeper understanding of the concepts and practices related to data engineering.
Additionally, I want to express my appreciation for the vibrant community on the course Slack channel. The discussions and interactions with fellow students have been an excellent source of support, and I have learned a great deal from their insights and experiences.