A pipeline that ingests data from an MSSQL source, lands it in Google Cloud Storage, and finally loads it into BigQuery.
- Incremental Load
- Automatic conversion of data types
- `google-cloud-bigquery`
- `google-cloud-storage`
- `pandas`
- `pyodbc`

I'd also recommend installing `gsutil` for authentication.
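As a minimal setup sketch: the package names come from the list above, and the auth command assumes you use application-default credentials, which is one common approach rather than this project's required one.

```sh
# Install the Python dependencies listed above
pip install google-cloud-bigquery google-cloud-storage pandas pyodbc

# gsutil ships with the Google Cloud SDK; once it's installed, authenticate so
# the client libraries can reach your project (assumed auth flow, adjust as needed)
gcloud auth application-default login
```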
Edit the `config.yml` file to provide your own environment variables as needed.
- `gcp_project` - Google Cloud Platform project name
- `gcp_dataset` - BigQuery dataset (destination)
- `gcs_bucket` - Google Cloud Storage bucket
- `gcs_prefix` - path in the storage bucket where the project will create its root
- `mssql_hostname` - SQL Server hostname
- `mssql_username` - login user (optional)
- `mssql_password` - login password (optional)
You can also skip usage of this file and set the variables independently.
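For illustration, a filled-in `config.yml` might look like the sketch below. The keys mirror the list above; every value is a placeholder for your own environment:

```yaml
gcp_project: my-gcp-project          # Google Cloud Platform project name
gcp_dataset: my_dataset              # destination BigQuery dataset
gcs_bucket: my-bucket                # Google Cloud Storage bucket
gcs_prefix: mssql/ingest             # root path the project creates in the bucket
mssql_hostname: sqlserver.example.com
mssql_username: loader               # optional
mssql_password: changeme             # optional
```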
For each extract, you must define parameters that are passed to the `MSSQLtoBigQuery` class when it is initialized (see the usage sketch after the parameter lists):
Required
- `ingestion_type` - one of:
  - `full_replace`
  - `append` - uses a column (UUID or created date) to ingest only new records
  - `incremental` - similar to `append`, but ingests new AND changed records
- `database`
- `schema`
- `mssql_object_name` - the view or table you want to process
Optional
- `bq_table_name` - destination table name, if different than the source
- `chunks` - rows per file iteration
- `keys` - str or list of str; fields that represent a unique constraint
- `sql_query` - path to a .sql file used to define a custom query
- `last_val_column` - column used to limit records for append/incremental loads
- `last_val_query` - path to a .sql file used to define a custom last value
- `last_val_default` - fallback value used if the table does not already exist in the destination
- `incremental_keys` - list of columns to match when comparing staging to existing records
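Putting the pieces together, initialization might look like the sketch below. The parameter names follow the lists above, but the import path, the value choices, and the `run()` entry point are assumptions for illustration, not the project's confirmed API:

```python
# Hypothetical usage sketch -- adjust the import to this project's actual layout.
from mssql_to_bigquery import MSSQLtoBigQuery

# Incremental load of dbo.orders, using modified_at to find new/changed rows
# and order_id to match staging records against existing ones.
extract = MSSQLtoBigQuery(
    ingestion_type="incremental",      # full_replace | append | incremental
    database="sales",
    schema="dbo",
    mssql_object_name="orders",
    bq_table_name="orders_raw",        # optional: differs from the source name
    chunks=50_000,                     # optional: rows per file iteration
    keys="order_id",                   # optional: unique-constraint field(s)
    last_val_column="modified_at",     # optional: limits records pulled per run
    last_val_default="1900-01-01",     # optional: fallback for a first-time load
    incremental_keys=["order_id"],     # optional: staging-to-existing match columns
)
extract.run()  # assumed entry point; the real method name may differ
```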