
# GCP Challenge - John Aven

## About

The purpose of the challenge was to do the following:

  1. Get data from a cryptocurrency API
  2. Put the data in a CSV
  3. Load the CSV into a Google bucket
  4. Load the CSV from the bucket into BigQuery
  5. Perform queries on BigQuery via DataLab
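For reference, here is a minimal end-to-end sketch of those five steps. The API URL, bucket, dataset, and table names are placeholders rather than the values used in this repo; the actual implementation lives in the source code.

```python
import requests
import pandas as pd
from google.cloud import storage, bigquery

# Hypothetical names -- the real endpoint, bucket, and table ids come from the repo's config.
API_URL = "https://api.example-crypto.com/v1/tickers"
BUCKET = "my-crypto-bucket"
TABLE_ID = "my-project.crypto.tickers"

# 1. Get data from the cryptocurrency API
rows = requests.get(API_URL, timeout=30).json()

# 2. Put the data in a CSV
df = pd.DataFrame(rows)
df.to_csv("tickers.csv", index=False)

# 3. Load the CSV into a Google Cloud Storage bucket
storage.Client().bucket(BUCKET).blob("tickers.csv").upload_from_filename("tickers.csv")

# 4. Load the CSV from the bucket into BigQuery
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True
)
bq.load_table_from_uri(f"gs://{BUCKET}/tickers.csv", TABLE_ID, job_config=job_config).result()

# 5. Query BigQuery (normally done from a DataLab notebook)
print(bq.query(f"SELECT COUNT(*) AS n FROM `{TABLE_ID}`").to_dataframe())
```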

...

## Solution

The solution consists of two parts: the front-end (queries) and the back-end (everything else).

### Front-End

I will discuss the front-end first. In the front-end, the aim is to provide answers to data-insight requests. These requests are pre-canned and can be expressed as SQL queries against BigQuery, which is straightforward to do from the notebooks in DataLab.
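For example, a pre-canned insight such as "top currencies by market cap" reduces to a single query. A hedged sketch of running it from Python in a notebook is below; the dataset, table, and column names are assumptions, not the actual schema produced by the loader.

```python
from google.cloud import bigquery

# Assumed table and column names -- the real ones come from the loaded CSV schema.
SQL = """
SELECT name, symbol, market_cap_usd
FROM `my-project.crypto.tickers`
ORDER BY market_cap_usd DESC
LIMIT 10
"""

client = bigquery.Client()
top10 = client.query(SQL).to_dataframe()  # pandas DataFrame, ready for plotting in the notebook
print(top10)
```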

### Back-End

The back-end is more involved (see the code), but can be expressed simply. First, the API has robot rules: no more than 30 queries per minute. To obey the rules and be extra friendly, queries are issued at roughly 15 per minute. Running slower than the allowed maximum is acceptable; it avoids any risk of violating the restrictions and getting blacklisted. In production the rate could be tuned by looking at the average round-trip time and its expected deviations.

Because the nature of the data changes over time, the code should keep the data evergreen. It was therefore written as a service (ideally hosted in a Docker container or similar) scheduled to update roughly once every 5-8 minutes. Configuration is not hard-coded but stored externally, in this case as a YAML file in a Google bucket, which required writing a stream reader around the Google bucket API.

Each of the tables (Listing, Global and Tickers) is queried separately (Tickers iteratively), and the results are combined into a single, joined DataFrame. The global data is only pulled once; for production I would pull it periodically and update the total number of currencies. Errors are logged to JSON files (these still need some formatting), as are event notices and errors from the API queries.
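Two of those pieces can be sketched briefly: pacing the API calls below the 30-per-minute robot limit, and pulling the YAML configuration out of a Google bucket instead of hard-coding it. The bucket, blob, and config key names here are placeholders, not the repo's actual values.

```python
import time
import yaml
from google.cloud import storage

QUERIES_PER_MINUTE = 15  # well under the API's 30-per-minute robot rule

def paced(iterable, per_minute=QUERIES_PER_MINUTE):
    """Yield items while sleeping enough between them to stay under the rate limit."""
    delay = 60.0 / per_minute
    for item in iterable:
        yield item
        time.sleep(delay)

def load_config(bucket_name="my-crypto-bucket", blob_name="config.yaml"):
    """Read the service configuration straight out of a GCS bucket."""
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return yaml.safe_load(blob.download_as_text())

# Usage sketch: page through the Tickers endpoint without tripping the limit.
# config = load_config()
# for page in paced(range(config["num_ticker_pages"])):
#     ...  # query the Tickers endpoint for this page and append to the DataFrame
```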

To simplify the workflow and allow all of it to run within the notebook without any fancy threading work, I modified the code to have a forced interrupt after the first update to the data.
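A rough sketch of that run-once behaviour (the function and parameter names are illustrative, not the actual ones in the repo):

```python
import time

def run_service(update_fn, interval_seconds=300, run_once=False):
    """Polling loop for the back-end service; run_once=True mimics the notebook mode."""
    while True:
        update_fn()              # pull the API, rebuild the joined DataFrame, write the CSV
        if run_once:             # the "forced interrupt" after the first update
            break
        time.sleep(interval_seconds)  # roughly the 5-8 minute refresh cadence
```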

## Future Work

If I were to change anything, I would remove the CSV write to the Google bucket (or keep it in addition) and use the API to write directly to BigQuery, keeping the data truly evergreen. In fact, this could be done on each pull, without staging the data in the internal in-memory DB (the pandas DataFrame).
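A hedged sketch of that direct path, loading the in-memory DataFrame into BigQuery without the CSV detour (the table name is a placeholder, and load_table_from_dataframe requires pyarrow to be installed):

```python
from google.cloud import bigquery

def push_to_bigquery(df, table_id="my-project.crypto.tickers"):
    """Write the pulled data straight to BigQuery instead of staging a CSV in the bucket."""
    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, table_id)  # serializes the DataFrame via pyarrow
    job.result()  # block until the load job finishes
```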

Cheers

## Util

All code which is not in the runtime will
