Project: Data Engineering

This is the monorepo containing all necessary components to run the project for the data engineering course.

Service Descriptions

Instagram API Mock Service: This service continuously pushes data into the Kafka topics to simulate the process of scraping data from the Instagram API. Additionally, the service creates mock pictures by calling thispersondoesnotexist.com and hosts them under the path /images.
Storage Service: This service subscribes to the Kafka topic instagram-profiles and creates new entries in the PostgreSQL database whenever new messages arrive. Any new media entries it finds (e.g. posts) trigger a download of the associated image, which is then uploaded to the MinIO instance.
Data Access Service: This REST service can be called at /api/all to get a basic aggregation of the insight data of all currently tracked profiles.
Visualization Service: This service hosts a Node.js-based frontend that visualizes some basic statistics (number of likes, reach, saved, comments) over the lifetime of all media.
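
As a rough illustration of how a client might query the Data Access Service, the sketch below builds the /api/all URL and fetches it. It rests on assumptions that this README does not confirm: that the service port (3001, the DATA_ACCESS_SERVICE_PORT default) is published on localhost, and that the response body is JSON. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def aggregation_url(host: str = "localhost", port: int = 3001) -> str:
    """Build the URL of the /api/all endpoint.

    Assumption: the host and port defaults match a local setup in which
    the Data Access Service port is published to the host machine.
    """
    return f"http://{host}:{port}/api/all"


def fetch_aggregation(host: str = "localhost", port: int = 3001, timeout: float = 5.0):
    """Fetch the aggregation and decode it.

    Assumption: the service answers with a JSON body.
    """
    with urlopen(aggregation_url(host, port), timeout=timeout) as response:
        return json.load(response)
```

Once the environment is running, `fetch_aggregation()` could be called from a Python shell to inspect the data. Inside the docker network, the host would be data-access-service rather than localhost.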

Configuration

All configuration is done through environment variables in the accompanying docker compose file.
Every variable listed below must be set in that file, or the project won't start.

Variable	Description	Default
S3_EXTERNAL_ENDPOINT	The external endpoint under which the S3-based storage can be accessed	localhost:10002
S3_ENDPOINT	The docker-network internal endpoint to call the S3-storage	minio:10002
S3_ACCESS_KEY_ID	Access key for accessing the minio storage/ui as a root user	admin
S3_SECRET_ACCESS_KEY	The password for accessing the minio storage/ui as a root user	admin123
POSTGRES_USER	Default postgres database user	postgres
POSTGRES_PASSWORD	Default password for the POSTGRES_USER	test
POSTGRES_DB	Default database to create/use in postgres	postgres
POSTGRES_HOST	Host under which postgres is reachable	postgres
POSTGRES_PORT	Port used to communicate with postgres	5432
CLICKHOUSE_ENDPOINT	Endpoint used by the internal clients to communicate with the clickhouse instance	clickhouse:9000
DATA_ACCESS_SERVICE_HOST	Host called from within the visualization service to receive data	data-access-service
DATA_ACCESS_SERVICE_PORT	Port called from within the visualization service to receive data	3001
KAFKA_BOOTSTRAP_SERVERS	Location of kafka servers	kafka:9092
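
Because a missing variable prevents the project from starting, it can help to check the environment before launching. The following is a minimal sketch of such a check, not something shipped with the project. The variable names come from the table above; the helper name and the choice to treat empty values as missing are assumptions.

```python
import os

# Variable names taken from the configuration table above.
REQUIRED_VARIABLES = (
    "S3_EXTERNAL_ENDPOINT",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "CLICKHOUSE_ENDPOINT",
    "DATA_ACCESS_SERVICE_HOST",
    "DATA_ACCESS_SERVICE_PORT",
    "KAFKA_BOOTSTRAP_SERVERS",
)


def missing_variables(env=None):
    """Return the required variables that are unset or empty.

    Assumption: an empty string counts as "not set".
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```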

The Instagram API Mock Service also offers a few variables to configure the data creation.

Variable	Description	Default
NUM_PROFILES	Number of profiles to simulate	10
NUM_PICTURES_PER_PROFILE	Number of pictures per profile to simulate	10
INSIGHT_UPDATE_FREQ_MS	Time in milliseconds between updates for insights of a media post	30
PROFILE_UPDATE_FREQ_MS	Time in milliseconds between updates for a singular profile	50

Running the Project

There are currently two ways to start up the project:

Use the included reset.sh script. This script deletes any leftovers from previous runs, rebuilds all containers, and starts the environment.

./reset.sh

Use docker compose directly to build and start the application.

docker compose build
docker compose up -d

The project can be configured by adjusting the variables in the docker compose file.
Depending on the variables, the required startup time can vary. The Instagram API Mock Service in particular takes longer to start, as it waits 500 ms between fetching the individual images from thispersondoesnotexist.com to guarantee that unique images are fetched.
Therefore, this service usually needs about (NUM_PROFILES * NUM_PICTURES_PER_PROFILE / 2) seconds to start. With the defaults of 10 profiles and 10 pictures per profile, that is roughly 50 seconds.
After starting the project, the data can be visualized by opening http://localhost:5473 in a browser.
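
The startup estimate above can be written as a small calculation. This is only a sketch of the arithmetic: it assumes the 500 ms delay between image fetches dominates the startup time and ignores download and build time. The function name is hypothetical.

```python
# Delay between two image fetches, as stated in the text (500 ms).
IMAGE_FETCH_DELAY_SECONDS = 0.5


def estimated_startup_seconds(num_profiles: int = 10, num_pictures_per_profile: int = 10) -> float:
    """Estimate the mock service startup time in seconds.

    One image is fetched per picture of every profile, and each fetch
    is followed by the fixed delay.
    """
    return num_profiles * num_pictures_per_profile * IMAGE_FETCH_DELAY_SECONDS


if __name__ == "__main__":
    # Defaults: NUM_PROFILES=10, NUM_PICTURES_PER_PROFILE=10
    print(estimated_startup_seconds())  # prints 50.0
```

Raising either variable scales the wait linearly, so 20 profiles with 15 pictures each would take about 150 seconds.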
