This repository was archived by the owner on Mar 1, 2021. It is now read-only.

Commit 563e97c

Merge branch 'master' into 173-create-api-for-face-search

2 parents: 6a629c3 + fc0002d

72 files changed: +895 -474 lines. Large commits have some content hidden by default, so only a subset of the changed files is shown below.


.dockerignore (+1 -1)

@@ -4,7 +4,7 @@
 
 # git stuff
 .gitignore
-README.md
+*.md
 
 # DOCKER stuff
 docker-compose.yml

Makefile (+18 -2)

@@ -1,3 +1,4 @@
+# API
 gen-server:
 	protoc --go_out=plugins=grpc:. api/proto/usersearch.proto
 
@@ -9,8 +10,23 @@ gen-client:
 gen-faces:
 	protoc --go_out=plugins=grpc:. faces/proto/recognizer.proto
 
-run:
-	docker-compose up -d my-kafka postgres connect
+# INSTAGRAM
+
+run-instagram:
+	docker-compose up -d zookeeper my-kafka postgres connect minio neo4j
+	docker-compose up -d --build es-with-plugin
 	sleep 5
+	docker-compose up --build migrate-postgres
 	docker-compose up -d --build
 	docker-compose logs -f
+
+
+# TWITTER
+
+TWITTER_COMPOSE_FILE:=twitter-compose.yml
+
+run-twitter:
+	docker-compose -f $(TWITTER_COMPOSE_FILE) up -d my-kafka postgres connect
+	sleep 5
+	docker-compose -f $(TWITTER_COMPOSE_FILE) up -d --build
+	docker-compose -f $(TWITTER_COMPOSE_FILE) logs -f

README.md (+60 -48)

@@ -1,73 +1,85 @@
-# SMAG - mvp
-> social media graph abusal
+# Social Record
+
+> Distributed scraping and analysis pipeline for a range of social media platforms
+
+**Table of content**
 
 - [About](#about)
 - [Architectural overview](#architectural-overview)
-- [Api](#api)
-- [Postgres DB](#postgres-db)
-- [Requirements](#requirements)
+- [Further reading](#further-reading)
+- [Detailed documentation](#detailed-documentation)
+- [Wanna contribute?](#wanna-contribute)
+- [List of contributors](#list-of-contributors)
+- [Deployment](#deployment)
 - [Getting started](#getting-started)
-- [scraper in docker](#scraper-in-docker)
-- [scraper locally](#scraper-locally)
-- [Postgres change stream](#postgres-change-stream)
+- [Requirements](#requirements)
+- [Preparation](#preparation)
+- [Scraper](#scraper)
 
 ## About
-The goal of this project is to raise awareness about data privacy. The mean to do so is a tool to scrape, combine and analyze public social media data.
+
+The goal of this project is to raise awareness about data privacy. The mean to do so is a tool to scrape, combine and analyze public data from multiple social media sources. <br>
 The results will be available via an API, used for some kind of art exhibition.
 
 ## Architectural overview
-You can find a overview about our architecture on this [miro board](https://miro.com/app/board/o9J_kw7a-qM=/)
 
-### Api
-see details [here](api/README.md)
+![](docs/architecture.png)
 
-### Postgres DB
-see details [here](db/README.md)
+You can find an more detailed overview [here](https://drive.google.com/a/code.berlin/file/d/1uE8oTku322-_eN3QGuiM4ayWZiRXfn9F/view?usp=sharing). <br>
+Open it in draw.io and have a look at the different tabs "High level overview", "Distributed Scraper" and "Face Search".
 
-## Requirements
+## Further reading
 
-- go 1.13 _(or go 1.11+ with the env var `GO111MODULEs=on`)_
-- `docker` and `docker-compose` are available and up-to-date
+### Detailed documentation
 
-## Getting started
+| part | docs | contact |
+| :---------- | :----------------------------------------- | :----------------------------------------------- |
+| Api | [`api/README.md`](api/README.md) | [@jo-fr](https://github.com/jo-fr) |
+| Frontend | [`frontend/README.md`](frontend/README.md) | [@lukas-menzel](https://github.com/lukas-menzel) |
+| Postgres DB | [`db/README.md`](db/README.md) | [@alexmorten](https://github.com/alexmorten) |
 
-If this is your first time running this:
+### Wanna contribute?
 
-1. Add `127.0.0.1 my-kafka` and `127.0.0.1 minio` to your `/etc/hosts` file
-2. Choose a user_name as a starting point and run `go run cli/main/main.go <instagram|twitter> <user_name>`
-
-As alternative, you can also add the cli to the docker-compose:
-
-```yaml
-cli:
-  build:
-    context: "."
-    dockerfile: "cli/Dockerfile"
-  command: ["<instagram|twitter>", "<user_name>"]
-  depends_on:
-    - "my-kafka"
-  environment:
-    KAFKA_ADDRESS: "my-kafka:9092"
-```
+If you want to join us raising awareness for data privacy have a look into [`CONTRIBUTING.md`](CONTRIBUTING.md)
 
-### scraper in docker
+### List of contributors
 
-```bash
-$ make run
-```
+- @1Jo1 Josef Grieb
+- @Urhengulas Johann Hemmann
+- @alexmorten Alexander Martin
+- @jo-fr Jonathan Freiberger
+- @m-lukas Lukas Müller
+- @lukas-menzel Lukas Menzel
+- @SpringHawk Martin Zaubitzer
 
-### scraper locally
+### Deployment
 
-Have a look into [`docker-compose.yml`](docker-compose.yml), set the neccessary environment variables and run it with the command from the regarding dockerfile.
+The deployment of this project to kubernetes happens in [codeuniversity/smag-deploy](https://github.com/codeuniversity/smag-deploy) _(this is a private repo!)_
 
-## Postgres change stream
+## Getting started
+
+### Requirements
 
-The debezium connector generates a change stream from all the changes in postgres
+| depency | version |
+| :----------------------------------------------------------- | :----------------------------------------------------------------- |
+| [`go`](https://golang.org/doc/install) | `v1.13` _([go modules](https://blog.golang.org/using-go-modules))_ |
+| [`docker`](https://docs.docker.com/install/) | `v19.x` |
+| [`docker-compose`](https://docs.docker.com/compose/install/) | `v1.24.x` |
 
-To read from this stream you can
+### Preparation
 
-- get [kt](https://github.com/fgeller/kt)
-- inspect the topic list in kafka `kt topic`, all topic starting with `postgres` are streams from individual tables
-- consume a topic with, for example `kt consume --topic postgres.public.users`
+If this is your first time running this:
+
+1. Add `127.0.0.1 my-kafka` and `127.0.0.1 minio` to your `/etc/hosts` file
+2. Choose a `<user_name>` for your platform of choice `<instagram|twitter>` as a starting point and run
+```bash
+$ go run cli/main/main.go <instagram|twitter> <user_name>
+```
 
-The messages are quite verbose, since they include their own schema description. The most interesting part is the `value.payload` -> `kt consume --topic postgres.public.users | jq '.value | fromjson | .payload'`
+### Scraper
+
+Run the instagram- or twitter-scraper in docker:
+
+```bash
+$ make run-<platform_name>
+```

api/README.md (+40 -32)

@@ -1,39 +1,47 @@
-# gRPC API
+# SMAG gRPC Web API
 
-- [Usage](#usage)
-- [Functions](#functions)
-- [Testing](#testing)
+## About
 
-## Usage
-- Make sure to `npm install google-protobuf grpc-web`
-- Then import the auto-generated proto files
-```javascript
-import {User, UserName, UserSearchResponse} from "./proto/client/usersearch_pb.js";
-import {UserSearchServiceClient} from "./proto/client/usersearch_grpc_web_pb.js";
+In our project we are using a [gRPC Web](https://grpc.io/docs/) API. For that we are using an [envoy proxy](https://www.envoyproxy.io/docs/envoy/latest/) to be able to connect to the gRPC Server. As our system is not publicly accessible an AWS Account in our Organisation with the appropriate access is required.
+
+## Requirements
 
-var userSearch = new UserSearchServiceClient('http://localhost:8080');
+In order to successfully use our api make sure to have:
 
-var request = new UserName();
-request.setUserName("codeuniversity");
-
-userSearch.getUserWithUsername(request, {},function(err, response) {
-  //...
-});
-- The default address for the database is `localhost`.
-  If you want to change that simply add the enviroment variable `GRPC_POSTGRES_HOST` to the `grpc-server`container
+- a running [kubernetes setup](https://github.com/codeuniversity/smag-deploy/blob/master/README.md) (permssion required)
+- _optional for local testing_: [protoc](http://google.github.io/proto-lens/installing-protoc.html) to generate the protofiles for the frontend
+
+## Usage
 
+To use the production enviroment do the following steps:
+
+1. Get name of envoy proxy `kubectl get pods | grep envoy`
+2. Forward the envoy-pod port with `kubectl port-forward envoy-proxy-deployment-6b89675d5b-d86c4 4000:8080`
+3. To make use of the API in the React Frontend import and run the following:
+```javascript
+import {
+  User,
+  UserNameRequest,
+  UserIdRequest,
+  InstaPostsResponse,
+  UserSearchResponse
+} from "./protofiles/client/usersearch_pb.js";
+import { UserSearchServiceClient } from "./protofiles/client/usersearch_grpc_web_pb.js";
+var userSearch = new UserSearchServiceClient("http://localhost:4000");
+var request = new UserName();
+request.setUserName("codeuniversity");
+userSearch.getUserWithUsername(request, {}, function(err, response) {
+  //example function call...
+});
+```
 
 ## Functions
-- `getUserWithUsername(UserNameRequest) User`
-  > Queries the Database for one specific User
-- `getAllUsersWithUsername(UserNameRequest) UserSearchResponse`
-  > Queries the database for all users that have a similar usenames and returns array of user
-- `getInstaPostssWithUserid(UserIdRequest) InstaPostsResponse`
-  > GetInstaPostsWithUserId takes the User id and returns all Instagram Posts of a User
-- `getTaggedPostsWithUserId(UserIdRequest) InstaPostsResponse`
-  > GetTaggedPostsWithUserId returns all Posts the given User is tagged on
-
-## Testing
-1. `docker-compose up`
-1. initialize the Database with `make init-db`
-1. Then connect with the envoy-proxy via `localhost:4000`
+
+To check the attributes of the proto messages take a look at the protofile [userserach.proto](https://github.com/codeuniversity/smag-mvp/blob/master/api/proto/usersearch.proto)
+
+| **Method** | **Function Name** | **Input Message** | **Return Message** |
+| ---------- | ------------------------ | ----------------- | ------------------ |
+| GET | getUserWithUsername | UserNameRequest | User |
+| GET | getAllUsersLikeUsername | UserNameRequest | UserSearchResponse |
+| GET | getTaggedPostsWithUserId | UserIdRequest | InstaPostsResponse |
+| GET | getInstaPostssWithUserid | UserIdRequest | UserSearchResponse |
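
For context: the table above describes a conventional gRPC service, so the same calls can also be made from a native Go client. The sketch below is hypothetical and only illustrates the call pattern; the generated package import path and the `UserName` field of `UserNameRequest` are assumptions (inferred from the JavaScript `setUserName()` call), and it assumes the gRPC server itself is reachable on `localhost:4000` (e.g. via `kubectl port-forward`).

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"

	// Assumed import path of the Go code generated from api/proto/usersearch.proto.
	pb "github.com/codeuniversity/smag-mvp/api/proto"
)

func main() {
	// Connect to the (port-forwarded) gRPC server; plaintext for local testing.
	conn, err := grpc.Dial("localhost:4000", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	client := pb.NewUserSearchServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// getUserWithUsername(UserNameRequest) -> User, as listed in the table above.
	user, err := client.GetUserWithUsername(ctx, &pb.UserNameRequest{UserName: "codeuniversity"})
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	log.Printf("found user: %v", user)
}
```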

cli/main/main.go (+1 -1)

@@ -13,7 +13,7 @@ import (
 func main() {
 	kafkaAddress := utils.GetStringFromEnvWithDefault("KAFKA_ADDRESS", "my-kafka:9092")
 	instagramTopic := utils.GetStringFromEnvWithDefault("KAFKA_INSTAGRAM_TOPIC", "user_names")
-	twitterTopic := utils.GetStringFromEnvWithDefault("KAFKA_TWITTER_TOPIC", "twitter-user_names")
+	twitterTopic := utils.GetStringFromEnvWithDefault("KAFKA_TWITTER_TOPIC", "twitter.scraped.user_names")
 
 	if len(os.Args) < 3 {
 		panic("Invalid argumemts. Usage: cli <instagram|twitter> <username>")

db/README.md (+47 -2)

@@ -1,4 +1,49 @@
 # postgres database
 
-## schema
-![db_schema](../docs/db_schema.png)
+We are using [POSTGRESQL](https://www.postgresql.org/) as the store for the raw scraped data from the various data sources. <br>
+The schemas are quite similar to the scraped data structures.
+
+**Table of Contents**
+
+- [Instagram](#instagram)
+- [Remarks](#remarks)
+- [Twitter](#twitter)
+- [Debezium](#debezium)
+
+## [Instagram](https://www.instagram.com/)
+
+This database is the more sophisticated one and is running in production.
+
+![insta_schema](../docs/insta_schema.png)
+
+### Remarks
+
+- `internal_picture_url` is pointing to the downloaded picture on S3
+
+## Twitter
+
+This database is not in production yet and at the moment only dumps the tweaked scraped data.
+
+![twitter_schema](../docs/twitter_schema.png)
+
+## Debezium
+
+The [debezium](https://github.com/debezium/debezium) connector generates a change stream from all change events in postgres (`read`, `create`, `update`, `delete`) and writes them into a kafka-topic `"postgres.public.<table_name>"`
+
+To read from this stream you can:
+
+- get [`kafkacat`](https://github.com/edenhill/kafkacat)
+- inspect the topic list in kafka:
+```bash
+$ kafkacat -L -b my-kafka | grep 'topic "postgres'
+```
+- consume a topic with
+```bash
+$ kafkacat -b my-kafka -t <topic_name>
+```
+
+The messages are quite verbose, since they include their own schema description. The most interesting part is the `value.payload`:
+
+```bash
+$ kafkacat -b my-kafka -topic postgres.public.users | jq '.value | fromjson | .payload'`
+```
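
The same `value.payload` extraction described in the diff above can also be done programmatically. Below is a minimal Go sketch of a consumer that reads one Debezium topic and prints only the payload of each change event; the `segmentio/kafka-go` library and the consumer group name are assumptions for illustration, not necessarily what this repo uses.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

// debeziumEnvelope models the relevant part of a Debezium change message:
// the schema description is ignored and only the payload is kept.
type debeziumEnvelope struct {
	Payload json.RawMessage `json:"payload"`
}

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"my-kafka:9092"},
		Topic:   "postgres.public.users",
		GroupID: "change-stream-inspector", // hypothetical consumer group
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		var env debeziumEnvelope
		if err := json.Unmarshal(msg.Value, &env); err != nil {
			log.Printf("skipping malformed message: %v", err)
			continue
		}
		// The payload holds the actual change event (before/after, op, source, ...).
		fmt.Println(string(env.Payload))
	}
}
```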
