GitHub

Preprocess dataset

Preprocessing is packaged as a Spark Job.

Prerequisites

Install Confluent platform v3.0.x from http://www.confluent.io
Start zookeeper, kafka broker/s and schema registry. Note - schema registry default port is 8081, but latest version of code looks at You have either to change port in code or in schema registry properties
Create topics

    dim_part_8
    fact_part_8
    fact_part_8_key_by_join_key
    merged_output

Install sbt and run sbt compile or install IntelliJ and import project

Running

Run PrepareDataset main class with single argument - name of directory which contains perf_dump.txt. It doesn't terminate itself, you have to monitor progress at http://localhost:4040
Run PublishData with same argument.

Now your data is in Kafka topics

Running KafkaStreams join

Just run KafkaStreamsMap and hit ENTER. It has some significant startup lag, but later you should see numbers running.

Sample output:

M/ps: 31250.00 (20685.76 avg), with app start overhead - 19841.32, Total: 11991000
M/ps: 29411.76 (20686.28 avg), with app start overhead - 19841.85, Total: 11992000
M/ps: 31250.00 (20686.86 avg), with app start overhead - 19842.46, Total: 11993000
M/ps: 25000.00 (20687.16 avg), with app start overhead - 19842.80, Total: 11994000
M/ps: 29411.76 (20687.67 avg), with app start overhead - 19843.34, Total: 11995000
M/ps: 30303.03 (20688.22 avg), with app start overhead - 19843.91, Total: 11996000
M/ps: 23809.52 (20688.44 avg), with app start overhead - 19844.18, Total: 11997000
M/ps: 29411.76 (20688.95 avg), with app start overhead - 19844.72, Total: 11998000
M/ps: 28571.43 (20689.43 avg), with app start overhead - 19845.23, Total: 11999000
M/ps: 19607.84 (20689.33 avg), with app start overhead - 19845.21, Total: 12000000
M/ps: 26315.79 (20689.70 avg), with app start overhead - 19845.61, Total: 12001000
M/ps: 24390.24 (20689.96 avg), with app start overhead - 19845.92, Total: 12002000

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
core/src/main		core/src/main
flink/src/main/scala		flink/src/main/scala
kafka/src/main/scala		kafka/src/main/scala
project		project
spark/src/main/scala		spark/src/main/scala
.gitignore		.gitignore
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Preprocess dataset

Prerequisites

Running

Running KafkaStreams join

About

Releases

Packages

Languages

ytaras/kafka_join_poc

Folders and files

Latest commit

History

Repository files navigation

Preprocess dataset

Prerequisites

Running

Running KafkaStreams join

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages