Commit 5980176

Opensource code for GAIA (#173)
1 parent 705a7e6 commit 5980176

376 files changed: +48933 −1 lines changed


NOTICE.txt

+4
@@ -55,3 +55,7 @@ This product includes software from the hgraphdb project (Apache 2.0)
 This product includes software from the cuba-platform/cuba project (Apache 2.0)
 * Copyright (c) 2008-2016 Haulmont
 * https://github.com/cuba-platform/cuba
+
+This product includes software from the natelandau/shell-scripts project
+* Which has no license, but open sourced to public domain by Nathaniel Landau
+* https://github.com/natelandau/shell-scripts

research/gaia/README.md

+105-1
@@ -1 +1,105 @@
# Overview
GAIA (GrAph Interactive Analytics) is a full-fledged system for large-scale interactive graph analytics in a distributed context.
GAIA is based on TinkerPop's [Gremlin](https://tinkerpop.apache.org/) query language. Given a Gremlin query, GAIA
compiles it into a dataflow with the help of the powerful Scope abstraction, and then schedules the computation in
a distributed runtime.

GAIA has been deployed at [Alibaba Corporation](https://www.alibaba.com/) to support a wide range of businesses, from
e-commerce to cybersecurity. This repository contains the three main components of its architecture:
* GAIA compiler: As the main technical contribution of GAIA, we propose a powerful abstraction
called *Scope* in order to hide the complex control flow (e.g. conditionals and loops) and fine-grained dependencies of
a Gremlin query from the dataflow engine. Taking a Gremlin query as input, the GAIA compiler is responsible for
compiling it to a dataflow (with the Scope abstraction) to be executed in the dataflow engine. The compiler
is built on top of the [Gremlin server](http://tinkerpop.apache.org/docs/3.4.3/reference/#connecting-gremlin-server)
interface so that the system can seamlessly interact with the TinkerPop ecosystem, including development tools
such as the [Gremlin Console](http://tinkerpop.apache.org/docs/3.4.3/reference/#gremlin-console)
and language wrappers such as Java and Python.
* Distributed runtime: The GAIA execution runtime provides automatic support for efficient execution of Gremlin
queries at scale. Each query is compiled by the GAIA compiler into a distributed execution plan that is
partitioned across multiple compute nodes for parallel execution. Each partition runs on a separate compute node,
managed by a local executor that schedules and executes computation on a multi-core server.
* Distributed graph store: The storage layer maintains an input graph that is hash-partitioned across a cluster,
with each vertex placed together with its adjacent (both incoming and outgoing) edges and their attributes.
Here we assume that the storage is coupled with the execution runtime for simplicity, that is, each
local executor holds a separate graph partition. In production, more storage functionality has been developed,
including snapshot isolation, fault tolerance, and extensible APIs for cloud storage services, but these are
excluded from the open-sourced stack due to conflicts of interest.

# Preparation
## Dependencies
GAIA builds, runs, and has been tested on GNU/Linux (more specifically, CentOS 7).
Even though GAIA may build on systems similar to Linux, we have not tested correctness or performance,
so please beware.

At a minimum, GAIA depends on the following software:
* [Rust](https://www.rust-lang.org/) (>= 1.49): GAIA currently works on Rust 1.49, and we expect it to also work
with any later version.
* Java (JDK 8): Due to a known issue of gRPC that uses an older version of the Java annotation APIs, the project is
restricted to JDK 8 for now.
* Protobuf (3.0): The Rust codegen is powered by [prost](https://github.com/danburkert/prost).
* gRPC: gRPC is used for communication between Rust (engine) and Java (Gremlin server/client). The Rust
implementation is powered by [tonic](https://github.com/hyperium/tonic).
* For other Rust and Java dependencies, check:
  * `./gremlin/compiler/pom.xml`
  * `./gremlin/gremlin_core/Cargo.toml`
  * `./graph_store/Cargo.toml`
  * `./pegasus/Cargo.toml`

## Building the Code
TODO

## Generate Graph Data
Please refer to `./graph_store/README.md` for details.

# Deployment
## Deploy GAIA Services
TODO
### Single-machine Deployment
### Distributed Deployment
## Start Gremlin Server
After successfully building the code, you can find `gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar` in
`./gremlin/compiler/gremlin-server-plugin/target`. Copy it to wherever you want to start the server:
```
cp ./gremlin/compiler/gremlin-server-plugin/target/gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar /path/to/your/dir
cp -r ./gremlin/compiler/conf /path/to/your/dir
cd /path/to/your/dir
```

There are some configurations to make in `./conf`:
* Gremlin server address and port: TODO
* The graph storage schema: For your reference, we've provided the schema file
`./conf/modern.schema.json` for [TinkerPop's modern graph](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/),
and `./conf/ldbc.schema.json` for [LDBC generated data](https://github.com/ldbc/ldbc_snb_datagen).
TODO: How to customize the schema.

Then start up the Gremlin server using:
```
java -cp .:gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar com.compiler.demo.server.GremlinServiceMain
```

## Run Query
- Download TinkerPop's official [gremlin-console](https://archive.apache.org/dist/tinkerpop/3.4.9/apache-tinkerpop-gremlin-console-3.4.9-bin.zip)
- `cd path/to/gremlin/console`, then modify `conf/remote.yaml`:
```
hosts: [localhost]  # TODO: the hosts and port should align with the server configuration above
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
```
- Start the console:
```
./bin/gremlin.sh
:remote connect tinkerpop.server conf/remote.yaml
:remote console
```
- Submit queries in the console. Have fun!
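For example, once the remote console is connected, standard Gremlin traversals can be typed at the `gremlin>` prompt. The queries below assume a graph following [TinkerPop's modern graph](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/) has been loaded; actual results depend on the data in your GAIA store:

```
gremlin> g.V().hasLabel('person').count()
gremlin> g.V().has('name','marko').out('knows').values('name')
gremlin> g.E().count()
```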

# Contact
TODO

# Acknowledgements
TODO

# Publications
1. GAIA: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language. Zhengping Qian,
Chenqiang Min, Longbin Lai, Yong Fang, Gaofeng Li, Youyang Yao, Bingqing Lyu, Xiaoli Zhou, Zhimin Chen, Jingren Zhou.
18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2021), to appear.

research/gaia/graph_store/Cargo.toml

+34
@@ -0,0 +1,34 @@
[package]
name = "graph_store"
version = "0.2.0"
edition = "2018"

[features]
jemalloc = ["jemallocator"]

[dependencies]
abomonation = "0.7.3"
abomonation_derive = "0.5"
bincode = "1.0.1"
clap = "2.32.0"
csv = "1.1"
env_logger = "0.7.1"
lazy_static = "1.1.1"
log = "0.4"
indexmap = { version = "1.3", features = ["serde-1"] }
itertools = "0.9"
jemallocator = { version = "0.3.0", optional = true }
petgraph = { version = "0.5.0", features = ["serde-1"] }
rand = "0.5.5"
rocksdb = "0.14.0"
serde = { version = "1.0", features = ["derive"] }
serde_cbor = "0.9.0"
serde_derive = "1.0"
serde_json = "1.0"
tempdir = "0.3.7"
timely = "0.10"
walkdir = "2"

[profile.release]
lto = true
panic = "abort"
research/gaia/graph_store/README.md

+69
@@ -0,0 +1,69 @@
# Overview
This code maintains the distributed graph storage for GAIA. We adopt the property graph model as
advocated in modern graph databases such as [Neo4j](https://neo4j.com/) and [Tinkerpop](https://tinkerpop.apache.org/).
We split the graph data into two parts, namely structure data and property data. The structure data contains
vertices and edges and their labels. Each vertex is identified by a globally unique identity, while the edges are
maintained in the associated vertices using conventional adjacency lists. We leverage the Rust graph library
[petgraph](https://github.com/petgraph/petgraph) to maintain the structure data.

The property data are maintained in a variety of ways, as can be found in `src/table.rs`, namely:
* `PropertyTable`: The default option of an in-memory hash table.
* `SingleValueTable`: An optimized in-memory table that maintains one single value. Although vertices
usually contain multiple properties, it is very common in practice for an edge to contain only a single property.
In addition, edges are often far more numerous (10x~100x) than vertices. We thus implement `SingleValueTable`
as an optimization to ease the edges' storage burden.
* `RocksTable`: An option to leverage [RocksDB](https://rocksdb.org/) for disk-based storage.

# Usage of the LDBC Parser
## Preliminaries
We currently provide a tool for parsing (and partitioning) the LDBC raw data generated by
[LDBC Datagen](https://github.com/ldbc/ldbc_snb_datagen) into our distributed storage.
LDBC vertices are uniquely identified by their vertex type (label) and id. We leverage this feature by mapping
each vertex label to a label id, and then assigning each vertex a globally unique id using the combination
of its label id and LDBC id. Note that certain vertices may have a two-level (primary and secondary) label; for
example, a `Company` vertex also has the primary label `Organization`. In this case, the primary
label will be used. Edges are not first-class citizens in our design, and will be indexed
according to their source (and target) vertices. Given the global id, partitioning is straightforward:
in a cluster of k machines, we assign each vertex to one of the machines according to the hash value of its global id;
the edges associated with a vertex v will be placed on the machine of v. Here, both the incoming and outgoing edges
are considered by default, while an option of only outgoing edges will be provided.
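The global-id scheme and hash-based placement described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not GAIA's actual code: the bit layout (label id in the high 8 bits) and the use of the standard library's `DefaultHasher` are choices made here for the sketch only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pack a label id and an LDBC id into one globally unique 64-bit id.
/// Assumed layout for illustration: label id in the high 8 bits.
fn global_id(label_id: u8, ldbc_id: u64) -> u64 {
    ((label_id as u64) << 56) | (ldbc_id & ((1u64 << 56) - 1))
}

/// Assign a vertex to one of `k` machines by hashing its global id.
fn partition(gid: u64, k: u64) -> u64 {
    let mut h = DefaultHasher::new();
    gid.hash(&mut h);
    h.finish() % k
}

fn main() {
    let k = 4; // a cluster of k machines
    let v = global_id(3, 111); // e.g. a vertex with label id 3 and LDBC id 111
    // A vertex and all of its adjacent edges land on the same machine,
    // since placement depends only on the vertex's global id.
    println!("vertex 111 -> global id {}, machine {}", v, partition(v, k));
}
```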

## LDBC Data Gen & Preprocessing
After generating LDBC data in HDFS, there is a folder like `hdfs://<ip:port>/path/to/ldbc/data/social_network_xx/`, in which
the vertex data of type `VType` is maintained in the file `VType_0_0.csv`, and the edge data of type `EType` (with
source vertex type `<SrcType>` and target vertex type `<TgtType>`) is maintained in the file
`<SrcType_EType_TgtType_0_0.csv>`. We require users to write a Hadoop MR program to initially partition the raw graph
data. After the pre-partitioning, the vertex data `VType_0_0.csv` must be stored in a folder
`hdfs://<ip:port>/path/to/partitioned/data/VType/` that has the file fragments `part-0000`, `part-0001`, etc.
The same applies to each edge data file.

## Data Schema
The schema contains the following metadata for the graph storage:
* Mapping from vertex label to label id.
* Mapping from edge label to a 3-tuple, which contains the edge label id, source vertex label id, and target vertex
label id.
* The properties (name and datatype) of each type of vertex/edge.

The schema file is formatted as JSON. We have provided a sample schema file for LDBC data in `data/schema.json`.
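Purely as an illustrative sketch of the three kinds of metadata listed above (the key names and type strings here are assumptions; consult the provided `data/schema.json` for the actual format), such a schema might look like:

```json
{
  "vertex": [
    { "label": "PERSON", "label_id": 0,
      "properties": [ { "name": "firstName", "type": "String" },
                      { "name": "birthday", "type": "Long" } ] }
  ],
  "edge": [
    { "label": "KNOWS", "label_id": 0,
      "src_label_id": 0, "dst_label_id": 0,
      "properties": [ { "name": "creationDate", "type": "Long" } ] }
  ]
}
```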

## Parsing Tools
Suppose the LDBC raw data has been preprocessed and stored
either on local disk as `fs:///path/to/ldbc/data`, or
in HDFS as `hdfs://<ip_addr:port>/path/to/ldbc/data`. In addition, suppose we are parsing the data in
a cluster of k machines, configured in a file named `hosts`, in which each line has the form `<ip_addr:port>`.

One simply calls the following to parse the graph data:
```
./parse.sh -r <root_dir> -d <ldbc_dir> -p <ldbc_partitions> -g <graph_dir> -w <graph_partitions>
           -s <graph_schema> -t <hosts>
```
where:
* `root_dir` is the working directory of the parsing tools.
* `ldbc_dir` records the LDBC raw data's directory (local fs or HDFS).
* `ldbc_partitions` is the number of partitions of the LDBC raw data (after preprocessing).
* `graph_dir` is the main directory in which the parsed data will be maintained.
* `graph_partitions` is the number of partitions of the parsed data per **each machine**.
* `graph_schema` is the JSON-formatted schema file of the graph data.
* `hosts` records the hosts in the cluster.
@@ -0,0 +1,5 @@
111|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|yes|3
222|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|thanks|6
333|32012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|LOL|3
444|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|I see|5
555|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|fine|4

@@ -0,0 +1,4 @@
666|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|right|5
777|2012-07-21T07:59:14.322+000|204.79.128.176|Firefox|About George Frideric Handel, ful with hisAbout Erwin Rommel, mandy. As onAbout|79
888|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|good|4
999|2012-07-21T07:59:14.322+000|46.19.159.176|Safari|no|2

@@ -0,0 +1,5 @@
111|222
222|333
333|333
444|333
555|333

@@ -0,0 +1,4 @@
666|777|2061584302088|10027
777|777|2061584302089|10027
888|777|2061584302090|10027
999|888|2061584302091|2199023256684

@@ -0,0 +1,5 @@
111|Mahinda|Perera|male|19891203|20100214153210447|119.235.7.103|Firefox
222|Carmen|Lepland|female|19840218|20100128063958781|195.20.151.175|Internet Explorer
333|Hồ Chí|Do|male|19881014|20100215004617657|103.2.223.188|Internet Explorer
444|Rahul|Kumar|female|19800202|20100212122143365|27.97.186.123|Internet Explorer
555|Rahul|Reddy|female|19820529|20100121104441479|27.97.237.23|Firefox

@@ -0,0 +1,4 @@
666|Albin|Monteno|male|19860409|20100216235236860|94.250.4.124|Internet Explorer
777|Meera|Rao|female|19821208|20100122195959221|49.249.98.96|Firefox
888|A.|Rao|female|19850802|20100423225226582|49.202.188.25|Firefox
999|Jack|Smith|male|19810419|20100425054511772|24.212.6.75|Internet Explorer

@@ -0,0 +1,5 @@
111|222|20100313073721718
111|333|20100920094243187
111|444|20110102064341955
111|555|20120907011130195
111|666|20120717080449463

@@ -0,0 +1,4 @@
222|444|20100804033836982
222|777|20100202163844119
222|888|20100331220757321
222|999|20100724111548162
(4 binary files not shown)
