Commit 5980176

Opensource code for GAIA (#173)
1 parent 705a7e6 commit 5980176

376 files changed: +48933 −1 lines changed


NOTICE.txt

+4
@@ -55,3 +55,7 @@ This product includes software from the hgraphdb project (Apache 2.0)
 This product includes software from the cuba-platform/cuba project (Apache 2.0)
 * Copyright (c) 2008-2016 Haulmont
 * https://github.com/cuba-platform/cuba
+
+This product includes software from the natelandau/shell-scripts project
+* Which has no license, but open sourced to public domain by Nathaniel Landau
+* https://github.com/natelandau/shell-scripts

research/gaia/README.md

+105-1
@@ -1 +1,105 @@
# Overview
GAIA (GrAph Interactive Analytics) is a full-fledged system for large-scale interactive graph analytics in a distributed context.
GAIA is based on TinkerPop's [Gremlin](https://tinkerpop.apache.org/) query language. Given a Gremlin query, GAIA
compiles it into a dataflow with the help of the powerful Scope abstraction, and then schedules the computation in
a distributed runtime.

GAIA has been deployed at [Alibaba Corporation](https://www.alibaba.com/) to support a wide range of businesses, from
e-commerce to cybersecurity. This repository contains the three main components of its architecture:
* GAIA compiler: As the main technical contribution of GAIA, we propose a powerful abstraction
called *Scope* in order to hide the complex control flow (e.g. conditionals and loops) and fine-grained dependencies of
a Gremlin query from the dataflow engine. Taking a Gremlin query as input, the GAIA compiler is responsible for
compiling it to a dataflow (with the Scope abstraction) to be executed in the dataflow engine. The compiler
is built on top of the [Gremlin server](http://tinkerpop.apache.org/docs/3.4.3/reference/#connecting-gremlin-server)
interface so that the system can seamlessly interact with the TinkerPop ecosystem, including development tools
such as the [Gremlin Console](http://tinkerpop.apache.org/docs/3.4.3/reference/#gremlin-console)
and language wrappers such as Java and Python.
* Distributed runtime: The GAIA execution runtime provides automatic support for efficient execution of Gremlin
queries at scale. Each query is compiled by the GAIA compiler into a distributed execution plan that is
partitioned across multiple compute nodes for parallel execution. Each partition runs on a separate compute node,
managed by a local executor that schedules and executes computation on a multi-core server.
* Distributed graph store: The storage layer maintains an input graph that is hash-partitioned across a cluster,
with each vertex placed together with its adjacent (both incoming and outgoing) edges and their attributes.
Here we assume that the storage is coupled with the execution runtime for simplicity, that is, each
local executor holds a separate graph partition. In production, more storage functionality has been developed,
including snapshot isolation, fault tolerance, and extensible APIs for cloud storage services, but these are
excluded from the open-sourced stack due to conflicts of interest.

# Preparation
## Dependencies
GAIA builds, runs, and has been tested on GNU/Linux (more specifically, CentOS 7).
Even though GAIA may build on systems similar to Linux, we have not tested correctness or performance,
so please beware.

At a minimum, GAIA depends on the following software:
* [Rust](https://www.rust-lang.org/) (>= 1.49): GAIA currently works on Rust 1.49, and we expect it to also work
with any later version.
* Java (JDK 8): Due to a known issue of gRPC that uses an older version of the Java annotation APIs, the project is
restricted to JDK 8 for now.
* Protobuf (3.0): The Rust codegen is powered by [prost](https://github.com/danburkert/prost).
* gRPC: gRPC is used for communication between Rust (engine) and Java (Gremlin server/client). The Rust
implementation is powered by [tonic](https://github.com/hyperium/tonic).
* For other Rust and Java dependencies, check:
  * `./gremlin/compiler/pom.xml`
  * `./gremlin/gremlin_core/Cargo.toml`
  * `./graph_store/Cargo.toml`
  * `./pegasus/Cargo.toml`

## Building the Code
TODO

## Generate Graph Data
Please refer to `./graph_store/README.md` for details.

# Deployment
## Deploy GAIA Services
TODO
### Single-machine Deployment
### Distributed Deployment
## Start Gremlin Server
After successfully building the code, you can find `gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar` in
`./gremlin/compiler/gremlin-server-plugin/target`. Copy it to wherever you want to start the server:
```
cp ./gremlin/compiler/gremlin-server-plugin/target/gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar /path/to/your/dir
cp -r ./gremlin/compiler/conf /path/to/your/dir
cd /path/to/your/dir
```

There are some configurations to make in `./conf`:
* Gremlin server address and port: TODO
* The graph storage schema: For your reference, we've provided the schema file
`./conf/modern.schema.json` for [TinkerPop's modern graph](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/),
and `./conf/ldbc.schema.json` for [LDBC generated data](https://github.com/ldbc/ldbc_snb_datagen).
TODO: How to customize the schema.

Then start up the Gremlin server using:
```
java -cp .:gremlin-server-plugin-1.0-SNAPSHOT-jar-with-dependencies.jar com.compiler.demo.server.GremlinServiceMain
```

## Run Query
- Download TinkerPop's official [gremlin-console](https://archive.apache.org/dist/tinkerpop/3.4.9/apache-tinkerpop-gremlin-console-3.4.9-bin.zip)
- `cd path/to/gremlin/console`, then modify `conf/remote.yaml`:
```
hosts: [localhost]  # TODO: the hosts and port should align with the server configuration above
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
```
- Start the console:
```
./bin/gremlin.sh
:remote connect tinkerpop.server conf/remote.yaml
:remote console
```
- Submit queries in the console. Have fun!
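For example, once the remote console is connected, standard Gremlin traversals can be typed at the `gremlin>` prompt. The queries below assume a graph following [TinkerPop's modern graph](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/) has been loaded; actual results depend on the data in your GAIA store:

```
gremlin> g.V().hasLabel('person').count()
gremlin> g.V().has('name','marko').out('knows').values('name')
gremlin> g.E().count()
```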

# Contact
TODO

# Acknowledgements
TODO

# Publications
1. GAIA: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language. Zhengping Qian,
Chenqiang Min, Longbin Lai, Yong Fang, Gaofeng Li, Youyang Yao, Bingqing Lyu, Xiaoli Zhou, Zhimin Chen, Jingren Zhou.
18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2021), to appear.

research/gaia/graph_store/Cargo.toml

+34
@@ -0,0 +1,34 @@
[package]
name = "graph_store"
version = "0.2.0"
edition = "2018"

[features]
jemalloc = ["jemallocator"]

[dependencies]
abomonation = "0.7.3"
abomonation_derive = "0.5"
bincode = "1.0.1"
clap = "2.32.0"
csv = "1.1"
env_logger = "0.7.1"
lazy_static = "1.1.1"
log = "0.4"
indexmap = { version = "1.3", features = ["serde-1"] }
itertools = "0.9"
jemallocator = { version = "0.3.0", optional = true }
petgraph = { version = "0.5.0", features = ["serde-1"] }
rand = "0.5.5"
rocksdb = "0.14.0"
serde = { version = "1.0", features = ["derive"] }
serde_cbor = "0.9.0"
serde_derive = "1.0"
serde_json = "1.0"
tempdir = "0.3.7"
timely = "0.10"
walkdir = "2"

[profile.release]
lto = true
panic = "abort"
research/gaia/graph_store/README.md

+69
@@ -0,0 +1,69 @@
# Overview
This code maintains the distributed graph storage for GAIA. We adopt the property graph model as
advocated in modern graph databases such as [Neo4j](https://neo4j.com/) and [Tinkerpop](https://tinkerpop.apache.org/).
We split the graph data into two parts, namely structure data and property data. The structure data contains
vertices and edges and their labels. Each vertex is identified by a globally unique identity, while the edges are
maintained in the associated vertices using conventional adjacency lists. We leverage the Rust graph library
[petgraph](https://github.com/petgraph/petgraph) to maintain the structure data.

The property data are maintained in a variety of ways, as can be found in `src/table.rs`, namely:
* `PropertyTable`: The default option of an in-memory hash table.
* `SingleValueTable`: An optimized in-memory table that maintains one single value. Although vertices
usually contain multiple properties, it is very common in practice for an edge to contain only a single property.
In addition, edges are often far more numerous (10x~100x) than vertices. We thus implement `SingleValueTable`
as an optimization to ease the edges' storage burden.
* `RocksTable`: An option to leverage [RocksDB](https://rocksdb.org/) for disk-based storage.

# Usage of the LDBC Parser
## Preliminaries
We currently provide a tool for parsing (and partitioning) the LDBC raw data generated by
[LDBC Datagen](https://github.com/ldbc/ldbc_snb_datagen) into our distributed storage.
LDBC vertices are uniquely identified by their vertex type (label) and id. We leverage this feature by mapping
each vertex label to a label id, and then assigning each vertex a globally unique id using the combination
of its label id and LDBC id. Note that certain vertices may have a two-level (primary and secondary) label; for
example, a `Company` vertex also has the primary label `Organization`. In this case, the primary
label will be used. Edges are not first-class citizens in our design, and will be indexed
according to their source (and target) vertices. Given the global id, partitioning is straightforward:
in a cluster of k machines, we assign each vertex to one of the machines according to the hash value of its global id;
the edges associated with a vertex v will be placed on the machine of v. Here, both the incoming and outgoing edges
are considered by default, while an option of only outgoing edges will be provided.
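The global-id scheme and hash-based placement described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not GAIA's actual code: the bit layout (label id in the high 8 bits) and the use of the standard library's `DefaultHasher` are choices made here for the sketch only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pack a label id and an LDBC id into one globally unique 64-bit id.
/// Assumed layout for illustration: label id in the high 8 bits.
fn global_id(label_id: u8, ldbc_id: u64) -> u64 {
    ((label_id as u64) << 56) | (ldbc_id & ((1u64 << 56) - 1))
}

/// Assign a vertex to one of `k` machines by hashing its global id.
fn partition(gid: u64, k: u64) -> u64 {
    let mut h = DefaultHasher::new();
    gid.hash(&mut h);
    h.finish() % k
}

fn main() {
    let k = 4; // a cluster of k machines
    let v = global_id(3, 111); // e.g. a vertex with label id 3 and LDBC id 111
    // A vertex and all of its adjacent edges land on the same machine,
    // since placement depends only on the vertex's global id.
    println!("vertex 111 -> global id {}, machine {}", v, partition(v, k));
}
```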

## LDBC Data Gen & Preprocessing
After generating LDBC data in HDFS, there is a folder like `hdfs://<ip:port>/path/to/ldbc/data/social_network_xx/`, in which
the vertex data of type `VType` is maintained in the file `VType_0_0.csv`, and the edge data of type `EType` (with
source vertex type `<SrcType>` and target vertex type `<TgtType>`) is maintained in the file
`<SrcType_EType_TgtType_0_0.csv>`. We require users to write a Hadoop MR program to initially partition the raw graph
data. After the pre-partitioning, the vertex data `VType_0_0.csv` must be stored in a folder
`hdfs://<ip:port>/path/to/partitioned/data/VType/` that has the file fragments `part-0000`, `part-0001`, etc.
The same applies to each edge data file.

## Data Schema
The schema contains the following metadata for the graph storage:
* Mapping from vertex label to label id.
* Mapping from edge label to a 3-tuple, which contains the edge label id, source vertex label id, and target vertex
label id.
* The properties (name and datatype) of each type of vertex/edge.

The schema file is formatted as JSON. We have provided a sample schema file for LDBC data in `data/schema.json`.
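Purely as an illustrative sketch of the three kinds of metadata listed above (the key names and type strings here are assumptions; consult the provided `data/schema.json` for the actual format), such a schema might look like:

```json
{
  "vertex": [
    { "label": "PERSON", "label_id": 0,
      "properties": [ { "name": "firstName", "type": "String" },
                      { "name": "birthday", "type": "Long" } ] }
  ],
  "edge": [
    { "label": "KNOWS", "label_id": 0,
      "src_label_id": 0, "dst_label_id": 0,
      "properties": [ { "name": "creationDate", "type": "Long" } ] }
  ]
}
```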

## Parsing Tools
Suppose the LDBC raw data has been preprocessed and stored
either on local disk as `fs:///path/to/ldbc/data`, or
in HDFS as `hdfs://<ip_addr:port>/path/to/ldbc/data`. In addition, suppose we are parsing the data in
a cluster of k machines, configured in a file named `hosts`, in which each line has the form `<ip_addr:port>`.

One simply calls the following to parse the graph data:
```
./parse.sh -r <root_dir> -d <ldbc_dir> -p <ldbc_partitions> -g <graph_dir> -w <graph_partitions>
           -s <graph_schema> -t <hosts>
```
where:
* `root_dir` is the working directory of the parsing tools.
* `ldbc_dir` records the LDBC raw data's directory (local fs or HDFS).
* `ldbc_partitions` is the number of partitions of the LDBC raw data (after preprocessing).
* `graph_dir` is the main directory in which the parsed data will be maintained.
* `graph_partitions` is the number of partitions of the parsed data per **each machine**.
* `graph_schema` is the JSON-formatted schema file of the graph data.
* `hosts` records the hosts in the cluster.
@@ -0,0 +1,5 @@
111|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|yes|3
222|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|thanks|6
333|32012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|LOL|3
444|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|I see|5
555|2012-07-21T07:59:14.322+000|213.55.127.9|Internet Explorer|fine|4

@@ -0,0 +1,4 @@
666|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|right|5
777|2012-07-21T07:59:14.322+000|204.79.128.176|Firefox|About George Frideric Handel, ful with hisAbout Erwin Rommel, mandy. As onAbout|79
888|2012-07-21T07:59:14.322+000|92.39.58.88|Chrome|good|4
999|2012-07-21T07:59:14.322+000|46.19.159.176|Safari|no|2

@@ -0,0 +1,5 @@
111|222
222|333
333|333
444|333
555|333

@@ -0,0 +1,4 @@
666|777|2061584302088|10027
777|777|2061584302089|10027
888|777|2061584302090|10027
999|888|2061584302091|2199023256684

@@ -0,0 +1,5 @@
111|Mahinda|Perera|male|19891203|20100214153210447|119.235.7.103|Firefox
222|Carmen|Lepland|female|19840218|20100128063958781|195.20.151.175|Internet Explorer
333|Hồ Chí|Do|male|19881014|20100215004617657|103.2.223.188|Internet Explorer
444|Rahul|Kumar|female|19800202|20100212122143365|27.97.186.123|Internet Explorer
555|Rahul|Reddy|female|19820529|20100121104441479|27.97.237.23|Firefox

@@ -0,0 +1,4 @@
666|Albin|Monteno|male|19860409|20100216235236860|94.250.4.124|Internet Explorer
777|Meera|Rao|female|19821208|20100122195959221|49.249.98.96|Firefox
888|A.|Rao|female|19850802|20100423225226582|49.202.188.25|Firefox
999|Jack|Smith|male|19810419|20100425054511772|24.212.6.75|Internet Explorer

@@ -0,0 +1,5 @@
111|222|20100313073721718
111|333|20100920094243187
111|444|20110102064341955
111|555|20120907011130195
111|666|20120717080449463

@@ -0,0 +1,4 @@
222|444|20100804033836982
222|777|20100202163844119
222|888|20100331220757321
222|999|20100724111548162
(4 binary files not shown)
