README.md
@@ -15,6 +15,60 @@ The core idea of DataSynth, was first described in detail in the paper:
Arnau Prat-Pérez, Joan Guisado-Gámez, Xavier Fernández Salas, Petr Koupy, Siegfried Depner, Davide Basilio Bartolini

## Installing

We use Maven as our build tool. To compile the project, just type the following command in the project's root folder:
```
mvn -DskipTests assembly:assembly
```
Additionally, DataSynth requires a working installation of [Apache Spark](http://spark.apache.org) 2.0.1, compiled for your Hadoop version.

## Running DataSynth

DataSynth uses Apache Spark to generate the data. As a Spark application, it is executed with the spark-submit script provided by Spark. From DataSynth's root folder, execute the following command:

The <kbd>--output-dir</kbd> option specifies the folder where the generated dataset will be placed, while the <kbd>--schema-file</kbd> option specifies the schema of the graph to generate. Paths must be prefixed with "file://" or "hdfs://". The example.json schema file defines the following schema:

Currently, the schema is specified in rather low-level JSON, although we plan to release a domain-specific language for convenience. The above schema specifies the generation of 1000000 entities of type Person, each of which has an Integer attribute. This attribute is generated with the property generator "org.dama.datasynth.common.generators.property.empirical.IntGenerator".

For now, a Property Generator is a class responsible for generating the values of an attribute for a given entity. The "initParameters" field specifies the parameters required to initialize the generator, along with their types. In this case, we pass a pointer to a file containing the distribution of the integer values to generate.
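To make the idea of a property generator more concrete, here is a minimal Python sketch of the concept. It is not DataSynth's API: the class name `EmpiricalIntGenerator`, the `run` method, and the `<value> <probability>` line format are all assumptions made for illustration. The sketch loads an empirical distribution and maps each entity id to a reproducible value through an inverse-CDF lookup.

```python
import bisect
import hashlib
from itertools import accumulate


class EmpiricalIntGenerator:
    """Hypothetical stand-in for a property generator; not DataSynth's API."""

    def __init__(self, lines):
        # Assumed format: each non-empty line holds "<value> <probability>".
        pairs = [line.split() for line in lines if line.strip()]
        self.values = [int(value) for value, _ in pairs]
        weights = [float(weight) for _, weight in pairs]
        total = sum(weights)
        # Cumulative distribution, normalised so the last entry is 1.0.
        self.cdf = [w / total for w in accumulate(weights)]

    def run(self, entity_id):
        # Hash the entity id so the same entity always gets the same value.
        digest = hashlib.sha256(str(entity_id).encode()).digest()
        u = int.from_bytes(digest[:8], "big") / 2**64
        # Inverse-CDF lookup: first bucket whose cumulative mass exceeds u.
        index = bisect.bisect_right(self.cdf, u)
        return self.values[min(index, len(self.values) - 1)]


# In practice the lines would be read from the distribution file.
generator = EmpiricalIntGenerator(["18 0.2", "25 0.5", "40 0.3"])
ages = [generator.run(person_id) for person_id in range(1000)]
```

Deriving the value from the entity id, rather than from shared random state, means any worker can compute the attribute of any entity independently, which suits a distributed setting such as Spark. Whether DataSynth's generators work this way is not stated here.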

The schema also specifies the generation of an edge type named "knows", which connects pairs of persons. The edge is generated with a Structure Generator, which is responsible for generating the graph that connects the nodes. In this case, we use a BTER graph generator, which takes the degree distribution and the average clustering coefficient per degree as parameters.
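The two BTER parameters named above can be illustrated by measuring them on a small existing graph. The Python sketch below does not implement BTER and does not reflect the file format DataSynth expects; the function name `bter_inputs` and the edge-list representation are assumptions. It only shows what "degree distribution" and "average clustering coefficient per degree" mean.

```python
from collections import Counter, defaultdict
from itertools import combinations


def bter_inputs(edges):
    """Return (degree distribution, average clustering coefficient per degree)."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)

    # Degree distribution: how many nodes have each degree.
    degree_counts = Counter(len(adj) for adj in neighbours.values())

    # Local clustering coefficient of each node, grouped by the node's degree.
    clustering_by_degree = defaultdict(list)
    for node, adj in neighbours.items():
        degree = len(adj)
        if degree < 2:
            local = 0.0
        else:
            # Count the pairs of neighbours that are themselves connected.
            links = sum(1 for a, b in combinations(adj, 2) if b in neighbours[a])
            local = links / (degree * (degree - 1) / 2)
        clustering_by_degree[degree].append(local)

    average_cc = {d: sum(cs) / len(cs) for d, cs in clustering_by_degree.items()}
    return dict(degree_counts), average_cc


# A triangle (1, 2, 3) with one pendant node (4) attached to node 3.
knows = [(1, 2), (2, 3), (1, 3), (3, 4)]
degrees, clustering = bter_inputs(knows)
```

For this toy graph, the two nodes of degree 2 sit in a closed triangle and have clustering coefficient 1, while node 3 has three neighbours of which only one pair is connected, giving 1/3.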

## Contributing

Feel free to contribute to the project by issuing pull requests, suggestions