
Commit b000de6

Master fix (#1)
* Fixed the code to allow the execution of DataSynth in local mode using the spark-submit script
* Update README.md
1 parent 44fd589 commit b000de6

File tree

10 files changed: +130 -15 lines changed


README.md

+54
@@ -15,6 +15,60 @@ The core idea of DataSynth was first described in detail in the paper:

Arnau Prat-Pérez, Joan Guisado-Gámez, Xavier Fernández Salas, Petr Koupy, Siegfried Depner, Davide Basilio Bartolini

## Installing

We use Maven as our build tool. To compile the project, just type the following command in the project's root folder:
```
mvn -DskipTests assembly:assembly
```
Additionally, DataSynth requires a working installation of [Apache Spark](http://spark.apache.org) 2.0.1, compiled for your Hadoop version.

## Running DataSynth

DataSynth uses Apache Spark to perform the generation of data. As a Spark application, it is executed using the spark-submit script provided by Spark. From DataSynth's root folder, execute the following command:
```
$SPARK_HOME/bin/spark-submit -v --master local[*] --class org.dama.datasynth.DataSynth target/datasynth-1.0-SNAPSHOT-jar-with-dependencies.jar --schema-file file://./src/main/resources/examples/example.json --output-dir file://./datasynth
```
The <kbd>--output-dir</kbd> option specifies the folder where the generated dataset will be placed, while <kbd>--schema-file</kbd> specifies the schema of the graph to generate. Prefixing paths with "file://" or "hdfs://" is required. The example.json schema file defines the following schema:

```json
{
  "nodeTypes" : [
    {
      "name" : "person",
      "instances" : 1000000,
      "properties" : [
        {
          "name" : "attribute1",
          "dataType" : "Int",
          "generator" : {
            "name" : "org.dama.datasynth.common.generators.property.empirical.IntGenerator",
            "dependencies" : [],
            "initParameters" : ["file://./src/main/resources/distributions/intDistribution.txt:File", " :String"]
          }
        }
      ]
    }
  ],
  "edgeTypes" : [
    {
      "name" : "knows",
      "source" : "person",
      "target" : "person",
      "structure" : {
        "name" : "org.dama.datasynth.common.generators.structure.BTERGenerator",
        "initParameters" : ["file://./src/main/resources/degrees/dblp:File", "file://./src/main/resources/ccs/dblp:File"]
      }
    }
  ]
}
```

Currently, the schema is specified in rather low-level JSON, although we plan to release a Domain Specific Language for convenience. The above schema specifies the generation of 1000000 entities of type "person", each of which contains an integer attribute. This attribute is generated with the property generator "org.dama.datasynth.common.generators.property.empirical.IntGenerator".

For now, a Property Generator is a class responsible for generating the values of an attribute for a given entity. The "initParameters" field specifies the parameters required to initialize the generator, together with their types. In this case, we pass a pointer to a file containing the distribution of the integer values to generate.
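
Note the "value:Type" encoding used in "initParameters": each entry bundles a value together with the type it should be parsed as. The following is a minimal sketch of how such an entry could be decoded; the `decode` helper and the `InitParam` types are illustrative assumptions, not part of DataSynth's actual API:

```scala
// Hypothetical decoder for "value:Type"-encoded entries such as
// "file://./src/main/resources/distributions/intDistribution.txt:File" or " :String".
// Assumes every entry contains at least one ':'.
sealed trait InitParam
case class FileParam(uri: String) extends InitParam
case class StringParam(s: String) extends InitParam

def decode(entry: String): InitParam = {
  // Split on the *last* ':' so the colon inside "file://" stays intact.
  val idx = entry.lastIndexOf(':')
  val (value, tpe) = (entry.substring(0, idx), entry.substring(idx + 1))
  tpe match {
    case "File"   => FileParam(value)
    case "String" => StringParam(value)
    case other    => throw new IllegalArgumentException(s"Unknown parameter type: $other")
  }
}
```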

The schema also specifies the generation of an edge type named "knows", which connects pairs of persons. The edges are generated with a Structure Generator, which is responsible for generating the graph connecting the nodes. In this case, we use a BTER graph generator, which takes the degree distribution and the average clustering coefficient per degree as parameters.
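
For reference, the same structure generator is driven programmatically in BTERGeneratorTest (see its diff further down). Condensed, and with placeholder input paths, the call looks roughly like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.dama.datasynth.common.utils
import org.dama.datasynth.common.generators.structure.BTERGenerator

// Build a BTER generator from a degree-distribution file and a per-degree
// clustering-coefficient file, then generate edges for 1000000 nodes
// into the given HDFS output path (as done in BTERGeneratorTest below).
val bterGenerator = new BTERGenerator(
  utils.FileUtils.File("file:///absolute/path/to/degrees/dblp"),
  utils.FileUtils.File("file:///absolute/path/to/ccs/dblp"))
bterGenerator.run(1000000, new Configuration(), "hdfs:///tmp/bterEdges")
```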

## Contributing

Feel free to contribute to the project by issuing pull requests and suggestions
src/main/resources/examples/example.json

+29

@@ -0,0 +1,29 @@
{
  "nodeTypes" : [
    {
      "name" : "person",
      "instances" : 1000000,
      "properties" : [
        {
          "name" : "attribute1",
          "dataType" : "Int",
          "generator" : {
            "name" : "org.dama.datasynth.common.generators.property.empirical.IntGenerator",
            "dependencies" : [],
            "initParameters" : ["file://./src/main/resources/distributions/intDistribution.txt:File", " :String"]
          }
        }
      ]
    }
  ],
  "edgeTypes" : [
    {
      "name" : "knows",
      "source" : "person",
      "target" : "person",
      "structure" : {
        "name" : "org.dama.datasynth.common.generators.structure.BTERGenerator",
        "initParameters" : ["file://./src/main/resources/degrees/dblp:File", "file://./src/main/resources/ccs/dblp:File"]
      }
    }
  ]
}

src/main/scala/org/dama/datasynth/DataSynth.scala

+1
@@ -17,6 +17,7 @@ object DataSynth {
 
   def main( args : Array[String] ) {
     val dataSynthConfig = DataSynthConfig(args.toList)
+    DataSynthConfig.validateConfig(dataSynthConfig)
     val json : String = File(dataSynthConfig.schemaFile)
       .open()
       .toList
src/main/scala/org/dama/datasynth/DataSynthConfig.scala

+12-2
@@ -47,10 +47,20 @@ object DataSynthConfig {
       case Nil => currentConfig
     }
   }
+
+  def validateConfig( config : DataSynthConfig ) = {
+    if(config.outputDir.equals("")){
+      throw new RuntimeException(s"Output dir not specified. Use --output-dir <path> option")
+    }
+
+    if(config.schemaFile.equals("")){
+      throw new RuntimeException(s"Schema file not specified. Use --schema-file <path> option")
+    }
+  }
 }
 
-class DataSynthConfig ( val outputDir : String = "file://./datasynth",
-                        val schemaFile : String = "file://./schema.json",
+class DataSynthConfig ( val outputDir : String = "",
+                        val schemaFile : String = "",
                         val masterWorkspaceDir : String = "file:///tmp",
                         val datasynthWorkspaceDir : String = "file:///tmp")
 {
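
Since both defaults are now empty strings, a configuration missing either option fails fast. A minimal sketch of the new behaviour, assuming the `setOutputDir`/`schemaFile` builder methods that appear in the test diffs below:

```scala
// Output dir set, schema file left at its new empty default:
val incomplete = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
// Throws: RuntimeException("Schema file not specified. Use --schema-file <path> option")
DataSynthConfig.validateConfig(incomplete)
```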

src/main/scala/org/dama/datasynth/common/utils/FileUtils.scala

+12
@@ -14,6 +14,7 @@ object FileUtils {
 
   val hdfsRegex : Regex = "hdfs://(.*)".r
   val fileRegex : Regex = "file://(.*)".r
+  val validRegex : Regex = "(file://|hdfs://)(/.*)".r
 
   def removePrefix( filename : String ) : String = {
     filename match {
@@ -37,6 +38,17 @@ object FileUtils {
     }
   }
 
+  def validateUri( filename : String ) = {
+    filename match {
+      case validRegex(_,filename) => Unit
+      case _ => {
+        throw new RuntimeException(s"Invalid URI: ${filename}. URIs must be " +
+          s"prefixed with either file:// " +
+          s"or hdfs:// and be an absolute path")
+      }
+    }
+  }
+
 
 case class File( filename : String ) {
   def open() : Iterator[String] = {
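
The new `validRegex` accepts only absolute paths behind a `file://` or `hdfs://` scheme. A small standalone check of the pattern (a sketch mirroring `validRegex`, not part of the project's test suite):

```scala
// Mirrors FileUtils.validRegex to show which URIs validateUri accepts.
val validRegex = "(file://|hdfs://)(/.*)".r
def isValid(uri: String): Boolean = validRegex.pattern.matcher(uri).matches()

assert(isValid("file:///tmp/datasynth"))   // absolute local path: accepted
assert(isValid("hdfs:///tmp/bterEdges"))   // absolute HDFS path: accepted
assert(!isValid("file://./datasynth"))     // relative path: rejected
assert(!isValid("/tmp/datasynth"))         // missing scheme: rejected
```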

src/main/scala/org/dama/datasynth/runtime/spark/SparkRuntime.scala

+1-1
@@ -102,7 +102,7 @@ object SparkRuntime{
     val hdfsMaster = sparkSession.sparkContext.hadoopConfiguration.get("fs.default.name")
     val prefix = config.outputDir match {
       case path : String if FileUtils.isHDFS(path) => hdfsMaster +"/"+FileUtils.removePrefix(path)
-      case path : String if FileUtils.isLocal(path) => path
+      case path : String if FileUtils.isLocal(path) => FileUtils.removePrefix(path)
     }
     modifiedExecutionPlan.foreach(table =>
       fetchTableOperator(table).write.csv(prefix+"/"+table.name)
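
In practice, a local output dir such as file:///tmp/datasynth now reaches Spark's CSV writer as the bare path /tmp/datasynth, mirroring how the HDFS branch already strips the prefix before prepending the namenode address.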

src/test/scala/org/dama/datasynth/DataSynthTest.scala

+7-5
@@ -22,17 +22,19 @@ class DataSynthTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   SparkSession.builder().master("local[*]").getOrCreate()
 
-  val testFolder = new File("./test")
-  val dataFolder = new File("./test/data")
-  val masterWorkspaceFolder = new File("./test/workspace")
-  val datasynthWorkspaceFolder = new File("./test/workspace")
+  val currentDirectory = new java.io.File(".").getCanonicalPath
+  val testFolder = new File(currentDirectory+"/test")
+  val dataFolder = new File(currentDirectory+"/test/data")
+  val masterWorkspaceFolder = new File(currentDirectory+"/test/workspace")
+  val datasynthWorkspaceFolder = new File(currentDirectory+"/test/workspace")
   testFolder.mkdir()
   dataFolder.mkdir()
   masterWorkspaceFolder.mkdir()
   val result = Try(DataSynth.main(List("--output-dir", "file://"+dataFolder.getAbsolutePath,
     "--master-workspace-dir", "file://"+masterWorkspaceFolder.getAbsolutePath,
     "--datasynth-workspace-dir", "file://"+datasynthWorkspaceFolder.getAbsolutePath,
-    "--schema-file", "file://./src/test/resources/test.json").toArray))
+    "--schema-file", "file://"+currentDirectory+"/src/test/resources/test.json")
+    .toArray))
   FileUtils.deleteDirectory(testFolder)
   result match {
     case Success(_) => Unit

src/test/scala/org/dama/datasynth/common/generators/structure/BTERGeneratorTest.scala

+4-2
@@ -15,8 +15,10 @@ import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
 class BTERGeneratorTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   "BTERGenerator " should "not crash and produce a graph in /tmp/bterEdges" in {
-    val bterGenerator = new BTERGenerator(utils.FileUtils.File("file://./src/main/resources/degrees/dblp"),
-      utils.FileUtils.File("file://./src/main/resources/ccs/dblp"));
+
+    val currentDirectory = new java.io.File(".").getCanonicalPath
+    val bterGenerator = new BTERGenerator(utils.FileUtils.File("file://"+currentDirectory+"/src/main/resources/degrees/dblp"),
+      utils.FileUtils.File("file://"+currentDirectory+"/src/main/resources/ccs/dblp"));
     bterGenerator.run(1000000, new Configuration(), "hdfs:///tmp/bterEdges")
     val fileSystem = FileSystem.get(new Configuration())
     fileSystem.exists(new Path("/tmp/bterEdges")) should be (true)

src/test/scala/org/dama/datasynth/runtime/spark/SparkRuntimeTest.scala

+2
@@ -18,6 +18,8 @@ import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}
 class SparkRuntimeTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   val config = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
+    .schemaFile("file:///tmp/fake.json")
+  DataSynthConfig.validateConfig(config)
 
   " A boolean table " should " contain all true " in {
     SparkSession.builder().master("local[*]").getOrCreate()

src/test/scala/org/dama/datasynth/runtime/spark/operators/OperatorsTest.scala

+8-5
@@ -15,7 +15,9 @@ import org.scalatest.{FlatSpec, Matchers}
 @RunWith(classOf[JUnitRunner])
 class OperatorsTest extends FlatSpec with Matchers {
 
-  val config = DataSynthConfig().setOutputDir("/tmp/datasynth")
+  val config = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
+    .schemaFile("file:///tmp/fake.json")
+  DataSynthConfig.validateConfig(config)
 
   " An TableSizeOperator on a table of size 1000" should " should return 1000 " in {
     SparkSession.builder().master("local[*]").getOrCreate()
@@ -41,8 +43,8 @@ class OperatorsTest extends FlatSpec with Matchers {
 
   " An InstantiateGraphGeneratorOperator" should "return an instance of a property generator " in {
     SparkSession.builder().master("local[*]").getOrCreate()
-    val file1 = ExecutionPlan.File("path/to/file")
-    val file2 = ExecutionPlan.File("path/to/file")
+    val file1 = ExecutionPlan.File("file:///path/to/file")
+    val file2 = ExecutionPlan.File("file:///path/to/file")
     val structureGeneratorNode = ExecutionPlan.StructureGenerator("org.dama.datasynth.common.generators.structure.BTERGenerator",Seq(file1, file2))
     SparkRuntime.start(config)
     val generator = SparkRuntime.instantiateStructureGeneratorOperator(structureGeneratorNode)
@@ -61,8 +63,9 @@ class OperatorsTest extends FlatSpec with Matchers {
   }
   "A FetchTableOperator2" should "return a Dataset when fetching a table (either property or edge)" in {
     SparkSession.builder().master("local[*]").getOrCreate()
-    val file1 = ExecutionPlan.File("file://./src/main/resources/degrees/dblp")
-    val file2 = ExecutionPlan.File("file://./src/main/resources/ccs/dblp")
+    val currentDirectory = new java.io.File(".").getCanonicalPath
+    val file1 = ExecutionPlan.File("file://"+currentDirectory+"/src/main/resources/degrees/dblp")
+    val file2 = ExecutionPlan.File("file://"+currentDirectory+"/src/main/resources/ccs/dblp")
     val structureGenerator = ExecutionPlan.StructureGenerator("org.dama.datasynth.common.generators.structure.BTERGenerator",Seq(file1, file2))
     val size = ExecutionPlan.StaticValue[Long](1000)
     val createEdgeTable = EdgeTable("edges",structureGenerator,size)
