
Commit b000de6

Master fix (#1)
* Fixed the code to allow the execution of DataSynth in local mode using the spark-submit script
* Update README.md
1 parent 44fd589 commit b000de6

File tree

10 files changed: +130 -15 lines changed


README.md

+54
@@ -15,6 +15,60 @@ The core idea of DataSynth was first described in detail in the paper:

Arnau Prat-Pérez, Joan Guisado-Gámez, Xavier Fernández Salas, Petr Koupy, Siegfried Depner, Davide Basilio Bartolini

## Installing

We use Maven as our build tool. To compile the project, just type the following command in the project's root folder:
```
mvn -DskipTests assembly:assembly
```
Additionally, DataSynth requires a working installation of [Apache Spark](http://spark.apache.org) 2.0.1, compiled for your Hadoop version.

## Running DataSynth

DataSynth uses Apache Spark to perform the generation of data. As a Spark application, it is executed using the spark-submit script provided by Spark. From DataSynth's root folder, execute the following command:
```
$SPARK_HOME/bin/spark-submit -v --master local[*] --class org.dama.datasynth.DataSynth target/datasynth-1.0-SNAPSHOT-jar-with-dependencies.jar --schema-file file://./src/main/resources/examples/example.json --output-dir file://./datasynth
```
The <kbd>--output-dir</kbd> option specifies the folder where the generated dataset will be placed, while <kbd>--schema-file</kbd> specifies the schema of the graph to generate. Prefixing paths with "file://" or "hdfs://" is required. The example.json schema file defines the following schema:

```json
{
  "nodeTypes" : [
    {
      "name" : "person",
      "instances" : 1000000,
      "properties" : [
        {
          "name" : "attribute1",
          "dataType" : "Int",
          "generator" : {
            "name" : "org.dama.datasynth.common.generators.property.empirical.IntGenerator",
            "dependencies" : [],
            "initParameters" : ["file://./src/main/resources/distributions/intDistribution.txt:File", " :String"]
          }
        }
      ]
    }
  ],
  "edgeTypes" : [
    {
      "name" : "knows",
      "source" : "person",
      "target" : "person",
      "structure" : {
        "name" : "org.dama.datasynth.common.generators.structure.BTERGenerator",
        "initParameters" : ["file://./src/main/resources/degrees/dblp:File", "file://./src/main/resources/ccs/dblp:File"]
      }
    }
  ]
}
```

Currently, the schema is specified in rather low-level JSON, although we plan to release a Domain Specific Language for convenience. The above schema specifies the generation of 1000000 entities of type "person", each of which contains an integer attribute. This attribute is generated with the property generator "org.dama.datasynth.common.generators.property.empirical.IntGenerator".

For now, a Property Generator is a class responsible for generating the values of an attribute for a given entity. The "initParameters" field specifies the parameters required to initialize the generator, together with their types. In this case, we pass a pointer to a file containing the distribution of the integer values to generate.
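
Note the "value:Type" encoding used in "initParameters": each entry bundles a value together with the type it should be parsed as. The following is a minimal sketch of how such an entry could be decoded; the `decode` helper and the `InitParam` types are illustrative assumptions, not part of DataSynth's actual API:

```scala
// Hypothetical decoder for "value:Type"-encoded entries such as
// "file://./src/main/resources/distributions/intDistribution.txt:File" or " :String".
// Assumes every entry contains at least one ':'.
sealed trait InitParam
case class FileParam(uri: String) extends InitParam
case class StringParam(s: String) extends InitParam

def decode(entry: String): InitParam = {
  // Split on the *last* ':' so the colon inside "file://" stays intact.
  val idx = entry.lastIndexOf(':')
  val (value, tpe) = (entry.substring(0, idx), entry.substring(idx + 1))
  tpe match {
    case "File"   => FileParam(value)
    case "String" => StringParam(value)
    case other    => throw new IllegalArgumentException(s"Unknown parameter type: $other")
  }
}
```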

The schema also specifies the generation of an edge type named "knows", which connects pairs of persons. The edges are generated with a Structure Generator, which is responsible for generating the graph connecting the nodes. In this case, we use a BTER graph generator, which takes the degree distribution and the average clustering coefficient per degree as parameters.
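
For reference, the same structure generator is driven programmatically in BTERGeneratorTest (see its diff further down). Condensed, and with placeholder input paths, the call looks roughly like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.dama.datasynth.common.utils
import org.dama.datasynth.common.generators.structure.BTERGenerator

// Build a BTER generator from a degree-distribution file and a per-degree
// clustering-coefficient file, then generate edges for 1000000 nodes
// into the given HDFS output path (as done in BTERGeneratorTest below).
val bterGenerator = new BTERGenerator(
  utils.FileUtils.File("file:///absolute/path/to/degrees/dblp"),
  utils.FileUtils.File("file:///absolute/path/to/ccs/dblp"))
bterGenerator.run(1000000, new Configuration(), "hdfs:///tmp/bterEdges")
```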

## Contributing

Feel free to contribute to the project by issuing pull requests and suggestions
src/main/resources/examples/example.json

+29

@@ -0,0 +1,29 @@
{
  "nodeTypes" : [
    {
      "name" : "person",
      "instances" : 1000000,
      "properties" : [
        {
          "name" : "attribute1",
          "dataType" : "Int",
          "generator" : {
            "name" : "org.dama.datasynth.common.generators.property.empirical.IntGenerator",
            "dependencies" : [],
            "initParameters" : ["file://./src/main/resources/distributions/intDistribution.txt:File", " :String"]
          }
        }
      ]
    }
  ],
  "edgeTypes" : [
    {
      "name" : "knows",
      "source" : "person",
      "target" : "person",
      "structure" : {
        "name" : "org.dama.datasynth.common.generators.structure.BTERGenerator",
        "initParameters" : ["file://./src/main/resources/degrees/dblp:File", "file://./src/main/resources/ccs/dblp:File"]
      }
    }
  ]
}

src/main/scala/org/dama/datasynth/DataSynth.scala

+1
@@ -17,6 +17,7 @@ object DataSynth {
 
   def main( args : Array[String] ) {
     val dataSynthConfig = DataSynthConfig(args.toList)
+    DataSynthConfig.validateConfig(dataSynthConfig)
     val json : String = File(dataSynthConfig.schemaFile)
       .open()
       .toList
src/main/scala/org/dama/datasynth/DataSynthConfig.scala

+12-2
@@ -47,10 +47,20 @@ object DataSynthConfig {
       case Nil => currentConfig
     }
   }
+
+  def validateConfig( config : DataSynthConfig ) = {
+    if(config.outputDir.equals("")){
+      throw new RuntimeException(s"Output dir not specified. Use --output-dir <path> option")
+    }
+
+    if(config.schemaFile.equals("")){
+      throw new RuntimeException(s"Schema file not specified. Use --schema-file <path> option")
+    }
+  }
 }
 
-class DataSynthConfig ( val outputDir : String = "file://./datasynth",
-                        val schemaFile : String = "file://./schema.json",
+class DataSynthConfig ( val outputDir : String = "",
+                        val schemaFile : String = "",
                         val masterWorkspaceDir : String = "file:///tmp",
                         val datasynthWorkspaceDir : String = "file:///tmp")
 {
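
Since both defaults are now empty strings, a configuration missing either option fails fast. A minimal sketch of the new behaviour, assuming the `setOutputDir`/`schemaFile` builder methods that appear in the test diffs below:

```scala
// Output dir set, schema file left at its new empty default:
val incomplete = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
// Throws: RuntimeException("Schema file not specified. Use --schema-file <path> option")
DataSynthConfig.validateConfig(incomplete)
```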

src/main/scala/org/dama/datasynth/common/utils/FileUtils.scala

+12
@@ -14,6 +14,7 @@ object FileUtils {
 
   val hdfsRegex : Regex = "hdfs://(.*)".r
   val fileRegex : Regex = "file://(.*)".r
+  val validRegex : Regex = "(file://|hdfs://)(/.*)".r
 
   def removePrefix( filename : String ) : String = {
     filename match {
@@ -37,6 +38,17 @@ object FileUtils {
     }
   }
 
+  def validateUri( filename : String ) = {
+    filename match {
+      case validRegex(_,filename) => Unit
+      case _ => {
+        throw new RuntimeException(s"Invalid URI: ${filename}. URIs must be " +
+          s"prefixed with either file:// " +
+          s"or hdfs:// and be an absolute path")
+      }
+    }
+  }
+
 
 case class File( filename : String ) {
   def open() : Iterator[String] = {
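
The new `validRegex` accepts only absolute paths behind a `file://` or `hdfs://` scheme. A small standalone check of the pattern (a sketch mirroring `validRegex`, not part of the project's test suite):

```scala
// Mirrors FileUtils.validRegex to show which URIs validateUri accepts.
val validRegex = "(file://|hdfs://)(/.*)".r
def isValid(uri: String): Boolean = validRegex.pattern.matcher(uri).matches()

assert(isValid("file:///tmp/datasynth"))   // absolute local path: accepted
assert(isValid("hdfs:///tmp/bterEdges"))   // absolute HDFS path: accepted
assert(!isValid("file://./datasynth"))     // relative path: rejected
assert(!isValid("/tmp/datasynth"))         // missing scheme: rejected
```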

src/main/scala/org/dama/datasynth/runtime/spark/SparkRuntime.scala

+1-1
@@ -102,7 +102,7 @@ object SparkRuntime{
     val hdfsMaster = sparkSession.sparkContext.hadoopConfiguration.get("fs.default.name")
     val prefix = config.outputDir match {
       case path : String if FileUtils.isHDFS(path) => hdfsMaster +"/"+FileUtils.removePrefix(path)
-      case path : String if FileUtils.isLocal(path) => path
+      case path : String if FileUtils.isLocal(path) => FileUtils.removePrefix(path)
     }
     modifiedExecutionPlan.foreach(table =>
       fetchTableOperator(table).write.csv(prefix+"/"+table.name)
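
In practice, a local output dir such as file:///tmp/datasynth now reaches Spark's CSV writer as the bare path /tmp/datasynth, mirroring how the HDFS branch already strips the prefix before prepending the namenode address.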

src/test/scala/org/dama/datasynth/DataSynthTest.scala

+7-5
@@ -22,17 +22,19 @@ class DataSynthTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   SparkSession.builder().master("local[*]").getOrCreate()
 
-  val testFolder = new File("./test")
-  val dataFolder = new File("./test/data")
-  val masterWorkspaceFolder = new File("./test/workspace")
-  val datasynthWorkspaceFolder = new File("./test/workspace")
+  val currentDirectory = new java.io.File(".").getCanonicalPath
+  val testFolder = new File(currentDirectory+"/test")
+  val dataFolder = new File(currentDirectory+"/test/data")
+  val masterWorkspaceFolder = new File(currentDirectory+"/test/workspace")
+  val datasynthWorkspaceFolder = new File(currentDirectory+"/test/workspace")
   testFolder.mkdir()
   dataFolder.mkdir()
   masterWorkspaceFolder.mkdir()
   val result = Try(DataSynth.main(List("--output-dir", "file://"+dataFolder.getAbsolutePath,
     "--master-workspace-dir", "file://"+masterWorkspaceFolder.getAbsolutePath,
     "--datasynth-workspace-dir", "file://"+datasynthWorkspaceFolder.getAbsolutePath,
-    "--schema-file", "file://./src/test/resources/test.json").toArray))
+    "--schema-file", "file://"+currentDirectory+"/src/test/resources/test.json")
+    .toArray))
   FileUtils.deleteDirectory(testFolder)
   result match {
     case Success(_) => Unit

src/test/scala/org/dama/datasynth/common/generators/structure/BTERGeneratorTest.scala

+4-2
@@ -15,8 +15,10 @@ import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
 class BTERGeneratorTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   "BTERGenerator " should "not crash and produce a graph in /tmp/bterEdges" in {
-    val bterGenerator = new BTERGenerator(utils.FileUtils.File("file://./src/main/resources/degrees/dblp"),
-      utils.FileUtils.File("file://./src/main/resources/ccs/dblp"));
+
+    val currentDirectory = new java.io.File(".").getCanonicalPath
+    val bterGenerator = new BTERGenerator(utils.FileUtils.File("file://"+currentDirectory+"/src/main/resources/degrees/dblp"),
+      utils.FileUtils.File("file://"+currentDirectory+"/src/main/resources/ccs/dblp"));
     bterGenerator.run(1000000, new Configuration(), "hdfs:///tmp/bterEdges")
     val fileSystem = FileSystem.get(new Configuration())
     fileSystem.exists(new Path("/tmp/bterEdges")) should be (true)

src/test/scala/org/dama/datasynth/runtime/spark/SparkRuntimeTest.scala

+2
@@ -18,6 +18,8 @@ import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}
 class SparkRuntimeTest extends FlatSpec with Matchers with BeforeAndAfterAll {
 
   val config = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
+    .schemaFile("file:///tmp/fake.json")
+  DataSynthConfig.validateConfig(config)
 
   " A boolean table " should " contain all true " in {
     SparkSession.builder().master("local[*]").getOrCreate()

src/test/scala/org/dama/datasynth/runtime/spark/operators/OperatorsTest.scala

+8-5
@@ -15,7 +15,9 @@ import org.scalatest.{FlatSpec, Matchers}
 @RunWith(classOf[JUnitRunner])
 class OperatorsTest extends FlatSpec with Matchers {
 
-  val config = DataSynthConfig().setOutputDir("/tmp/datasynth")
+  val config = DataSynthConfig().setOutputDir("file:///tmp/datasynth")
+    .schemaFile("file:///tmp/fake.json")
+  DataSynthConfig.validateConfig(config)
 
   " An TableSizeOperator on a table of size 1000" should " should return 1000 " in {
     SparkSession.builder().master("local[*]").getOrCreate()
@@ -41,8 +43,8 @@ class OperatorsTest extends FlatSpec with Matchers {
 
   " An InstantiateGraphGeneratorOperator" should "return an instance of a property generator " in {
     SparkSession.builder().master("local[*]").getOrCreate()
-    val file1 = ExecutionPlan.File("path/to/file")
-    val file2 = ExecutionPlan.File("path/to/file")
+    val file1 = ExecutionPlan.File("file:///path/to/file")
+    val file2 = ExecutionPlan.File("file:///path/to/file")
     val structureGeneratorNode = ExecutionPlan.StructureGenerator("org.dama.datasynth.common.generators.structure.BTERGenerator",Seq(file1, file2))
     SparkRuntime.start(config)
     val generator = SparkRuntime.instantiateStructureGeneratorOperator(structureGeneratorNode)
@@ -61,8 +63,9 @@ class OperatorsTest extends FlatSpec with Matchers {
   }
   "A FetchTableOperator2" should "return a Dataset when fetching a table (either property or edge)" in {
     SparkSession.builder().master("local[*]").getOrCreate()
-    val file1 = ExecutionPlan.File("file://./src/main/resources/degrees/dblp")
-    val file2 = ExecutionPlan.File("file://./src/main/resources/ccs/dblp")
+    val currentDirectory = new java.io.File(".").getCanonicalPath
+    val file1 = ExecutionPlan.File("file://"+currentDirectory+"/src/main/resources/degrees/dblp")
+    val file2 = ExecutionPlan.File("file://"+currentDirectory+"/src/main/resources/ccs/dblp")
     val structureGenerator = ExecutionPlan.StructureGenerator("org.dama.datasynth.common.generators.structure.BTERGenerator",Seq(file1, file2))
     val size = ExecutionPlan.StaticValue[Long](1000)
     val createEdgeTable = EdgeTable("edges",structureGenerator,size)
