sparkwdsub

Spark processing of wikidata subsets using Shape Expressions.

This repo contains an example script that processes Wikidata subsets using Shape Expressions.

The algorithm used to validate schemas in parallel has been ported to its own repo at pschema

Building and running

The system requires sbt to be built. Once you download and install it. You only need to run:

sbt assembly

which will generate a flat jar with all dependencies in folder: target/scala-2.12/ called: sparkwdsub.jar.

Using a local cluster

In order to run the system locally, you need to download and install Apache Spark. You should have the executable spark-submit accessible in your path.

Once you have Spark installed and the sparkwdsub.jar generated you can run with the following example:

spark-submit --class "es.weso.wdsub.spark.Main" --master local[4] target/scala-2.12/sparkwdsub.jar -d examples/sample-dump1.json.gz  -m cluster -n testCities -s examples/cities.shex -k -o target/cities

which will generate a folder target/cities with information about the extracted dump.

Using AWS

In AWS, you can do the following steps:

Upload a Wikidata dump to Amazon S3. Dumps from wikidata are available here
Create a cluster in Amazon EMR. Inside the cluster configure 2 steps: one step to run the spark-submit and another one to fail after the run finishes. The step to run sparkwdsub has the following configuration in our system:

spark-submit --deploy-mode client --class es.weso.wdsub.spark.Main s3://weso/projects/wdsub/sparkwdsub.jar --mode cluster --name sparkwdsubLabra01 --dump s3://weso/datasets/wikidata/dump_20151116.json --schema s3://weso/projects/wdsub/author.shex --out s3://weso/projects/wdsub/

Using Google Cloud

Instructions pending...

Command line

Usage: sparkwdsub dump --schema <file> [--out <file>] [--site <string>] [--maxIterations <integer>] [--verbose] [--loggingLevel <string>] <dumpFile>
 Process example dump file.
 Options and flags:
     --help
         Display this help text.
     --schema <file>, -s <file>
         schema path
     --out <file>, -o <file>
         output path
     --site <string>
         Base url, default =http://www.wikidata.org/entity
     --maxIterations <integer>
        Max iterations for Pregel algorithm, default =20
     --verbose
         Verbose mode
     --loggingLevel <string>
         Logging level (ERROR, WARN, INFO), default=ERROR

Example:

sparkwdsub dump -s examples/cities.shex examples/6lines.json

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.github/workflows		.github/workflows
examples		examples
project		project
sbt		sbt
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt
default.properties		default.properties
version.sbt		version.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sparkwdsub

Building and running

Using a local cluster

Using AWS

Using Google Cloud

Command line

About

Releases 1

Packages

Contributors 4

Languages

License

weso/sparkwdsub

Folders and files

Latest commit

History

Repository files navigation

sparkwdsub

Building and running

Using a local cluster

Using AWS

Using Google Cloud

Command line

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 4

Languages

Packages