Neo4j data integration and build pipeline - https://github.com/elswob/neo4j-build-pipeline
This pipeline originated from the work done to create the graph for EpiGraphDB. With over 20 separate data sets, >10 node types and >40 relationship types we needed to create a pipeline that could make the process relatively easy for others in the group to contribute. By combining Snakemake, Docker, Neo4j and GitHub Actions we have created a pipeline that can create a fully tested Neo4j graph database from raw data.
One of the main aims of this pipeline was performance. Initial efforts used the LOAD CSV
method, but quickly became slow as the size and complexity of the graph increased. Here we focus on creating clean data sets that can be loaded using the neo4j-import
tool, bringing build time down from hours to minutes.
Components of interest:
- Pipeline can be used to prepare raw data, create files for graph, or build graph.
- A defined schema is used to QC all data before loading.
- Merging multiple data sets into a single node type is handled automatically.
- Use
neo4j-import
to build the graph
Note:
- This is not a fully tested pipeline, there are known issues with Docker and Neo4j that need careful consideration.
Install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
Docker (17.06.0-ce) - https://docs.docker.com/get-docker/ Docker Compose (v1.27.4) - https://docs.docker.com/compose/install/
For linux distributions this should be ok, but for mac, may need to install coreutils
brew install coreutils
The following will run the demo data and create a basic graph
#clone the repo (use https if necessary)
git clone [email protected]:elswob/neo4j-build-pipeline.git
cd neo4j-build-pipeline
#create the conda environment
conda env create -f environment.yml
conda activate neo4j_build
#create a basic environment variable file for demo data - this probably requires some edits, but may work as is
cp example.env .env
#run the pipeline
snakemake -r all -j 4
If just testing, simply clone the repo git clone [email protected]:elswob/neo4j-build-pipeline.git
and skip straight to Create conda environment
If creating a new pipeline and graph based on this repo there are two options:
- Fork the repo and skip straight to Create conda environment
- Create a copy of the repo, see below:
- follows method from here - https://github.com/manubot/rootstock/blob/master/SETUP.md#configuration
Create an empty GitHub repository at https://github.com/new.
Make a note of user and repo name
OWNER=xxx
REPO=abc
git clone [email protected]:elswob/neo4j-build-pipeline.git
cd neo4j-build-pipeline
Set the origin URL to match repo created above
git remote set-url origin https://github.com/$OWNER/$REPO.git
git remote set-url origin [email protected]:$OWNER/$REPO.git
Push to new repo
git push --set-upstream origin main
conda env create -f environment.yml
conda activate neo4j_build
snakemake -r clean_all -j 1
snakemake -r check_new_data -j 4
A complete run of the pipeline will create a Neo4j graph within a Docker container, on the machine running the pipeline. The variables that are used for that are defined in the .env
file.
Copy example.env
to .env
and edit
cp example.env .env
- Modify this
- No spaces in paths please :)
- Use absolute/relative paths where stated
- If using remote server for raw data and backups, set SERVER_NAME and set up SSH keys Remote Server
### Data integration variables
#version of graph being built
GRAPH_VERSION=0.0.1
#location of snakemake logs (relative or absolute)
SNAKEMAKE_LOGS=demo/results/logs
#neo4j directories (absolute)
NEO4J_IMPORT_DIR=./demo/neo4j/0.0.1/import
NEO4J_DATA_DIR=./demo/neo4j/0.0.1/data
NEO4J_LOG_DIR=./demo/neo4j/0.0.1/logs
#path to directory containing source data (absolute)
DATA_DIR=demo/source_data
#path to directory containing data processing script directories and code (relative)
PROCESSING_DIR=demo/scripts/processing
#path to directory for graph data backups (relative or absolute)
GRAPH_DIR=demo/results/graph_data
#path to config (relative or absolute)
CONFIG_PATH=demo/config
#name of server if source data is on a remote machine, not needed if all data are local
#SERVER_NAME=None
#number of threads to use for parallel parts
THREADS=10
############################################################################################################
#### Docker things for building graph, ignore if not using
# GRAPH_CONTAINER_NAME:
# Used in docker-compose and snakefile to
# assign container name to the db service to use docker exec
GRAPH_CONTAINER_NAME=neo4j-pipeline-demo-graph
#Neo4j server address (this will be the server running the pipeline and be used to populate the Neo4j web server conf)
NEO4J_ADDRESS=neo4j.server.com
# Neo4j connection
GRAPH_USER=neo4j
GRAPH_HOST=localhost
GRAPH_PASSWORD=changeme
GRAPH_HTTP_PORT=27474
GRAPH_BOLT_PORT=27687
GRAPH_HTTPS_PORT=27473
# Neo4j memory
# Set these to something suitable, for testing the small example data 1G should be fine. For anything bigger, see https://neo4j.com/developer/kb/how-to-estimate-initial-memory-configuration/
GRAPH_HEAP_INITIAL=1G
GRAPH_PAGECACHE=1G
GRAPH_HEAP_MAX=2G
snakemake -r all -j 4
You should then be able to explore the graph via Neo4j browser by visiting the URL of the server hosting the graph plus the GRAPH_HTTP_PORT
number specified, e.g. localhost:27474
. Here you can login with the following
- Connect URL =
bolt://
name_of_server
:GRAPH_BOLT_PORT from .env
- Authentication type =
Username/Password
- Username =
GRAPH_USER from .env
- Password =
GRAPH_PASSWORD from .env
Because neo4j-admin
is expecting arrays in a particlar format, all arrays need to be separated by ;
and have no surrounding []
or quotes. To help with this, there is a function create_neo4j_array_from_array
in workflow.scripts.utils.general
.
Old version of docker-compose, just pip install a new one :)
pip install --user docker-compose
Due to issues with Neo4j 4.* and Docker, need to manually create Neo4j directories before building the graph. If this is not done, Docker will create the Neo4j directories and make them unreadable.
- this happens during Snakemake
create_graph
rule viaworkflow.scripts.graph_build.create_neo4j
. - to run this manually
python -m workflow.scripts.graph_build.create_neo4j
If this error:
Bind for 0.0.0.0:xxx failed: port is already allocated
Then need to change ports in .env
as they are already being used by another container
GRAPH_HTTP_PORT=17474
GRAPH_BOLT_PORT=17687
GRAPH_HTTPS_PORT=17473
When building the graph, if the user is not part of the docker group may see an error like this
Starting database...
Creating Neo4j graph directories
.....
PermissionError: [Errno 13] Permission denied
To fix this, need to be added to docker group
https://docs.docker.com/engine/install/linux-postinstall/
sudo usermod -aG docker $USER
If connections result in the following:
The client is unauthorized due to authentication failure.
There may be an issue with authentication.
First
- check password used, make sure it doesn't contain any special characters such as
#
. If so, change the password, reload .env file, then rebuild:
export $(cat .env | sed 's/#.*//g' | xargs);
snakemake -r clean_all -j1
snakemake -r all -j1
Second
- it is possible to reset a password
- https://neo4j.com/docs/operations-manual/4.0/configuration/password-and-user-recovery/
- Get variables from .env file
export $(cat .env | sed 's/#.*//g' | xargs)
- Disable auth in
docker-compose.yml
- NEO4J_dbms_security_auth__enabled=false
Restart container
docker-compose down
docker-compose up -d
- Reset the password
docker exec -it $GRAPH_CONTAINER_NAME cypher-shell -a localhost:$GRAPH_BOLT_PORT -d system
ALTER USER neo4j SET PASSWORD 'changeme';
- Enable auth in
docker-compose.yml
- NEO4J_dbms_security_auth__enabled=true
Restart container
docker-compose down
docker-compose up -d
- Check connection
If this has worked, you will be asked for username and password, and connection will succeed.
docker exec -it $GRAPH_CONTAINER_NAME cypher-shell -a localhost:$GRAPH_BOLT_PORT -d system
snakemake -r backup_graph -j1
On production server, create data
directory
mkdir data
chmod 777 data
Move dump into data
mv neo4j ./data
Start container
docker-compose -f docker-compose-public.yml up -d
Stop neo4j but keep container open
public_container=db-public
docker exec -it $public_container cypher-shell -a neo4j://localhost:1234 -d system "stop database neo4j;"
Restore the backup
docker exec -it $public_container bin/neo4j-admin restore --from data/neo4j --verbose --force
Restart the database
docker exec -it $public_container cypher-shell -a neo4j://localhost:1234 -d system "start database neo4j;"
Again, based on logic from here - https://github.com/manubot/rootstock/blob/master/SETUP.md#merging-upstream-rootstock-changes
#checkout new branch
git checkout -b nbp-$(date '+%Y-%m-%d')
Pull new commits from neo4j-build-pipeline
#if remote not set
git config remote.neo4j-build-pipeline.url || git remote add neo4j-build-pipeline https://github.com/elswob/neo4j-build-pipeline.git
#pull new commits
git pull --no-ff --no-rebase --no-commit neo4j-build-pipeline main
If no problems, commit new updates
git commit -am 'merging upstream changes'
git push origin nbp-$(date '+%Y-%m-%d')
Then open a pull request
https://snakemake.readthedocs.io/en/v5.1.4/executable.html#visualization
snakemake -r all --rulegraph | dot -Tpdf > rulegraph.pdf
snakemake -r all --dag | dot -Tpdf > dag.pdf
https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html
Run this after the workflow has finished
snakemake --report report.html