A PySpark-based codebase for fetching and formatting metadata from a LIMS database for the Imperial Genomics Facility (IGF)
- Step 1: Get Miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
- Step 2: Clone git repo
git clone https://github.com/imperial-genomics-facility/LimsMetadataParsing.git
- Step 3: Install conda env from the environment.yml file
conda env create -n ENV_NAME --file environment.yml
- Step 4: Create egg file for LimsMetadataParsing repo
python setup.py bdist_egg
Download UCanAccess from the following link and unzip the contents
parseAccessDbForMetadata.py [-h] -a ACCESS_DB_PATH -q QUOTE_FILE_PATH -o OUTPUT_PATH -k KNOWN_PROJECTS_LIST -j UCANACCESS_JAR_PATH
optional arguments:
  -h, --help            show this help message and exit
  -a ACCESS_DB_PATH, --access_db_path ACCESS_DB_PATH
                        Path to Access LIMS db
  -q QUOTE_FILE_PATH, --quote_file_path QUOTE_FILE_PATH
                        Path to quote xls file
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        Output dir path for metadata files
  -k KNOWN_PROJECTS_LIST, --known_projects_list KNOWN_PROJECTS_LIST
                        File containing list of known projects
  -j UCANACCESS_JAR_PATH, --ucanaccess_jar_path UCANACCESS_JAR_PATH
                        Path to ucanaccess jar files
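
The option list above maps directly onto an `argparse` definition. The following is a minimal sketch reconstructed from the help text alone, not the parser shipped in the repository; the function name `build_parser` and the placeholder paths are assumptions made for illustration. All five path options are marked required because the usage line shows them without square brackets.

```python
import argparse


def build_parser():
    """Return a parser mirroring the options listed in the help text.

    Hypothetical reconstruction; the real script may differ in detail.
    """
    parser = argparse.ArgumentParser(prog="parseAccessDbForMetadata.py")
    parser.add_argument("-a", "--access_db_path", required=True,
                        help="Path to Access LIMS db")
    parser.add_argument("-q", "--quote_file_path", required=True,
                        help="Path to quote xls file")
    parser.add_argument("-o", "--output_path", required=True,
                        help="Output dir path for metadata files")
    parser.add_argument("-k", "--known_projects_list", required=True,
                        help="File containing list of known projects")
    parser.add_argument("-j", "--ucanaccess_jar_path", required=True,
                        help="Path to ucanaccess jar files")
    return parser


# Parse an explicit list of placeholder paths instead of sys.argv,
# so the sketch can be run on its own.
example_args = build_parser().parse_args([
    "-a", "/path/Database.accdb",
    "-q", "/path/Quotes.xlsx",
    "-o", "/path/csv_dir",
    "-k", "/path/project_list.csv",
    "-j", "/path/UCanAccess-4.0.4-bin",
])
print(example_args.access_db_path)
```

Omitting any of the five path options makes `argparse` print the usage line and exit with an error, which is a quick way to check a command before handing it to Spark.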
spark-submit \
--master local[NUMBER_OF_CPUS] \
--py-files /path/igfLimsParsing-0.0.1-py3.6.egg \
/path/LimsMetadataParsing/scripts/parseAccessDbForMetadata.py \
-a /path/Database.accdb \
-q /path/Quotes.xlsx \
-o /path/csv_dir \
-k /path/project_list.csv \
-j /path/UCanAccess-4.0.4-bin
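
When the job is launched from another Python script rather than typed by hand, the same command can be assembled as an argument list. This is a rough sketch under stated assumptions: the helper `build_spark_submit_command` is hypothetical and not part of the repository, and every path below is a placeholder copied from the example command.

```python
import shlex


def build_spark_submit_command(n_cpus, egg_path, script_path, access_db,
                               quote_file, output_dir, known_projects,
                               ucanaccess_dir):
    """Assemble the spark-submit argument list (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", "local[{0}]".format(n_cpus),  # local mode with n_cpus worker threads
        "--py-files", egg_path,                   # egg built with `python setup.py bdist_egg`
        script_path,
        "-a", access_db,
        "-q", quote_file,
        "-o", output_dir,
        "-k", known_projects,
        "-j", ucanaccess_dir,
    ]


command = build_spark_submit_command(
    n_cpus=4,
    egg_path="/path/igfLimsParsing-0.0.1-py3.6.egg",
    script_path="/path/LimsMetadataParsing/scripts/parseAccessDbForMetadata.py",
    access_db="/path/Database.accdb",
    quote_file="/path/Quotes.xlsx",
    output_dir="/path/csv_dir",
    known_projects="/path/project_list.csv",
    ucanaccess_dir="/path/UCanAccess-4.0.4-bin",
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

Passing `command` to `subprocess.run(command, check=True)` would start the job. Because the arguments are given as a list, no shell is involved and `local[4]` needs no quoting.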