WARC/0.18 File metadata parser and indexer.
- Python 3.6 (or greater)
- psycopg2 2.8.5
- pyyaml 5.1
- validators 0.18.2
- beautifulsoup4 4.9.3
This project extracts and indexes metadata from files in the WARC/0.18 format.
The input WARC/0.18 file is processed and its metadata is saved in a PostgreSQL database table.
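To illustrate the kind of metadata involved, here is a minimal sketch of reading one record's header block. It is not the code in warcparser.py: the function name `read_warc_headers` is hypothetical, and the sketch assumes each record starts with a `WARC/0.18` version line followed by `Name: value` header lines and a blank line.

```python
import io


def read_warc_headers(stream):
    """Read one record's header block from a binary stream.

    Returns (version, headers), or None when the stream is exhausted.
    Hypothetical helper for illustration; not part of warcparser.py.
    """
    # Skip any blank lines that separate consecutive records.
    line = stream.readline()
    while line and not line.strip():
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8", errors="replace")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    headers = {}
    for raw in iter(stream.readline, b""):
        text = raw.strip().decode("utf-8", errors="replace")
        if not text:
            break  # a blank line ends the header block
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = text.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return version, headers


# Made-up sample record, used only to exercise the sketch.
sample = (
    b"WARC/0.18\n"
    b"WARC-Type: response\n"
    b"WARC-Target-URI: http://example.com/\n"
    b"Content-Length: 5\n"
    b"\n"
    b"hello"
)
version, headers = read_warc_headers(io.BytesIO(sample))
```

For a compressed input, the same function could be given the stream returned by `gzip.open(path, "rb")`. A full parser would also use `Content-Length` to skip over the record body before reading the next header block.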
The first step is to configure the database connection in
config.yaml.
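As a rough sketch of how such a configuration might be consumed, the snippet below loads a YAML file and opens a connection with psycopg2. The layout is an assumption: the `database` section and its key names are hypothetical, so check the project's own config.yaml for the keys it actually uses.

```python
# Hypothetical key names; the project's config.yaml may use different ones.
ALLOWED_KEYS = ("host", "port", "dbname", "user", "password")


def connection_kwargs(config):
    """Pick the connection settings out of a parsed config mapping.

    Assumes the settings live under a top-level "database" key.
    """
    db = config["database"]
    return {key: db[key] for key in ALLOWED_KEYS if key in db}


def connect_from_config(path):
    """Load a YAML config file and open a PostgreSQL connection."""
    # Imported here so the helper above works without the third-party packages.
    import yaml       # pyyaml
    import psycopg2

    with open(path) as handle:
        config = yaml.safe_load(handle)
    return psycopg2.connect(**connection_kwargs(config))
```

Passing the settings as keyword arguments avoids building a connection string by hand, so values containing spaces or quotes need no escaping.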
Inside the database, create the table specified in the docker/init.sql
script.
Alternatively, you can use the docker-compose file located in the docker/
folder to spawn a database.
Run: cd docker/ && docker-compose up -d
Example usage of the script:
python3 warcparser.py -f input/15.warc.gz -c config.yaml -n=corpus-name
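The flags in the command above could be declared with argparse roughly as follows. This is a sketch, not the script's actual argument handling: the destination names, the help texts, and the choice to make every flag required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the -f, -c and -n flags shown in the usage line."""
    parser = argparse.ArgumentParser(
        description="Parse a WARC/0.18 file and index its metadata."
    )
    # Destination names and help texts are illustrative assumptions.
    parser.add_argument("-f", dest="warc_file", required=True,
                        help="path to the input WARC/0.18 file")
    parser.add_argument("-c", dest="config", required=True,
                        help="path to the YAML configuration file")
    parser.add_argument("-n", dest="name", required=True,
                        help="corpus name")
    return parser


# argparse accepts both "-n value" and "-n=value" for a short option.
args = build_parser().parse_args(
    ["-f", "input/15.warc.gz", "-c", "config.yaml", "-n=corpus-name"]
)
```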
Stavros Grigoriou