Semantic text and metadata extraction from conference proceedings for constructing knowledge bases and semantic search.
The dataset covers major conferences in machine learning, natural language and speech processing.
The dataset
directory contains the directory structure for a preview. A sample processed json object is available here.
Download the dataset here. Download size: 508MB, Extracted size: 6GB
The current release of the dataset has documents processed from the following conference proceedings.
NeurIPS | EMNLP | ACL | InterSpeech |
---|---|---|---|
- | - | - | - |
2019 | 2019 | 2019 | 2019 |
2018 | - | 2018 | 2018 |
2017 | - | 2017 | 2017 |
2016 | - | 2016 | - |
2015 | - | 2015 | - |
Note: Few files from certain proceedings are dropped from the dataset due to parsing errors. More proceedings volumes to be added soon.
The dataset contains the following fields extracted from each document.
- Semantically extracted fields using a nltk and spacy pipeline.
entities
: Named Entity Recognition is performed on the document text to extract text span as entites and tag them with entity_type.tags
: Part of Speech tagged tokens extracted from document text.parser
: Dependency Parsing between text spans of document.noun_chunks
: Base noun phrases that have a noun as their head.
- Metadata fields extracted using PyPDF2
filename
metadata
numPages
title
author
subject
creator
producer
keywords
creationdate
moddate
trapped
ptexfullbanner
raw_text
-
An instance of the extraction pipeline can be run locally to produce processed json in the same format as from the dataset.
Clone the repository and cd into proceedings/
.
git clone https://github.com/enigmaeth/proceedings
cd proceedings/
The repository requires python3
to be installed. Check the python version and initialise a virtualenv to install requirements.
$ python --version
Python 3.6.9
$ python3 -m venv .
$ source .env/bin/activate
$ pip install -r requirements.txt
Download the required spacy and nltk modules.
$ pip install -U spacy
$ pip install -U spacy-lookups-data
$ python -m spacy download en_core_web_sm
# Install the nltk punkt module
$ python3
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> punkt
# punkt module should be downloaded now.
The entry point for the pipeline is src/extract.py
. It defines the following three arguments.
--directory DIRECTORY **REQUIRED**
Root dir with documents to process.
eg: ~/docs/acl/acl_2015
--accepted_formats ACCEPTED_FORMATS
File extensions to process. Default: ["pdf"]
--max_size MAX_SIZE Max size of a single file to process. Default: 5MB
To process a documents directory with extract.py
, run the following command. The path to directory should be an absolute path.
python3 extract.py --directory <path/to/directory>
Contributions for adding processed proceedings are welcome and would help grow the dataset quickly.
-
If you are interested in adding a new proceeding, please follow the steps in
Local Development
and send a Pull Request with the processed json (in.zip
/.tar.gz
) that would be added in the next release! -
If you want a new proceeding to be added, please open a new issue using the template here!
All the semantic extraction components live inside src/repository/semantic_extract.py
file.
-
If you are interested in adding a new semantic extraction component, please follow the steps in
Local Development
to run a local instance. Verify if your component works by running it on thetest
directory and send a Pull Request! -
If you want a new component to be added, please open a new issue using the template here!
The project is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.