# Corpus-Viewer-Addons

This repository provides a set of distributed services to analyze texts from large document corpora in a fast and easy way:
- NLP Toolkit
- A set of Topic Models trained on internal corpora and packaged as RESTful APIs
- A repository for searching and exploring documents through their topic distributions
## Quick Start

- Install Docker-Engine and Docker-Compose.
- Clone this repo and move into the `src/test/docker/` directory:

```
git clone https://github.com/cbadenes/corpus-viewer-addons.git
```
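Before starting any of the services below, it may help to confirm that both tools are available (a quick sanity check; any reasonably recent versions should work):

```
# Verify that Docker and Docker-Compose are installed.
docker --version
docker-compose --version
```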
## NLP Service

- Move into the `src/test/docker/nlp/en` directory.
- Run the service: `docker-compose up -d`
- You should be able to monitor the progress with: `docker-compose logs -f`
- The above command runs two services, DBpedia Spotlight and librAIry NLP-EN, using the settings specified in `docker-compose.yml`.
- The HTTP RESTful API should be available at: `http://localhost:7777/en`
- More info here
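Once the containers are up, a plain request against the base URL is a quick way to check that the API is reachable (a minimal sketch; the annotation routes and payloads of the NLP API are documented in the librAIry NLP project, not here):

```
# Check that the NLP service answers on its base URL.
curl http://localhost:7777/en
```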
## Topic Models

- Move into the `src/test/docker/models` directory.
- Run the models: `docker-compose up -d`
- You should be able to monitor the progress with: `docker-compose logs -f`
- The above command runs a web service for each model, using the settings specified in `docker-compose.yml`.
- The HTTP RESTful APIs should be available at: `http://localhost:800[x]/model`
A Topic Model is described by the following services:

- `/settings`: read the metadata of the model.
- `/topics`: list the topics discovered by the model.
- `/topics/{id}`: get the details of a topic.
- `/topics/{id}/words`: explore a topic through its word distribution.
- `/topics/{id}/neighbours`: explore a topic through its co-occurring topics.
- `/inferences`: build a vector with the topic distributions for a given text.
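The read-only services can be exercised with plain GET requests. A sketch, assuming one of the models is published on port 8001 (substitute the port assigned to the model you started, as listed in `docker-compose.yml`):

```
# Model metadata, topic listing and topic details (port 8001 is an assumption).
curl http://localhost:8001/model/settings
curl http://localhost:8001/model/topics
curl http://localhost:8001/model/topics/0/words

# /inferences presumably expects the text to analyze in a POST request;
# check the API documentation of the model service for the exact payload.
```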
## Document Repository

- Move into the `src/test/docker/solr` directory.
- Create the collections with the `create.sh` script.
- Manage the service with the `start.sh`, `stop.sh` and `clean.sh` scripts.
- You should be able to monitor the progress with: `docker logs -f solr`
- The above scripts run a Solr service using the settings specified in `docker-compose.yml`.
- The HTTP Admin console should be available at: `http://localhost:8983`
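Once Solr is up, its standard admin API can confirm that the service and its collections are available (a sketch; the collection names follow from the load tests described in the next section):

```
# List the cores known to this Solr instance.
curl "http://localhost:8983/solr/admin/cores?action=STATUS"
```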
## Loading Data

In order to be able to index documents, it is necessary to have the topic distributions (doctopics) along with the texts.

You can download one of the following corpora: CORDIS, Wikipedia or Patstat, by executing the following script:
```
./download-corpora.sh
```
Once executed, the `/corpora` folder is created with the texts and vectors associated with each corpus.

Then you can run the unit tests (`src/test/java/load`) to load the data into the repository:
- `LoadDocuments`: saves the texts and meta-information of each document in the `documents` collection
- `LoadDocTopics`: saves the topic distribution of each document in the `doctopics` collection
- `LoadTopicDocs`: saves the topic info of each model in the `topicdocs` collection
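These tests can be run individually from the command line. A sketch, assuming the project builds with Maven (which the `src/test/java` layout suggests):

```
# Run each load test on its own via Maven Surefire's test filter.
mvn test -Dtest=LoadDocuments
mvn test -Dtest=LoadDocTopics
mvn test -Dtest=LoadTopicDocs
```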
## Querying

There are multiple ways to explore the content of the corpus. You can take a look at the queries we've defined in `src/test/java/query`.
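Beyond the unit tests, the repository can also be queried directly through Solr's HTTP API. A sketch against the `documents` collection created above (adjust the query parameters as needed):

```
# Fetch the first five documents from the documents collection.
curl "http://localhost:8983/solr/documents/select?q=*:*&rows=5"
```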
## How to Cite

You can use the following BibTeX entry to cite this work:
```
@inproceedings{Badenes-Olmedo:2017:DTM:3103010.3121040,
 author = {Badenes-Olmedo, Carlos and Redondo-Garcia, Jos{\'e} Luis and Corcho, Oscar},
 title = {Distributing Text Mining Tasks with librAIry},
 booktitle = {Proceedings of the 2017 ACM Symposium on Document Engineering},
 series = {DocEng '17},
 year = {2017},
 isbn = {978-1-4503-4689-4},
 pages = {63--66},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3103010.3121040},
 doi = {10.1145/3103010.3121040},
 acmid = {3121040},
 publisher = {ACM},
 keywords = {data integration, large-scale text analysis, nlp, scholarly data, text mining},
}
```