Course material for "Traitement automatique de corpus" (STIC-B545) taught at ULB.
Caution: Python 3.6 or higher is required for f-string support (3.9 or even 3.10 is better).
There are two ways to run this code:
- Install Docker: https://docs.docker.com/get-docker/
- Start Docker
```bash
git clone [email protected]:madewild/tac.git
cd tac
docker-compose build
docker-compose up
```
Everything will then be available at http://localhost:8888/lab. This way you can use the Jupyter notebooks (`*.ipynb` files) at the root of each module and ignore the raw scripts.
The second, recommended way is to run this code in a virtual environment:
```bash
git clone [email protected]:madewild/tac.git
cd tac
virtualenv venv --python=python3
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python -m spacy download fr_core_news_md
```
This way you can use the `*.py` scripts (in the `raw_scripts` subdirectory of each module). They should be executed from the root of this repository. The scripts are organized as follows:
- `s1_sql`: querying a simple relational database
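For illustration, a minimal sketch of this kind of query using Python's built-in `sqlite3` module (the database file and table names here are hypothetical):

```python
import sqlite3

# Connect to a local SQLite database (hypothetical file name)
conn = sqlite3.connect("data/example.db")
cursor = conn.cursor()

# Run a simple parameterized query and print the results
# (the "bulletins" table is a made-up example)
cursor.execute("SELECT name, year FROM bulletins WHERE year > ?", (1950,))
for name, year in cursor.fetchall():
    print(name, year)

conn.close()
```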
- `s2_sparql`: querying the Wikidata SPARQL endpoint
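As a sketch, here is a minimal query against the public Wikidata endpoint with `requests` (the query itself is just an example, not necessarily the one used in the script):

```python
import requests

# Public Wikidata SPARQL endpoint
URL = "https://query.wikidata.org/sparql"

# Example query: ten cities (Q515) with their French labels
QUERY = """
SELECT ?city ?cityLabel WHERE {
  ?city wdt:P31 wd:Q515 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr". }
}
LIMIT 10
"""

# Wikidata asks clients to identify themselves with a User-Agent
headers = {"User-Agent": "tac-course-example"}
response = requests.get(URL, params={"query": QUERY, "format": "json"}, headers=headers)
for row in response.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"])
```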
- `s3_api`: playing with the OpenStreetMap and EU countries APIs
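A minimal sketch of a geocoding call against Nominatim, OpenStreetMap's public search API (the query string is an arbitrary example):

```python
import requests

# Nominatim is the public OpenStreetMap geocoding API
url = "https://nominatim.openstreetmap.org/search"
params = {"q": "Grand-Place, Bruxelles", "format": "json"}
# Nominatim asks clients to send a descriptive User-Agent
headers = {"User-Agent": "tac-course-example"}

results = requests.get(url, params=params, headers=headers).json()
if results:
    print(results[0]["lat"], results[0]["lon"])
```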
- `s4_scrape`: scraping the AVB to retrieve 2833 PDF bulletins
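The general scraping pattern looks like the sketch below; the listing URL is hypothetical, and the real script follows the actual AVB site structure:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.example.org/bulletins"  # hypothetical listing page

html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")

# Collect every link pointing to a PDF and download it
for link in soup.find_all("a", href=True):
    href = urljoin(BASE_URL, link["href"])
    if href.endswith(".pdf"):
        pdf = requests.get(href)
        with open(href.rsplit("/", 1)[-1], "wb") as f:
            f.write(pdf.content)
```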
- `s1_convert`: convert PDFs to TXT files, move them to a dedicated folder, and aggregate them into a single big text file
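One common way to do the PDF-to-text step is with `pdfminer.six`, as sketched below (folder names are hypothetical, and the actual script may use a different extraction library):

```python
from pathlib import Path
from pdfminer.high_level import extract_text

# Hypothetical folder layout
pdf_dir = Path("data/pdf")
txt_dir = Path("data/txt")
txt_dir.mkdir(parents=True, exist_ok=True)

# Convert each PDF to a .txt file
for pdf in pdf_dir.glob("*.pdf"):
    text = extract_text(str(pdf))
    (txt_dir / f"{pdf.stem}.txt").write_text(text, encoding="utf-8")

# Aggregate all TXT files into one big text file
all_text = "\n".join(p.read_text(encoding="utf-8") for p in sorted(txt_dir.glob("*.txt")))
Path("data/all.txt").write_text(all_text, encoding="utf-8")
```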
- `s2_explore`: playing with various categories (city, year, decade, type...)
- `s3_freq`: basic frequency analysis, hapaxes, long words...
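A minimal sketch of such counts with the standard library (`collections.Counter`); the input path is hypothetical:

```python
from collections import Counter
import re

# Hypothetical path to the aggregated corpus
text = open("data/all.txt", encoding="utf-8").read()
words = re.findall(r"\w+", text.lower())
freq = Counter(words)

print(freq.most_common(10))                       # most frequent words
hapaxes = [w for w, n in freq.items() if n == 1]  # words occurring only once
long_words = [w for w in freq if len(w) > 15]     # unusually long words
print(len(hapaxes), long_words[:10])
```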
- `s1_keyword`: using YAKE to extract French keywords from each text file
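A minimal YAKE sketch for French (the n-gram size and number of keywords are arbitrary choices here, not necessarily the script's settings):

```python
import yake

text = "Le conseil communal de Bruxelles a voté le budget pour les écoles."

# French extractor: n-grams up to 2 words, top 5 keywords
extractor = yake.KeywordExtractor(lan="fr", n=2, top=5)
for keyword, score in extractor.extract_keywords(text):
    print(keyword, score)  # in YAKE, a lower score means a more relevant keyword
```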
- `s2_wordcloud`: generating a word cloud for a given year (calling `filtering.py` in the background)
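For illustration, a minimal sketch with the `wordcloud` package (input and output paths are hypothetical):

```python
from wordcloud import WordCloud

# Hypothetical input: the aggregated text for one year
text = open("data/txt/1950.txt", encoding="utf-8").read()

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("1950.png")
```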
- `s3_ner`: perform NER with the spaCy French model (install spaCy from the requirements, then download the model with `python -m spacy download fr_core_news_sm`)
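A minimal sketch of spaCy NER with the French model mentioned above (the sentence is a made-up example):

```python
import spacy

# Load the small French model downloaded above
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Victor Horta a construit plusieurs maisons à Bruxelles en 1900.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. person (PER) and location (LOC) entities
```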
- `s4_sentiment`: analyse positive/negative sentences with TextBlob
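Assuming the French extension `textblob-fr` is installed, its documented usage pattern looks like this sketch:

```python
from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer

# Blobber configured with the French tagger/analyzer from textblob-fr
tb = Blobber(pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())

blob = tb("Quelle belle matinée !")
polarity, subjectivity = blob.sentiment
print(polarity, subjectivity)  # polarity > 0 means a positive sentence
```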
- `s1_classification`: supervised classification of the 20 Newsgroups corpus
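For illustration, one standard scikit-learn baseline on this dataset (TF-IDF features with naive Bayes; the actual script may use a different pipeline):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Standard train/test split shipped with the dataset
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB().fit(X_train, train.target)
print(accuracy_score(test.target, clf.predict(X_test)))
```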
- `s2_clustering`: unsupervised clustering with k-means
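A minimal k-means sketch with scikit-learn on TF-IDF vectors (the toy documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["le budget des écoles", "les travaux de voirie",
        "subsides pour les écoles", "réfection des routes"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster index assigned to each document
```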
- `s3_sentence_tokenizer`: split a big text into sentences
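One common approach is NLTK's Punkt tokenizer, sketched below (the script itself may split sentences differently):

```python
import nltk

nltk.download("punkt")  # sentence tokenizer models

text = "M. Dupont est arrivé. Il a ouvert la séance. Tout le monde était là."
sentences = nltk.sent_tokenize(text, language="french")
print(sentences)
```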
- `s4_word_embeddings`: train and explore a word2vec model on the corpus
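A minimal gensim sketch (the two toy sentences stand in for the real tokenized corpus; `vector_size` is the gensim 4.x parameter name):

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per item
sentences = [["le", "conseil", "communal", "vote", "le", "budget"],
             ["le", "conseil", "approuve", "les", "travaux"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(model.wv.most_similar("conseil", topn=3))
```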
- `s1_language_detection`: language identification with langid
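The `langid` API is essentially a single call, as in this sketch:

```python
import langid

# classify() returns a (language code, score) tuple
print(langid.classify("Le bourgmestre a pris la parole."))  # ('fr', ...)
print(langid.classify("The mayor took the floor."))         # ('en', ...)
```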
- `s2_machine_translation`: machine translation with transformers
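A minimal sketch with the Hugging Face `transformers` pipeline; the OPUS-MT model named here is one common choice for French-to-English, not necessarily the one the script uses:

```python
from transformers import pipeline

# Helsinki-NLP OPUS model for French-to-English translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
result = translator("Le conseil communal a approuvé le budget.")
print(result[0]["translation_text"])
```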
- `s3_anonymization`: de-identification of data with Faker
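A minimal Faker sketch (the locale and the replacement strategy shown are illustrative choices):

```python
from faker import Faker

fake = Faker("fr_FR")  # French locale

# Replace each real name with a plausible fake one
real_names = ["Jean Dupont", "Marie Martin"]
replacements = {name: fake.name() for name in real_names}
print(replacements)
```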
- `s1_extraction`: extract text from various file types
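One option for format-agnostic extraction is the `textract` package, sketched below (the file names are hypothetical, and the script may handle the formats differently):

```python
import textract

# textract dispatches on the file extension (PDF, DOCX, ODT...)
for path in ["report.pdf", "minutes.docx"]:  # hypothetical files
    text = textract.process(path).decode("utf-8")
    print(path, len(text), "characters")
```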
- `s2_fuzzy_matching`: correcting OCR errors with edit distance
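A minimal sketch of the idea using the standard library's `difflib`, which ranks candidates by string similarity (the lexicon and OCR tokens are made up; the script may use a dedicated edit-distance library):

```python
import difflib

lexicon = ["bourgmestre", "conseil", "budget", "bulletin"]

# OCR output with errors; find the closest known word
for token in ["bourgmcstre", "budgct"]:
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=0.8)
    print(token, "->", match[0] if match else "?")
```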