TextNN is a collection of Python code snippets solving different text mining tasks (on varying datasets) using deep learning.
Before using the code, please install the necessary software dependencies.
- Install conda (i.e., Anaconda or Miniconda)
- Create the `textnn` conda environment:

  ```sh
  conda env create -f environment.yml; \
    conda activate textnn
  ```

- Update the conda environment (from an old version):

  ```sh
  conda env update -f environment.yml; \
    conda activate textnn
  ```
Running the code in a Docker container can be achieved by building the image:

```sh
docker build --target=env-and-code --tag textnn .
```

and running the image in interactive mode (the conda environment is loaded automatically):

```sh
docker run --rm -it textnn
```

To reflect current code changes inside the container, you can bind the current directory as a code volume:

```sh
docker run --rm -v "${PWD}:/code" -it textnn
```

Please note: changes in the container reflect on the code directory of the host system.
To enable GPU support, build with:

```sh
docker build --target=gpu-env-and-code --tag textnn .
```

and run:

```sh
docker run --rm --runtime=nvidia -it textnn
```

The recommended EC2 setup (e.g., g3s.xlarge) is based on the Deep Learning AMI (Ubuntu) Version 21.2
(ami-0e9085a8d461c2d01) with an increased volume of 120GB or more. It is recommended to execute code via
Docker, by setting up the project and creating an image:

```sh
git clone https://github.com/tongr/TextNN && cd TextNN && \
docker build --target=gpu-env-and-code -t textnn .
```

and running the experiments inside the container:

```sh
docker run --rm --runtime=nvidia -v "${PWD}:/code" -it textnn
```

To build and push the current version (also tagged as `latest`), run:
```sh
DATE="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" && \
NAME="registry.gitlab.com/tongr/textnn" && \
VERSION="$(git describe --always)" && \
COMMIT="$(git rev-parse HEAD)" && \
docker build --target=env-and-code --build-arg "BUILD_DATE=${DATE}" --build-arg "BUILD_NAME=${NAME}" \
  --build-arg "BUILD_VERSION=${VERSION}" --build-arg "VCS_REF=${COMMIT}" \
  --tag ${NAME}:${VERSION} --tag ${NAME}/cpu:${VERSION} . && \
docker tag ${NAME}:${VERSION} ${NAME}:latest && \
docker tag ${NAME}/cpu:${VERSION} ${NAME}/cpu:latest && \
docker push ${NAME}:${VERSION} && docker push ${NAME}/cpu:${VERSION} && \
docker push ${NAME}:latest && docker push ${NAME}/cpu:latest && \
docker build --target=gpu-env-and-code --build-arg "BUILD_DATE=${DATE}" --build-arg "BUILD_NAME=${NAME}" \
  --build-arg "BUILD_VERSION=${VERSION}" --build-arg "VCS_REF=${COMMIT}" --tag ${NAME}/gpu:${VERSION} . && \
docker tag ${NAME}/gpu:${VERSION} ${NAME}/gpu:latest && \
docker push ${NAME}/gpu:${VERSION} && docker push ${NAME}/gpu:latest
```
Run tests:

```sh
docker run --rm -it registry.gitlab.com/tongr/textnn:latest pytest --cov -vv
```
The individual datasets have a specific `DATASET` indicator; the parameters for the following experiments are
equivalent:

- To run training and evaluation of an LSTM model to predict positive/negative reviews, run:

  ```sh
  python ./run_experiment.py [DATASET] [OPT_ARGS] train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
  ```

  where the optional `VALIDATION_HOLD_OUT_RATIO` (default `0.05`) specifies how much data will be held back for
  epoch validation during training. Further optional arguments `OPT_ARGS` influence the following areas:

  - text encoding settings: `--vocabulary-size VOCABULARY_SIZE`, `--max-text-length MAX_TEXT_LENGTH`,
    `--pad-beginning [True|False]` (whether to add padding at the start or the end of a sequence), and
    `--use-start-end-indicators [True|False]` (whether to use the reserved indicator tokens `<START>` and `<END>`)
  - embedding setup: `--embeddings [EMBEDDING_SIZE|PRETRAINED_EMBEDDINGS_FILE]` (`--update-embeddings [True|False]`)
  - network structure: `--layer-definitions [LAYER_DEFINITIONS]` (layer definitions separated by a pipe, e.g.,
    `--layer-definitions 'LSTM(16)|Dense(8)'`)
  - training: `--batch-size BATCH_SIZE`, `--num-epochs NUM_EPOCHS`, `--learning-rate LEARNING_RATE`,
    `--learning-decay LEARNING_DECAY`, `--shuffle-training-data [True|False|RANDOM_SEED]` (`RANDOM_SEED` refers to
    an `int` value used as the seed for the random number generator)
  - print config information: `--log-config [True|False]` (default: `True`)
- To debug the selected encoding model, run:

  ```sh
  python ./run_experiment.py [DATASET] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
  ```

  This command will create representations for the two example sentences. The parameter `--show-padding` forces
  the output of `<PAD>` indicators in the re-decoded text, and `--show-start-end` en-/disables `<START>` and
  `<END>` indicators. The aforementioned optional arguments `OPT_ARGS` still apply.
- To execute k-fold cross validation based only on the training data set, run:

  ```sh
  python ./run_experiment.py [DATASET] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
  ```

  The `NUMBER_OF_FOLDS` indicates the amount of folds/splits to use for cross validation. The aforementioned
  optional arguments `OPT_ARGS` still apply.
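The encoding behaviour controlled by `--max-text-length`, `--pad-beginning`, and `--use-start-end-indicators` can be sketched in a few lines of Python. Everything below (the token ids, the reserved-token values, and the `encode` helper) is a hypothetical illustration of the idea, not TextNN's actual implementation:

```python
# Reserved indicator tokens (hypothetical ids for illustration only)
PAD, START, END, UNK = 0, 1, 2, 3

def encode(tokens, vocab, max_len, pad_beginning=True, start_end=True):
    """Map tokens to ids, optionally wrap in <START>/<END>, truncate, and pad."""
    ids = [vocab.get(t, UNK) for t in tokens]   # unknown words become <UNK>
    if start_end:
        ids = [START] + ids + [END]             # --use-start-end-indicators
    ids = ids[:max_len]                         # --max-text-length truncation
    padding = [PAD] * (max_len - len(ids))
    # --pad-beginning decides whether padding precedes or follows the text
    return padding + ids if pad_beginning else ids + padding

vocab = {"this": 4, "is": 5, "a": 6, "test": 7}
print(encode("this is a test sentence".split(), vocab, max_len=10))
```

Running the sketch prints `[0, 0, 0, 1, 4, 5, 6, 7, 3, 2]`: three `<PAD>` ids up front (pad-beginning), then the `<START>`/`<END>` indicators around the text, with `<UNK>` standing in for the out-of-vocabulary word "sentence".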
The ACL IMDb dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing, and can be found here (alt. here).

Preparation: download the dataset and extract it into the `aclImdb` subfolder:

```sh
curl http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz | tar -xz
```

In the following examples, the indicator `IMDB_DATA_FOLDER` refers to the base folder of the ACL IMDb dataset:

```sh
IMDB_DATA_FOLDER=${PWD}/aclImdb/
```
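The extracted archive contains `train/` and `test/` splits, each with `pos/` and `neg/` subfolders holding one review per `.txt` file. A minimal loader sketch under that assumption (the function name and the 1/0 label encoding are illustrative, not TextNN's API):

```python
from pathlib import Path

def load_imdb_split(base_folder, split="train"):
    """Yield (review_text, label) pairs from an aclImdb-style folder.

    Assumes the layout base_folder/<split>/{pos,neg}/*.txt;
    label 1 = positive review, 0 = negative review.
    """
    for label_name, label in (("pos", 1), ("neg", 0)):
        for txt in sorted(Path(base_folder, split, label_name).glob("*.txt")):
            yield txt.read_text(encoding="utf-8"), label
```

For example, `load_imdb_split(IMDB_DATA_FOLDER, "train")` would stream all 25,000 labelled training reviews.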
Run experiments:

- To run training and evaluation of an LSTM model to predict positive/negative reviews, run:

  ```sh
  python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
  ```

  where `IMDB_DATA_FOLDER` refers to the base folder of the ACL IMDb dataset, and the aforementioned optional
  arguments `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
- To debug the selected encoding model, run:

  ```sh
  python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
  ```

  The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
- To execute k-fold cross validation based only on the training data set, run:

  ```sh
  python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
  ```

  The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
The Amazon reviews dataset consists of a hundred million reviews by millions of Amazon customers over two decades. The reviews express opinions and describe the customer experiences regarding products on the Amazon.com website. The different review subsets are listed here: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Preparation: download a dataset (e.g., the Amazon Video reviews `amazon_reviews_us_Video_v1_00.tsv.gz`):

```sh
wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz -P amazon
```

In the following examples, the indicator `AMAZON_DATA_FILE` refers to the downloaded data file of the Amazon
dataset:

```sh
AMAZON_DATA_FILE=${PWD}/amazon/amazon_reviews_us_Video_v1_00.tsv.gz
```
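The dumps are gzipped TSV files whose columns include `star_rating` and `review_body`. A sketch of streaming such a file into (text, label) pairs; note that the binarisation rule (4-5 stars = positive, 1-2 = negative, 3 skipped) is an assumption for illustration, not necessarily TextNN's label mapping:

```python
import csv
import gzip

def iter_amazon_reviews(tsv_gz_path):
    """Yield (review_text, label) pairs from an Amazon reviews TSV dump.

    Assumes the public dumps' `star_rating` and `review_body` columns and a
    hypothetical binarisation: 4-5 stars -> 1, 1-2 stars -> 0, 3 stars dropped.
    """
    with gzip.open(tsv_gz_path, "rt", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE):
            stars = int(row["star_rating"])
            if stars == 3:
                continue  # drop neutral reviews
            yield row["review_body"], int(stars >= 4)
```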
Run experiments:

- To run training and evaluation of an LSTM model to predict positive/negative reviews, run:

  ```sh
  python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
  ```

  where `AMAZON_DATA_FILE` refers to the Amazon dataset file, and the aforementioned optional arguments
  `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
- To debug the selected encoding model, run:

  ```sh
  python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
  ```

  The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
- To execute k-fold cross validation based only on the training data set, run:

  ```sh
  python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
  ```

  The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
The YELP reviews dataset consists of approx. 6 million reviews for 200k businesses. The reviews express opinions and describe the customer experiences collected on www.yelp.com.
Preparation: download the dataset and extract `review.json`. In the following examples, the indicator
`YELP_DATA_FILE` refers to the extracted `review.json` file.
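`review.json` stores one JSON object per line, including `stars` and `text` fields. A sketch of streaming it into (text, label) pairs; as with the Amazon sketch, the binarisation rule is an assumption for illustration, not necessarily TextNN's:

```python
import json

def iter_yelp_reviews(review_json_path):
    """Yield (review_text, label) pairs from Yelp's review.json.

    Assumes one JSON object per line with `stars` and `text` fields, and a
    hypothetical binarisation: >3 stars -> 1, <3 stars -> 0, 3 stars dropped.
    """
    with open(review_json_path, encoding="utf-8") as fh:
        for line in fh:
            review = json.loads(line)
            if review["stars"] == 3:
                continue  # drop neutral reviews
            yield review["text"], int(review["stars"] > 3)
```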
Run experiments:

- To run training and evaluation of an LSTM model to predict positive/negative reviews, run:

  ```sh
  python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
  ```

  where `YELP_DATA_FILE` refers to the YELP dataset file, and the aforementioned optional arguments
  `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
- To debug the selected encoding model, run:

  ```sh
  python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
  ```

  The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
- To execute k-fold cross validation based only on the training data set, run:

  ```sh
  python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
  ```

  The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
TODO: add description ...
Pretrained word embeddings can be used by loading the provided `.vec` files, for instance, fastText aligned word vectors (alternatively, other word vectors).
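A `.vec` file in the common fastText text format starts with a `word_count dimension` header line, followed by one `word v1 v2 ... vd` row per word. A small parser sketch under that assumption (illustrative only, not TextNN's actual embedding loader):

```python
def load_vec_file(path):
    """Parse a fastText-style .vec file into a {word: [float, ...]} dict.

    Assumes the plain-text format: a 'word_count dimension' header line,
    then one 'word v1 v2 ... vd' line per word.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as fh:
        n_words, dim = (int(v) for v in fh.readline().split())
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vector = [float(v) for v in values]
            assert len(vector) == dim  # every row must match the header dimension
            embeddings[word] = vector
    return embeddings
```

The resulting dict can then be used to initialise an embedding matrix for the vocabulary chosen by `--vocabulary-size`.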