This repository contains the code for our "Deep Natural Language Processing" exam paper at Politecnico di Torino (2022).
Starting from the report of the second-place team at the FNS-2021 challenge, and inspired by the paper Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting, we explore document summarization with an approach that can be used for both abstractive and extractive summaries. We reproduce the results of the cited report and extend the task to an abstractive Italian dataset. Moreover, we train models with different parameters in order to explore the behaviour of the RL model when different ROUGE scores are given as reward functions.
You can:
- Preprocess the data
- Train the models
- Evaluate the models
The .ipynb files available here provide a tutorial for the full process and the script to evaluate our models.
- Python 3 (tested on python 3.7)
- PyTorch 0.4.0
- with GPU and CUDA enabled installation (though the code is runnable on CPU, it would be way too slow)
- gensim
- cytoolz
- tensorboardX
- pyrouge (for evaluation)
- FNS 2021: download here
- Italian data extracted from WikiLingua: download from here. We refer to this dataset as wiki10k, since it is an extraction of 10k documents and their respective summaries.
We report some statistics of the analyzed datasets (note: we only use 450 documents for FNS due to limited hardware resources).
| | wiki10k documents | wiki10k summaries | fns-2021 documents | fns-2021 summaries |
|---|---|---|---|---|
| tot words | 11670364 | 1354573 | 21689271 | 1040095 |
| avg words per sent | 19.016 | 9.46 | 36.23 | 29.59 |
| avg words per doc | 1167.153 | 135.47 | 48198.38 | 2311.32 |
| tot sents | 613711 | 143098 | 598496 | 35149 |
| avg sents per doc | 61.37 | 14.31 | 1329.99 | 78.1 |
| tot docs | 10000 | 10000 | 450 | 450 |
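The statistics above can be recomputed with a short script. The sketch below is illustrative only: it assumes one plain-text file per document/summary and NLTK tokenization, which may differ from the actual preprocessing.

```python
# Illustrative only: corpus statistics for one folder of plain-text files
# (one document per file), using NLTK sentence/word tokenization.
import glob
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk punkt data

def corpus_stats(folder):
    n_docs = n_sents = n_words = 0
    for path in glob.glob(f"{folder}/*.txt"):
        with open(path, encoding="utf-8") as f:
            sents = sent_tokenize(f.read())
        n_docs += 1
        n_sents += len(sents)
        n_words += sum(len(word_tokenize(s)) for s in sents)
    return {
        "tot words": n_words,
        "tot sents": n_sents,
        "tot docs": n_docs,
        "avg words per sent": n_words / max(n_sents, 1),
        "avg words per doc": n_words / max(n_docs, 1),
        "avg sents per doc": n_sents / max(n_docs, 1),
    }

print(corpus_stats("raw/annual_reports"))  # hypothetical folder name
```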
The dataset folder also contains several already-trained models that can be used to run the evaluation.
To run the whole preprocessing pipeline, use the script prepro_pipeline.py inside the preprocess folder and run the following command.
The script generates a folder containing the processed data: the documents, the summaries, the corpus, and the BoW.
Note: the last argument is optional and is used for abstractive summarization, since the summaries may contain words not present in the documents.
python prepro_pipeline.py <PHASE OF PREPROCESSING (all for full pipeline)> <PATH TO RAW DATA>/annual_reports <PATH TO RAW DATA>/gold_summaries (opt)<PATH OF FOLDER CONTAINING BOTH DOCUMENTS AND SUMMARIES>
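As a rough illustration of what the corpus/BoW outputs contain (not the actual script code; the file layout is an assumption), a vocabulary with word counts can be built like this:

```python
# Illustrative bag-of-words / vocabulary count over the processed documents;
# the real prepro_pipeline.py may store these in a different format.
import glob
from collections import Counter

vocab = Counter()
for path in glob.glob("processed/documents/*.txt"):  # hypothetical layout
    with open(path, encoding="utf-8") as f:
        vocab.update(f.read().lower().split())

# Map the most frequent words to the smallest ids.
word2id = {w: i for i, (w, _) in enumerate(vocab.most_common())}
```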
If you use the Wiki dataset, to extract and generate the raw data run generate_raw_wiki_ita.py with the following command.
python generate_raw_wiki_ita.py <PATH OF italian.pickle DATASET> <PATH WHERE TO SAVE THE RAW DATA> <NUMBER OF ARTICLES TO EXTRACT>
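The sketch below shows the kind of extraction the script performs. The structure of the WikiLingua pickle assumed here (a dict of articles, each with sections holding a 'document' and a 'summary') is an assumption, so treat it as illustrative rather than the script itself.

```python
# Illustrative extraction of the first N document/summary pairs from the
# WikiLingua Italian pickle; the pickle structure is assumed, not verified.
import os
import pickle
import sys

pickle_path, out_dir, n_articles = sys.argv[1], sys.argv[2], int(sys.argv[3])
os.makedirs(f"{out_dir}/documents", exist_ok=True)
os.makedirs(f"{out_dir}/summaries", exist_ok=True)

with open(pickle_path, "rb") as f:
    data = pickle.load(f)

count = 0
for url, sections in data.items():
    if count >= n_articles:
        break
    for name, content in sections.items():
        if count >= n_articles:
            break
        with open(f"{out_dir}/documents/{count}.txt", "w", encoding="utf-8") as out:
            out.write(content["document"])
        with open(f"{out_dir}/summaries/{count}.txt", "w", encoding="utf-8") as out:
            out.write(content["summary"])
        count += 1
```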
The script extract_labels.py generates the labels by matching the gold summary sentences with the corresponding report sentences; for each report it generates a JSON file with the labels and stores it inside the destination folder.
Run the following command to extract the labels, specifying the source paths for the processed reports and summaries and the destination path for the labeled data.
python extract_labels.py <PATH OF PROCESSED DOCUMENTS> <PATH OF PROCESSED GOLD SUMMARIES> <PATH WHERE TO SAVE THE LABELED SAMPLES>
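The idea behind the matching is to assign, to each gold-summary sentence, the most similar report sentence. The sketch below uses plain word overlap as the similarity, whereas the actual script may use a ROUGE-based score; the output file name is hypothetical.

```python
# Rough sketch of extractive label generation: for each summary sentence,
# pick the index of the most similar document sentence (word overlap here).
import json

def overlap(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def make_labels(doc_sents, summary_sents):
    return [max(range(len(doc_sents)), key=lambda i: overlap(doc_sents[i], s))
            for s in summary_sents]

doc_sents = ["Revenue grew in 2020.", "Costs were stable.", "The board met twice."]
summary_sents = ["Revenue grew.", "Costs were stable overall."]
with open("labels.json", "w") as f:  # hypothetical output name
    json.dump({"extracted": make_labels(doc_sents, summary_sents)}, f)
```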
The script train_w2v.py inside the preprocess folder reads the whole preprocessed corpus, uses it to train a gensim word2vec model, and saves the model.
Run the following command to train the word2vec model.
python train_w2v.py <PATH OF PROCESSED CORPUS> <PATH WHERE TO SAVE THE MODEL (specify the model name)>
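A minimal sketch of what this step amounts to, assuming the processed corpus is a plain-text file with one tokenized sentence per line (paths and hyperparameters are placeholders, not the ones used in the paper):

```python
# Minimal gensim word2vec training sketch; in gensim < 4.0 the parameter
# is called `size` instead of `vector_size`.
from gensim.models import Word2Vec

with open("processed/corpus.txt", encoding="utf-8") as f:  # hypothetical path
    sentences = [line.split() for line in f]

model = Word2Vec(sentences, vector_size=128, window=5, min_count=5, workers=4)
model.save("word2vec.128d.bin")
```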
Use the script train_extractor_ml.py and run the following command.
python train_extractor_ml.py --data_dir=<PATH TO SPLIT POINTER DATA> --path=<PATH WHERE TO SAVE THE MODEL> --w2v=<PATH OF W2V MODEL>
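The --w2v argument points to the gensim model trained above. Purely as an illustration (not the extractor's actual loading code), its vectors can be used to initialize a PyTorch embedding layer:

```python
# Illustration of seeding an nn.Embedding with the saved word2vec vectors;
# `index_to_key` is gensim >= 4.0 (older versions expose `index2word`).
import torch
import torch.nn as nn
from gensim.models import Word2Vec

w2v = Word2Vec.load("word2vec.128d.bin")   # hypothetical model name
vocab = w2v.wv.index_to_key                # list of words in id order
weights = torch.tensor(w2v.wv.vectors)     # shape: (vocab_size, emb_dim)

embedding = nn.Embedding.from_pretrained(weights, freeze=False)
```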
To train the abstractor we use the data extracted from the extractor; run the following command to generate the abstractor data.
python generate_extractor_outputs.py --dir=<PATH OF LABELLED DATA (train or val)> --w2v_file=<PATH OF W2V FILE> --extractor_model=<PATH OF EXTRACTOR MODEL FOLDER> --output_dir=<PATH WHERE TO SAVE THE EXTRACTED DATA>
Use the script train_abstractor.py and run the following command.
python train_abstractor.py --data_dir=<PATH TO SPLIT ABSTRACTOR DATA> --path=<PATH WHERE TO SAVE THE MODEL> --w2v=<PATH OF W2V MODEL>
Use the script train_full_rl.py and run the following command.
The argument --reward=avg_rouges sets the reward function to the average of ROUGE-1, ROUGE-2, and ROUGE-L; by default the reward is only ROUGE-2.
python train_full_rl.py --data_dir=<PATH OF ABSTRACTOR DATA> --path=<PATH WHERE TO SAVE THE MODEL> --ext_dir=<EXTRACTOR MODEL PATH> --abs_dir=<ABSTRACTOR MODEL PATH> --reward=avg_rouges
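As a sketch of what the reward options mean (using simplified n-gram/LCS recalls in place of whatever ROUGE implementation the training code actually uses), the avg_rouges reward is the mean of the three ROUGE scores, while the default uses only the ROUGE-2 term:

```python
# Simplified ROUGE-style recalls, only to illustrate the avg_rouges reward;
# the real training code computes proper ROUGE scores.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n_recall(gen, ref, n):
    ref_ngrams = ngrams(ref, n)
    return len(ngrams(gen, n) & ref_ngrams) / len(ref_ngrams) if ref_ngrams else 0.0

def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(gen, ref):
    return lcs_len(gen, ref) / len(ref) if ref else 0.0

def avg_rouges_reward(gen, ref):
    # --reward=avg_rouges: mean of ROUGE-1, ROUGE-2 and ROUGE-L.
    # The default reward would be rouge_n_recall(gen, ref, 2) alone.
    return (rouge_n_recall(gen, ref, 1)
            + rouge_n_recall(gen, ref, 2)
            + rouge_l_recall(gen, ref)) / 3.0

print(avg_rouges_reward("the cat sat".split(), "a cat sat down".split()))
```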
Use the script decode_eval.py to get the results; the validation data are used.
python decode_eval.py --path=<PATH WHERE TO SAVE THE EXTRACTION> --model_dir=<PATH OF RL MODEL> --data_dir=<PATH OF DIR OF LABELLED DATA>
We train different models on different datasets and with different settings (columns) and evaluate them with ROUGE-1, ROUGE-2, and ROUGE-L, computed for each generated summary against the reference and averaged over the number of documents.
For the full settings of the models, refer to the appendix of the paper.
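The averaging described above is what pyrouge reports when it is run over the decoded summaries and the references. A rough sketch (directory layout and filename patterns are assumptions):

```python
# Sketch of scoring decoded summaries with pyrouge; ROUGE-1.5.5 already
# averages the per-document scores, matching the table below.
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "decoded"                  # hypothetical folder of generated summaries
r.model_dir = "references"                # hypothetical folder of gold summaries
r.system_filename_pattern = r"(\d+).dec"  # hypothetical naming scheme
r.model_filename_pattern = "#ID#.ref"

output = r.convert_and_evaluate()
scores = r.output_to_dict(output)
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```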
| | wiki rl rouge-2 | wiki rl rouge-L | wiki rl avg-rouges | fns rl rouge-2 | fns rl rouge-L | fns rl avg-rouges |
|---|---|---|---|---|---|---|
| avg rouge-2 | 0.0282 | 0.0273 | 0.0255 | 0.0413 | 0.218 | 0.0732 |
| avg rouge-1 | 0.154 | 0.149 | 0.132 | 0.139 | 0.348 | 0.237 |
| avg rouge-L | 0.152 | 0.145 | 0.131 | 0.134 | 0.344 | 0.228 |