siddrtm/Document-Summarization

These scripts are my attempt to understand text summarization. I implemented two papers that use unsupervised methods to extract the most important keywords or sentences from a text. The basic idea of both papers is to build a graph from the text in which the vertices represent the entities to be ranked (words or sentences) and the edges encode some relationship between vertices, which can be syntactic or semantic. The PageRank algorithm is then used to rank the vertices.
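The ranking step can be illustrated with a minimal sketch of PageRank by power iteration over a weighted adjacency matrix. This is not the code from the scripts in this repository: the function name `pagerank`, its parameters, and the default damping factor of 0.85 are assumptions made for the example, and plain Python lists are used to keep it self-contained.

```python
def pagerank(weights, damping=0.85, tol=1e-6, max_iter=100):
    """Rank the vertices of a weighted graph by power iteration.

    weights[i][j] is the weight of the edge from vertex i to vertex j
    (0 if there is no edge). Returns one score per vertex.
    """
    n = len(weights)
    if n == 0:
        return []
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Vertices without outgoing edges spread their score uniformly.
        dangling = sum(scores[i] for i in range(n) if out_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_sums[i]
                for i in range(n)
                if out_sums[i] > 0
            )
            # Normalized form: the (1 - damping) term is divided by n so
            # that the scores always sum to 1 (a choice of this sketch).
            new_scores.append(
                (1 - damping) / n + damping * (incoming + dangling / n)
            )
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores
```

For example, in a small star graph where vertex 0 is linked to vertices 1 and 2, `pagerank([[0, 1, 1], [1, 0, 0], [1, 0, 0]])` gives vertex 0 the highest score and the two outer vertices equal scores.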

The two papers that I explored were:
LexRank: www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
TextRank: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

I plan to explore papers on multi-document summarization next, specifically:
R. McDonald. A Study of Global Inference Algorithms in Multi-Document Summarization. ECIR 2007. (Formulates the summarization task as a global optimization problem using integer linear programming.)
W. Yih et al. Multi-Document Summarization by Maximizing Informative Content-Words. IJCAI 2007. (Introduces stack decoding to this field.)

Scripts:
testRankWord.py: implements the TextRank algorithm for keyword extraction.
testRankSent.py: implements the TextRank algorithm for sentence summarization.
lexRank.py: implements the LexRank algorithm for sentence summarization.
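The main difference between the two sentence-ranking approaches is how edge weights between sentences are computed. The sketch below shows the word-overlap similarity described in the TextRank paper and the idf-modified cosine described in the LexRank paper. It is an illustration under stated assumptions, not the code in these scripts: the helper names are hypothetical, `tokenize` is a naive stand-in for NLTK tokenization, and idf values are computed by treating each sentence of the input as a document.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(sentence):
    # Naive whitespace tokenizer (assumption; stands in for NLTK).
    words = (w.strip(PUNCTUATION).lower() for w in sentence.split())
    return [w for w in words if w]


def textrank_similarity(tokens_a, tokens_b):
    # Number of shared words, normalized by the log sentence lengths.
    if not tokens_a or not tokens_b:
        return 0.0
    denom = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if denom <= 0:
        return 0.0
    return len(set(tokens_a) & set(tokens_b)) / denom


def inverse_sentence_frequency(tokenized_sentences):
    # idf[w] = log(N / n_w), with each sentence treated as a document.
    n = len(tokenized_sentences)
    counts = Counter(w for tokens in tokenized_sentences for w in set(tokens))
    return {w: math.log(n / c) for w, c in counts.items()}


def idf_modified_cosine(tokens_a, tokens_b, idf):
    # Cosine similarity between tf-idf weighted bag-of-words vectors.
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    shared = tf_a.keys() & tf_b.keys()
    num = sum(tf_a[w] * tf_b[w] * idf.get(w, 0.0) ** 2 for w in shared)
    norm_a = math.sqrt(sum((tf_a[w] * idf.get(w, 0.0)) ** 2 for w in tf_a))
    norm_b = math.sqrt(sum((tf_b[w] * idf.get(w, 0.0)) ** 2 for w in tf_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return num / (norm_a * norm_b)
```

Either function can fill a pairwise similarity matrix over the sentences of a document, which then serves as the weighted graph to be ranked.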

Usage: script_name number_of_top_entities document_containing_text
Example: ./lexrank.py 3 data.txt
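A command-line interface of this shape could be handled roughly as in the sketch below. It only illustrates the two positional arguments and the top-N selection. The names `parse_args`, `top_n`, `count`, and `path` are hypothetical and not taken from the scripts.

```python
import argparse


def parse_args(argv):
    # Two positional arguments, mirroring the usage line above.
    parser = argparse.ArgumentParser(
        description="Print the top-ranked entities of a text file."
    )
    parser.add_argument("count", type=int, help="number of top entities")
    parser.add_argument("path", help="document containing the text")
    return parser.parse_args(argv)


def top_n(items, scores, n):
    # Return the n highest-scoring items, best first.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:n]]
```

Here `parse_args(["3", "data.txt"])` yields `count == 3` and `path == "data.txt"`, and `top_n` would be applied to the ranked words or sentences together with their scores.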

Dependencies: NLTK, NumPy

About

Implementation of the LexRank and TextRank algorithms