-
-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TextRank algorithm #100
Conversation
@kmkurn Thanks for this. From the brief view. Can you please remove file Also I don't like you removed old algorithm. Can you please create new class for TextRank algorithm and let the old summarization algorithm in sumy with some other name (for example |
|
* Rename old TextRank as reduction summarizer * Fix TextRank algorithm to use PageRank * Add requirements.txt * Fix deprecated section name in setup.cfg * Update readme * Fix similarity edge cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took so long. Firstly I didn't notice your code changes and later I was to busy (read lazy) to do deeper review. Anyway, I checked the code and for some reason I can't see the algorithm from the paper in the code. Maybe it's just me because it's quite a time I read some paper with these things :) But can you please add comments to the method _create_matrix
explaining what the matrix means? What means rows, cols, cells on diagonal? And what lines with the computation means? I identified weights /= weights.sum(axis=1)[:, numpy.newaxis]
as sum{Vk \e Out(Vj)}(wjk)
but not sure because I can't find other sum for Vj
from WS(Vi)
computation.
README.md
Outdated
```sh | ||
$ pip install pytest pytest-cov | ||
$ pip install -r requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I still don' think that handling 2 files for dependencies is good idea. Can you please change this to pip install -U pytest -e .
?
requirements.txt
Outdated
nltk==3.2.5 | ||
numpy==1.14.0 | ||
pytest==3.3.2 | ||
pytest-cov==2.5.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think developers should be stick to concrete version of pytest and also pytest-cov is fully optional. It's the same as with pytest-watch. I use them but if you don't it doesn't matter.
README.md
Outdated
- **SumBasic** - Method that is often used as a baseline in | ||
the literature. Source: [Read about | ||
SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf) | ||
- **KL-Sum** - Method that greedily adds sentences to a summary so | ||
long as it decreases the KL Divergence. Source: [Read about | ||
KL-Sum](http://www.aclweb.org/anthology/N09-1041) | ||
- **Reduction** - Graph-based summarization, taken from https://github.com/adamfabish/Reduction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have separate list of implementations and descriptions of the methods. Link to https://github.com/adamfabish/Reduction is below. Can you please change this to rather describe the method? Something like "Graph based summarization inspired by TextRank but with just single iteration for sentence rating."
from ..utils import build_document | ||
|
||
|
||
class TestTextRank(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please use pytest style tests without the class? I was using nosetests in the past and that's why there is a lot of classes, but now pytest if the best test FW for me and would be nice to remove this boilerplate. Also the name here is wrong since summarizer is named Reduction now :) And can you also use plain assert
statements pytest uses?
|
||
text_rank_module.numpy = numpy | ||
|
||
|
||
class TestTextRank(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here, can you please remove this class and use only functions ans use pytest asserts?
sumy/summarizers/text_rank.py
Outdated
|
||
for i, words_i in enumerate(sentences_as_words): | ||
for j, words_j in enumerate(sentences_as_words): | ||
weights[i, j] = self._rate_sentences_edge(words_i, words_j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make method _rate_sentences_edge
static please?
@@ -89,5 +90,4 @@ def test_sentences_rating(self): | |||
|
|||
ratings = summarizer.rate_sentences(document) | |||
self.assertEqual(len(ratings), 3) | |||
self.assertTrue(ratings[document.sentences[1]] > ratings[document.sentences[0]]) | |||
self.assertTrue(ratings[document.sentences[0]] > ratings[document.sentences[2]]) | |||
self.assertAlmostEqual(sum(ratings.values()), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use pytest.approx here instead :)
sumy/summarizers/text_rank.py
Outdated
# method works; without this division, the stationary probability blows up. This | ||
# should not affect the ranking of the vertices so we can use the resulting stationary | ||
# probability as is without any postprocessing. | ||
return (1.-d)/N * numpy.ones((N, N)) + d * weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about numpy.full((N, N), (1.-self.damping)/N) + self.damping * weights
?
sumy/summarizers/text_rank.py
Outdated
|
||
def _create_matrix(self, document): | ||
sentences_as_words = [self._to_words_set(sent) for sent in document.sentences] | ||
N = len(sentences_as_words) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please name this variable sentences_count
instead of N? I am quite a hater of variable names as x, y, z, N, r, ...
even in math equations and similar. I think that people would understand math/physics easier if the variables would be real words reflecting things in reality or they would have colors. Since Python is not very colorful language (fortunately :) I am using meaningful names ;)
@@ -7,7 +7,7 @@ tag = false | |||
|
|||
[bumpversion:file:sumy/__init__.py] | |||
|
|||
[pytest] | |||
[tool:pytest] | |||
addopts = --quiet --tb=short --color=yes --cov=sumy --cov-report=term-missing --no-cov-on-fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miso-belica you have pytest-cov options listed here, so anyone invoking pytest
command (without any options) will have to have pytest-cov installed. So it is not optional. Thus, it's either removing pytest-cov options from here or make it required as a dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmkurn Thanks for this. Then make pytest-cov required please.
@miso-belica I've implemented the requested changes but the build failed for Python 3.7 and nightly when doing |
@kmkurn thanks, I think you can ignore it now. I will try to figure it later or just will ignore it because it's dev versions anyway, now I am enjoying The PyCon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work and patience 😃
This PR fixes the TextRank algorithm so that it runs PageRank algorithm (as discussed in #87). This PR also:
requirements.txt
file so contributors can install all the required dependencies easily. I don't know the exact version that you use, so maybe you can check if all the versions are correct.setup.cfg
to[tool:pytest]
instead of[pytest]
; the latter is deprecated.Some things to note:
.travis.yml
file might be modified so it installs dependencies fromrequirements.txt
so there's only one place where dependencies are listed.