Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TextRank algorithm #100

Merged
merged 2 commits into from
Mar 18, 2018
Merged

Fix TextRank algorithm #100

merged 2 commits into from
Mar 18, 2018

Conversation

kmkurn
Copy link
Contributor

@kmkurn kmkurn commented Jan 22, 2018

This PR fixes the TextRank algorithm so that it runs PageRank algorithm (as discussed in #87). This PR also:

  • Adds requirements.txt file so contributors can install all the required dependencies easily. I don't know the exact version that you use, so maybe you can check if all the versions are correct.
  • Updates pytest config section name in setup.cfg to [tool:pytest] instead of [pytest]; the latter is deprecated.
  • Updates README by referencing the original TextRank paper. (I notice there is a README.rst file as well, why the redundancy? Github can parse .rst file just fine.)
  • Fixes edge case when computing sentence similarity in TextRank. This guarantees the adjacency matrix has non-zero values in the diagonals, which makes more sense because a sentence should be similar to itself.

Some things to note:

  • Current version of pytest does not seem to have pytest executable for every python version. I believe tox is now the standard tool to do tests on several python versions.
  • If this PR is accepted, .travis.yml file might be modified so it installs dependencies from requirements.txt so there's only one place where dependencies are listed.

@miso-belica
Copy link
Owner

@kmkurn Thanks for this. From the brief view. Can you please remove file requirements.txt? sumy is library and not application and I don't want to rely on specific versions. Dependencies should be installed from setup.py file. See https://caremad.io/posts/2013/07/setup-vs-requirement/ for explanation.

Also I don't like you removed old algorithm. Can you please create new class for TextRank algorithm and let the old summarization algorithm in sumy with some other name (for example ReductionSummarizer) so it will be still available? Some people like ti and use it :)

@miso-belica miso-belica self-requested a review January 22, 2018 08:57
@kmkurn
Copy link
Contributor Author

kmkurn commented Jan 22, 2018

  1. Yes sumy is a library, but requirements.txt file's purpose is for the developers/contributors, not users. For the developers, sumy becomes an application (this is mentioned in the article you linked), so ideally they all have the same dev environment which includes the exact Python and library versions. Developers do not install sumy from pypi but rather they clone the repository and install from requirements.txt. Projects like nltk and requests also have requirements.txt file (or similar). I can remove it if you insist though.

  2. Sure no problem.

* Rename old TextRank as reduction summarizer
* Fix TextRank algorithm to use PageRank
* Add requirements.txt
* Fix deprecated section name in setup.cfg
* Update readme
* Fix similarity edge cases
Copy link
Owner

@miso-belica miso-belica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took so long. Firstly I didn't notice your code changes and later I was to busy (read lazy) to do deeper review. Anyway, I checked the code and for some reason I can't see the algorithm from the paper in the code. Maybe it's just me because it's quite a time I read some paper with these things :) But can you please add comments to the method _create_matrix explaining what the matrix means? What means rows, cols, cells on diagonal? And what lines with the computation means? I identified weights /= weights.sum(axis=1)[:, numpy.newaxis] as sum{Vk \e Out(Vj)}(wjk) but not sure because I can't find other sum for Vj from WS(Vi) computation.

README.md Outdated
```sh
$ pip install pytest pytest-cov
$ pip install -r requirements.txt
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I still don' think that handling 2 files for dependencies is good idea. Can you please change this to pip install -U pytest -e .?

requirements.txt Outdated
nltk==3.2.5
numpy==1.14.0
pytest==3.3.2
pytest-cov==2.5.1
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think developers should be stick to concrete version of pytest and also pytest-cov is fully optional. It's the same as with pytest-watch. I use them but if you don't it doesn't matter.

README.md Outdated
- **SumBasic** - Method that is often used as a baseline in
the literature. Source: [Read about
SumBasic](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf)
- **KL-Sum** - Method that greedily adds sentences to a summary so
long as it decreases the KL Divergence. Source: [Read about
KL-Sum](http://www.aclweb.org/anthology/N09-1041)
- **Reduction** - Graph-based summarization, taken from https://github.com/adamfabish/Reduction
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have separate list of implementations and descriptions of the methods. Link to https://github.com/adamfabish/Reduction is below. Can you please change this to rather describe the method? Something like "Graph based summarization inspired by TextRank but with just single iteration for sentence rating."

from ..utils import build_document


class TestTextRank(unittest.TestCase):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use pytest style tests without the class? I was using nosetests in the past and that's why there is a lot of classes, but now pytest if the best test FW for me and would be nice to remove this boilerplate. Also the name here is wrong since summarizer is named Reduction now :) And can you also use plain assert statements pytest uses?


text_rank_module.numpy = numpy


class TestTextRank(unittest.TestCase):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, can you please remove this class and use only functions ans use pytest asserts?


for i, words_i in enumerate(sentences_as_words):
for j, words_j in enumerate(sentences_as_words):
weights[i, j] = self._rate_sentences_edge(words_i, words_j)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make method _rate_sentences_edge static please?

@@ -89,5 +90,4 @@ def test_sentences_rating(self):

ratings = summarizer.rate_sentences(document)
self.assertEqual(len(ratings), 3)
self.assertTrue(ratings[document.sentences[1]] > ratings[document.sentences[0]])
self.assertTrue(ratings[document.sentences[0]] > ratings[document.sentences[2]])
self.assertAlmostEqual(sum(ratings.values()), 1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use pytest.approx here instead :)

# method works; without this division, the stationary probability blows up. This
# should not affect the ranking of the vertices so we can use the resulting stationary
# probability as is without any postprocessing.
return (1.-d)/N * numpy.ones((N, N)) + d * weights
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about numpy.full((N, N), (1.-self.damping)/N) + self.damping * weights?


def _create_matrix(self, document):
sentences_as_words = [self._to_words_set(sent) for sent in document.sentences]
N = len(sentences_as_words)
Copy link
Owner

@miso-belica miso-belica Feb 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please name this variable sentences_count instead of N? I am quite a hater of variable names as x, y, z, N, r, ... even in math equations and similar. I think that people would understand math/physics easier if the variables would be real words reflecting things in reality or they would have colors. Since Python is not very colorful language (fortunately :) I am using meaningful names ;)

@@ -7,7 +7,7 @@ tag = false

[bumpversion:file:sumy/__init__.py]

[pytest]
[tool:pytest]
addopts = --quiet --tb=short --color=yes --cov=sumy --cov-report=term-missing --no-cov-on-fail
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miso-belica you have pytest-cov options listed here, so anyone invoking pytest command (without any options) will have to have pytest-cov installed. So it is not optional. Thus, it's either removing pytest-cov options from here or make it required as a dependency.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmkurn Thanks for this. Then make pytest-cov required please.

@kmkurn
Copy link
Contributor Author

kmkurn commented Mar 10, 2018

@miso-belica I've implemented the requested changes but the build failed for Python 3.7 and nightly when doing python setup.py install. It worked just find for earlier version of Python so I'm not sure why. Do you have any idea?

@miso-belica
Copy link
Owner

@kmkurn thanks, I think you can ignore it now. I will try to figure it later or just will ignore it because it's dev versions anyway, now I am enjoying The PyCon ☺️

Copy link
Owner

@miso-belica miso-belica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work and patience 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants