tf*idf: Why don't you normalize to maximum term count for document? #22
Comments
It should be included in the next release of the gem.
I wrote my own gem in the meantime: https://github.com/opennorth/tf-idf-similarity Perhaps you'd like to re-use it?
Sweet. I will! My philosophy is really to keep as much logic code as possible separate from the Treat core, so this is definitely the kind of contribution I welcome. My naïve implementation was horribly inefficient anyway. Thanks,
I am interested in using your library to replace the current tf_idf workers, but my requirements would be as follows: 1 - We need to be able to input documents that are already tokenized. From what I understand, the gem does not provide a public interface to perform either of these tasks. Are you interested in coding up the public functions so I can use it directly? Or should I add it to my (long) to-do list for Treat?
collection.idf(term) * collection.tf(document, term)
Treat (and the tf-idf and similarity gems) all normalize tf to the number of terms in the document: https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L78
We normalize so that (1) long and short documents have comparable tf weights and (2) documents with large vocabularies and those with small vocabularies have comparable tf weights.
Normalizing to the number of terms in a document only fixes (1). Normalizing to the maximum term count (which is all I've ever seen in the literature) fixes both (1) and (2).
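To make the difference concrete, here is a minimal Ruby sketch of the two normalizations. The method names (`term_counts`, `tf_by_length`, `tf_by_max_count`) and the sample documents are hypothetical, written only for illustration; they are not taken from Treat or from the tf-idf-similarity gem, and they assume documents arrive as arrays of tokens.

```ruby
# Hypothetical helpers for illustration; not part of Treat or any gem.

# Count how often each term occurs in a tokenized document.
def term_counts(tokens)
  tokens.each_with_object(Hash.new(0)) { |term, counts| counts[term] += 1 }
end

# tf normalized to the total number of terms in the document.
def tf_by_length(tokens, term)
  return 0.0 if tokens.empty?
  term_counts(tokens)[term].to_f / tokens.size
end

# tf normalized to the count of the most frequent term in the document.
def tf_by_max_count(tokens, term)
  counts = term_counts(tokens)
  return 0.0 if counts.empty?
  counts[term].to_f / counts.values.max
end

# Two documents of equal length (8 tokens) but different vocabulary sizes.
small_vocab = %w[a a a a b b b b]   # 2 distinct terms
large_vocab = %w[a b c d e f g h]   # 8 distinct terms

tf_by_length(small_vocab, "a")      # => 0.5
tf_by_length(large_vocab, "a")      # => 0.125
tf_by_max_count(small_vocab, "a")   # => 1.0
tf_by_max_count(large_vocab, "a")   # => 1.0
```

In both documents "a" is tied for the most frequent term. Dividing by document length gives it a much lower weight in the large-vocabulary document, while dividing by the maximum term count gives it the same weight in both.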