Issues with tf*idf implementation #7
Comments
There are many variants of tfidf. The goal of rsemantic is to provide flexibility in using many different types of transforms. I'll look at adding some of the different tfidf variants. Thanks for reporting this! |
There are certainly many variants, and I implement some of the alternatives (see https://github.com/opennorth/tf-idf-similarity/tree/master/lib/tf-idf-similarity/extras). However, the variant used by Lucene is the only one I see implemented widely; Sphinx, Ferret and others use it. Lucene docs: http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html |
@josephwilk I just noticed that the link you gave doesn't describe the variant in your code. The link describes dividing by the maximum tf in a document, whereas you divide by the number of terms in a document (and I still can't find any paper that recommends or even describes your variant). |
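To make the difference between the two normalizations concrete, here is a minimal Python sketch. The function names `tf_by_max` and `tf_by_length` and the toy document are hypothetical, chosen for illustration; this is not code from rsemantic or tf-idf-similarity.

```python
from collections import Counter


def tf_by_max(tokens):
    """Divide each term count by the count of the most frequent term."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}


def tf_by_length(tokens):
    """Divide each term count by the total number of terms in the document."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}


# Toy document (an assumption, for illustration only).
doc = ["cat", "cat", "dog", "bird"]
by_max = tf_by_max(doc)        # the most frequent term always gets 1.0
by_length = tf_by_length(doc)  # the values sum to 1.0 across the document
```

With max-tf normalization the most frequent term always scores 1.0 regardless of document length, while dividing by the number of terms makes every weight shrink as the document grows.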
Good spot, and thanks for coming back to the issue :) I forget exactly why I tweaked tfidf like that. I believe it was related to getting better performance with LSA. But since I cannot remember, that is a good enough reason to kill it and move to Lucene's algorithm. Feel free to re-open if you think there is something wrong, or want to do some more collaboration. |
Awesome! That's great. |
Your implementation normalizes the frequency of a term in a document by the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and has no normalization component. My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity
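For reference, the components of Lucene's classic TFIDFSimilarity, as its documentation describes the default implementation, can be sketched in Python as below. The function names are hypothetical, and `term_weight` is a simplification of my own: Lucene's full practical scoring function also includes query-side factors (coord, queryNorm, boosts) that are left out here.

```python
import math


def lucene_tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def lucene_idf(num_docs, doc_freq):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Document length normalization: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified per-term document weight (assumption: no boosts, no query factors)."""
    return lucene_tf(freq) * lucene_idf(num_docs, doc_freq) * length_norm(num_terms)
```

Here the length normalization is a separate factor applied per document, rather than being folded into the term frequency itself.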