Issues with tf*idf implementation #7

Closed
jpmckinney opened this issue Nov 20, 2012 · 5 comments

Comments

@jpmckinney

Your implementation normalizes the frequency of a term in a document to the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and has no normalization component. My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity

@josephwilk
Owner

There are several variants of tf*idf:
http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html

The goal of rsemantic is to provide flexibility in using many different types of transforms. I'll look at adding some of the different tfidf variants.

Thanks for reporting this!

@jpmckinney
Author

There are certainly many variants, and I implement some of the alternatives in https://github.com/opennorth/tf-idf-similarity/tree/master/lib/tf-idf-similarity/extras. However, the variant used by Lucene is the only one I see common implementations of. Sphinx, Ferret, and others use that variant.

Lucene docs: http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
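
For readers unfamiliar with that variant, here is a minimal Python sketch of the per-term document weight as I understand it from the linked TFIDFSimilarity documentation for the default similarity: tf is the square root of the raw count, idf is 1 + ln(N / (df + 1)), and the length norm is 1 / sqrt(number of terms). The function names are illustrative assumptions and do not come from either gem or from Lucene's API. Query-side factors (coord, queryNorm, boosts) are left out.

```python
import math


def lucene_tf(count):
    """Term frequency component: square root of the raw count."""
    return math.sqrt(count)


def lucene_idf(num_docs, doc_freq):
    """Inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def lucene_length_norm(num_terms):
    """Document length normalization: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


def lucene_term_weight(count, num_terms, num_docs, doc_freq):
    """Document-side weight of one term (hypothetical helper, no boosts)."""
    return (
        lucene_tf(count)
        * lucene_idf(num_docs, doc_freq)
        * lucene_length_norm(num_terms)
    )
```

Note that document length enters through a separate norm factor rather than by dividing the term count itself.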

@jpmckinney
Author

@josephwilk I just noticed that the link you gave doesn't describe the variant in your code. The link describes dividing by the maximum tf in a document, whereas you divide by the number of terms in a document (and I still can't find any paper that recommends or even describes your variant).
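
To make the difference concrete, here is a small Python sketch of the two term-frequency normalizations being contrasted. It is only an illustration: the function names are made up and neither function is taken from rsemantic or tf-idf-similarity. The maximum-tf form follows the linked Stanford page, ntf = a + (1 - a) * tf / tf_max, where the smoothing term `a` is typically 0.4; setting `a=0` gives plain division by the maximum tf.

```python
from collections import Counter


def tf_by_length(tokens):
    """Divide each term count by the total number of terms in the document."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_by_max(tokens, a=0.4):
    """Maximum tf normalization: a + (1 - a) * tf / tf_max."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    max_tf = max(counts.values())
    return {term: a + (1 - a) * count / max_tf for term, count in counts.items()}
```

For a document such as `["a", "a", "a", "b"]`, the first function weights "a" by its share of all tokens, while the second always gives the most frequent term a weight of 1 regardless of document length.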

@josephwilk josephwilk reopened this Jan 7, 2013
@josephwilk
Owner

Good spot and thanks for coming back to the issue :)

I forget exactly why I had tweaked tfidf like that. I believe it was related to getting better performance with LSA. But since I cannot remember, that is a good enough reason to kill it and move to Lucene's algorithm:

https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb

Feel free to re-open if you think there is something wrong, or want to do some more collaboration.

@jpmckinney
Author

Awesome! That's great.
