Issues with tf*idf implementation #7

jpmckinney · 2012-11-20T05:52:07Z

Your implementation normalizes the frequency of a term in a document by the number of terms in that document (which, as far as I can tell, never occurs in the academic literature) and has no normalization component. My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity
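
For anyone following along, here is a minimal sketch of the term-frequency variant described above, where each raw count is divided by the total number of terms in the document. The method name `length_normalized_tf` is hypothetical and this is not rsemantic's actual code, only an illustration of the arithmetic.

```ruby
# Hypothetical helper (not taken from rsemantic): weight each term by
# its raw count divided by the total number of terms in the document.
def length_normalized_tf(tokens)
  counts = Hash.new(0)
  tokens.each { |term| counts[term] += 1 }

  total = tokens.size.to_f
  counts.each_with_object({}) do |(term, count), tf|
    tf[term] = count / total
  end
end

weights = length_normalized_tf(%w[the cat sat on the mat])
puts weights["the"] # "the" occurs 2 times out of 6 terms
```

With this weighting, the values for a document always sum to 1, so a long document and a short one with the same proportions of terms get identical tf values.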

josephwilk · 2012-11-20T09:12:13Z

There are several variants of tf-idf:
http://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html

The goal of rsemantic is to provide flexibility in using many different types of transforms. I'll look at adding some of the different tfidf variants.

Thanks for reporting this!

jpmckinney · 2012-11-20T14:49:53Z

There are certainly many variants, and I implement some of the alternatives (see https://github.com/opennorth/tf-idf-similarity/tree/master/lib/tf-idf-similarity/extras). However, the variant used by Lucene is the only one I see commonly implemented. Sphinx, Ferret and others use that variant.

Lucene docs: http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
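
As a rough illustration of what those docs describe, the per-term pieces of Lucene's classic default similarity can be sketched in Ruby as below. The method names are my own, not Lucene's or any gem's API, and the full Lucene score also includes query normalization, a coordination factor and boosts, which are left out here.

```ruby
# Illustrative names only; a sketch of the default per-term components.

# Term frequency: square root of the raw count in the document.
def lucene_tf(freq)
  Math.sqrt(freq)
end

# Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1)).
def lucene_idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

# Length normalization: 1 / sqrt(number of terms in the field).
def lucene_length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

# A term occurring 4 times in a 100-term document,
# found in 9 of 1000 documents in the collection.
weight = lucene_tf(4) * lucene_idf(9, 1000) * lucene_length_norm(100)
puts weight
```

Note that the length normalization here is a separate factor applied on top of a square-root tf, rather than a division of the raw count by the document length.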

jpmckinney · 2013-01-06T22:12:58Z

@josephwilk I just noticed that the link you gave doesn't describe the variant in your code. The link describes dividing by the maximum tf in a document, whereas you divide by the number of terms in a document (and I still can't find any paper that recommends or even describes your variant).
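
To make the difference concrete, here is a small sketch of maximum tf normalization as the linked page describes it: ntf = a + (1 - a) * tf / tf_max, with the smoothing term `a` usually set to 0.4. The method name `max_tf_normalized` is hypothetical.

```ruby
# Hypothetical helper: maximum tf normalization. Each count is scaled by
# the largest count in the same document, then smoothed by `a`.
def max_tf_normalized(tokens, a = 0.4)
  counts = Hash.new(0)
  tokens.each { |term| counts[term] += 1 }

  max_count = counts.values.max.to_f
  counts.each_with_object({}) do |(term, count), ntf|
    ntf[term] = a + (1 - a) * count / max_count
  end
end

weights = max_tf_normalized(%w[the cat sat on the mat])
puts weights["the"] # the most frequent term in the document
puts weights["cat"] # a term that occurs once
```

Here the most frequent term always gets weight 1 regardless of document length, whereas dividing by the number of terms gives every term a weight that shrinks as the document grows.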

josephwilk · 2013-01-07T13:51:24Z

Good spot and thanks for coming back to the issue :)

I forget exactly why I tweaked tf-idf like that; I believe it was to get better performance with LSA. But since I cannot remember, that is a good enough reason to kill it and move to Lucene's algorithm:

https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb

Feel free to re-open if you think there is something wrong, or want to do some more collaboration.

jpmckinney · 2013-01-07T16:36:16Z

Awesome! That's great.

josephwilk closed this as completed Nov 20, 2012

jpmckinney mentioned this issue Jan 6, 2013

Try to have other gem authors implement tf*idf correctly jpmckinney/tf-idf-similarity#4

Closed

josephwilk reopened this Jan 7, 2013

josephwilk closed this as completed Jan 7, 2013
