Issues with tf*idf implementation #2

jpmckinney · 2012-11-20T05:55:56Z

Your implementation uses plain term and document frequencies, with no damping or normalization, an approach that, as far as I can tell, never occurs in the academic literature. My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity

mkdynamic · 2013-07-16T03:05:11Z

Thanks for flagging. Since I don't use the gem anymore myself, I'm unlikely to invest the time to address this anytime soon. However, I'd welcome a patch :)

Could you expand briefly on the impact of normalization and damping?

jpmckinney · 2013-07-16T16:30:19Z

I'd invite you to read about tf*idf implementations (there are links to references in my gem's README), but briefly, if you perform a similarity search without any normalization, you will have too strong a bias towards:

- Terms that appear frequently. A term that appears 10 times in a document is rarely 10 times as important as a term that appears once. Algorithms generally take the square root of the term frequency.
- Longer documents. Algorithms use cosine normalization to make all tf-idf vectors into unit vectors, which removes all bias relating to document length, since document length is generally considered irrelevant.
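
To make the two points above concrete, here is a minimal Python sketch of square-root damping followed by cosine normalization. It is an illustration only, not code from either gem: the function names `damped_unit_vector` and `cosine_similarity` are hypothetical, and the IDF factor is left out to keep the example short.

```python
import math
from collections import Counter

def damped_unit_vector(tokens):
    """Return square-root-damped term weights scaled to unit length.

    Hypothetical helper for illustration; IDF is deliberately omitted.
    """
    counts = Counter(tokens)
    # Damping: a term seen 9 times gets weight 3, not 9.
    weights = {term: math.sqrt(count) for term, count in counts.items()}
    # Cosine normalization: divide every weight by the Euclidean norm.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in weights.items()}

def cosine_similarity(a, b):
    """Dot product of two unit vectors stored as dicts."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

short_doc = "apple banana".split()
long_doc = "apple banana".split() * 50  # same content, 50 times longer

similarity = cosine_similarity(
    damped_unit_vector(short_doc),
    damped_unit_vector(long_doc),
)
print(similarity)
```

Because both vectors are scaled to unit length, the long document scores no higher than the short one for having more words; the two point in the same direction, so their similarity is 1 up to floating-point error.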

As for your calculation of IDF, every reference I've found takes the log of that ratio, so I'm not quite sure how you arrived at your implementation.
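
As a rough illustration of why the logarithm matters, the Python sketch below compares a logged IDF with an unlogged ratio. The unlogged form `num_docs / doc_freq` is an assumption made purely for comparison and may not match the implementation discussed in this issue; `raw_idf` and `log_idf` are hypothetical names, and real libraries often add smoothing constants that are omitted here.

```python
import math

def raw_idf(num_docs, doc_freq):
    """Unlogged ratio, assumed here only as a point of comparison."""
    return num_docs / doc_freq

def log_idf(num_docs, doc_freq):
    """Textbook form: natural log of the same ratio, without smoothing."""
    return math.log(num_docs / doc_freq)

num_docs = 1_000_000
for doc_freq in (1, 1_000, 1_000_000):
    print(doc_freq, raw_idf(num_docs, doc_freq), log_idf(num_docs, doc_freq))
```

In a corpus of a million documents, the unlogged ratio lets a term found in a single document outweigh a term found in a thousand documents by a factor of 1,000, while the logged version only doubles its weight. A term that appears in every document gets a logged IDF of zero.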

jpmckinney mentioned this issue Jan 6, 2013

Try to have other gem authors implement tf*idf correctly jpmckinney/tf-idf-similarity#4

Closed
