tf*idf: Why don't you normalize to maximum term count for document? #22

jpmckinney · 2012-09-08T21:56:35Z

Treat (and the tf-idf and similarity gems) all normalize tf to the number of terms in the document: https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L78

We normalize so that (1) long and short documents have comparable tf weights and (2) documents with large vocabularies and those with small vocabularies have comparable tf weights.

Normalizing to the number of terms in a document only fixes (1). Normalizing to the maximum term count (which is all I've ever seen in the literature) fixes both (1) and (2).
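
To make the difference concrete, here is a minimal Ruby sketch of the two normalizations. The helper names (`term_counts`, `tf_by_length`, `tf_by_max`) are hypothetical and are not taken from Treat or any of the gems mentioned. The two sample documents have the same length but different vocabulary sizes.

```ruby
# Hypothetical helpers for illustration only; not Treat's code.

# Raw term counts for an already-tokenized document.
def term_counts(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# tf normalized by the total number of tokens in the document.
def tf_by_length(tokens, term)
  return 0.0 if tokens.empty?
  term_counts(tokens)[term].to_f / tokens.size
end

# tf normalized by the count of the most frequent term in the document.
def tf_by_max(tokens, term)
  counts = term_counts(tokens)
  return 0.0 if counts.empty?
  counts[term].to_f / counts.values.max
end

# Same length (8 tokens), different vocabulary sizes (2 vs. 7 distinct terms).
small_vocab = %w[a a a a b b b b]
large_vocab = %w[a a b c d e f g]

by_length = [tf_by_length(small_vocab, "a"), tf_by_length(large_vocab, "a")]
by_max    = [tf_by_max(small_vocab, "a"), tf_by_max(large_vocab, "a")]
```

In both documents "a" is the most frequent term. Dividing by document length gives it different weights (0.5 and 0.25) even though the lengths match, while dividing by the maximum term count gives 1.0 in both.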

louismullie · 2012-10-24T12:35:51Z

It should be included in the next release of the gem.

jpmckinney · 2012-10-24T13:08:10Z

I wrote my own gem in the meantime: https://github.com/opennorth/tf-idf-similarity Perhaps you'd like to re-use it?

louismullie · 2012-10-24T13:13:00Z

Sweet. I will! My philosophy is really to keep as much logic code as possible separate from the Treat core, so this is definitely the kind of contribution I welcome. My naïve implementation was horribly inefficient anyway.

Thanks,
Louis

louismullie · 2012-10-29T05:37:04Z

I am interested in using your library to replace the current tf_idf workers, but my requirements would be as follows:

1 - We need to be able to input documents that are already tokenized.
2 - We need to be able to easily access tf*idf scores for a given word in a given document.

From what I understand, the gem does not provide a public interface to perform either of these tasks.

Are you interested in coding up the public functions so I can use it directly? Or should I add it to my (long) to-do list for Treat?

jpmckinney · 2012-10-29T15:28:57Z

I assume the tokenized documents would be arrays? You can now do that by passing a :tokens argument when initializing a document, e.g. Document.new("Lorem ipsum", :tokens => ["Lorem", "ipsum"])
If you have a collection of documents, you can now get the tf*idf for a particular document and term with collection.tfidf(document, term). Previously you'd have to multiply the idf and tf yourself, e.g.:

collection.idf(term) * collection.tf(document, term)
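
For readers without the gem at hand, the product above can be sketched in self-contained Ruby. This is not the tf-idf-similarity gem's implementation: the `TinyCollection` class is hypothetical, documents are plain token arrays, tf is normalized to the maximum term count as proposed at the top of this issue, and idf uses the plain log(N / df) variant, which is an assumption (the gem may use a different formula).

```ruby
# Hypothetical sketch; names and formulas are assumptions, not the gem's API.
class TinyCollection
  # documents: an array of token arrays (already tokenized).
  def initialize(documents)
    @documents = documents
  end

  # Term frequency normalized to the maximum term count in the document.
  def tf(tokens, term)
    counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
    return 0.0 if counts.empty?
    counts[term].to_f / counts.values.max
  end

  # Inverse document frequency: log(N / df), or 0.0 for unseen terms.
  def idf(term)
    df = @documents.count { |tokens| tokens.include?(term) }
    return 0.0 if df.zero?
    Math.log(@documents.size.to_f / df)
  end

  # tf*idf for one term in one document.
  def tfidf(tokens, term)
    idf(term) * tf(tokens, term)
  end
end

docs = [%w[lorem ipsum lorem], %w[ipsum dolor]]
collection = TinyCollection.new(docs)
lorem_score = collection.tfidf(docs[0], "lorem")
ipsum_score = collection.tfidf(docs[0], "ipsum")
```

With this idf variant, a term that appears in every document (here "ipsum") scores zero, while "lorem", which appears only in the first document, scores log(2).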

jpmckinney mentioned this issue Jan 6, 2013

Try to have other gem authors implement tf*idf correctly jpmckinney/tf-idf-similarity#4

Closed
