TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a method to measure how important a term, or a set of terms, is in a collection of documents or, as Wikipedia defines it, in a “corpus”.

I first encountered it in the MOOC Text Retrieval and Search Engines, but I have also come across it in a MOOC about Recommender Systems.

The basic idea is to give higher importance to terms that appear more often in a document, while decreasing that importance as the term appears more often across the collection. So it highlights the specificity of a document with respect to a term, compared to the other documents in the collection.

This method has a lot of literature behind it and many uses in Information Retrieval; in particular, in many variants, it is used in the field of search engines as a ranking tool to sort the search results. These search engines are based on the construction of indexes and a Vector Space Model to match queries to documents.

The main factors involved, as explained above, are:

  • TF: how frequent a term is in a document. It can be computed in many ways, for instance as a boolean, as a simple count, as a logarithmically scaled frequency, or as a value normalized by the length of the document to avoid privileging longer documents
  • IDF: a term designed to penalize common words, those that occur very frequently across the collection
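The TF variants listed above can be sketched in a few lines. This is just an illustrative helper (the function name and sample sentence are my own, not from any library):

```python
from math import log

def tf_variants(term, doc_tokens):
    """Compute several common TF weightings for one term in one document."""
    count = doc_tokens.count(term)
    return {
        "boolean": 1 if count > 0 else 0,          # presence/absence only
        "raw_count": count,                         # simple occurrence count
        "log_scaled": 1 + log(count) if count > 0 else 0,  # dampens large counts
        "length_normalized": count / len(doc_tokens) if doc_tokens else 0.0,
    }

doc = "the cat sat on the mat the cat".split()
print(tf_variants("cat", doc))
```

Note how the length-normalized variant (2 occurrences out of 8 tokens, i.e. 0.25) keeps longer documents from winning just by being longer.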

To understand the IDF term better, we should first discuss a quantity called DF: the frequency of the term in the collection. Intuitively, if a normalized TF(k, d) for a specific term k and document d is similar to DF(k, C) for the same term over the entire collection C, that document should not be ranked higher than the other documents for the term k; almost certainly there are documents deserving a higher ranking. This is where IDF comes to our aid: its purpose is to penalize terms that are common in the collection. In fact, idf(t, D) is normally defined as:

  • log(|D| / |{d ∈ D : t ∈ d}|), where
  • |D| is the size of the collection,
  • |{d ∈ D : t ∈ d}| is the number of documents in D where the term t appears
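Putting the two factors together, here is a minimal sketch of the formula above over a toy collection (the documents and function names are made up for illustration):

```python
from math import log

docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog".split(),
    "d3": "the quick dog jumps".split(),
}

def idf(term, collection):
    # log(|D| / |{d in D : t in d}|), as in the formula above
    n_containing = sum(1 for tokens in collection.values() if term in tokens)
    return log(len(collection) / n_containing) if n_containing else 0.0

def tf_idf(term, doc_id, collection):
    tokens = collection[doc_id]
    tf = tokens.count(term) / len(tokens)  # length-normalized TF
    return tf * idf(term, collection)

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes
print(tf_idf("the", "d1", docs))
print(tf_idf("quick", "d1", docs))
```

This shows the penalization at work: a ubiquitous word like “the” gets weight 0 no matter how often it occurs, while “quick” keeps a positive weight in the documents that contain it.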

This formula has several variants. It is interesting, though, that in the recommender systems application, where the search terms are matched against a catalogue of product descriptions, the best variant in my opinion is the “unary” one: the TF value is set to 1 if the sought term is present in the product characteristics or description, and to 0 if not.
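With unary TF, the score for a product reduces to the sum of the IDF weights of the matched query terms. A minimal sketch, with a hypothetical catalogue and made-up IDF weights purely for illustration:

```python
# Hypothetical catalogue and pre-computed IDF weights (illustrative values only)
catalogue = {
    "p1": "wireless noise cancelling headphones",
    "p2": "wired headphones with microphone",
}
idf_weights = {"wireless": 0.7, "headphones": 0.1, "microphone": 0.7}

def unary_score(query, description, weights):
    # TF is 1 if the term appears in the description and 0 otherwise,
    # so the score is just the sum of IDF weights of the matched terms.
    present = set(description.split())
    return sum(weights.get(t, 0.0) for t in set(query.split()) if t in present)

scores = {pid: unary_score("wireless headphones", desc, idf_weights)
          for pid, desc in catalogue.items()}
print(scores)
```

Both products match “headphones”, but only p1 also matches the rarer, higher-IDF term “wireless”, so it ranks first.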
