This is the first of a series of articles dedicated to Python libraries for scientific use.
The first library I would like to introduce is closely related to my previous post about TF-IDF. In the vast world of Python there is a library, gensim, which contains, among other things, a set of tools to implement a TF-IDF model.
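As a quick refresher, TF-IDF weighs a term by how often it appears in a document, discounted by how many documents contain it. Here is a minimal sketch of the classic weighting in plain Python (gensim's default, if I read the docs right, uses a base-2 logarithm and then normalizes the resulting vectors):

from math import log

def tfidf_weight(tf, df, num_docs):
    # term frequency times log of the inverse document frequency;
    # a term appearing in every document gets weight 0
    return tf * log(float(num_docs) / df, 2)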
Let’s start by installing the library (a plain pip install gensim should do it) and importing something from it:
from gensim import corpora, models, similarities
This imports the most important parts of gensim. As I do not actually know much about Python and am exploring it as a tool to learn Data Science, I would also like to do this experiment while reading from a file. I have a file containing a list of MOOC titles, more or less like this (I won’t post the whole file):
....
Machine learning techniques
yoga: body and mind
database knowledge
cognitive neuroscience
tv
Macro Economics
MCMC simulation
hadoop beginner
innovations in data science
metallurgy
applications data mining in health
business intelligence prediction
agile methodologies
quantum computing
master data management
Microeconomic
data structures and algorithms
history of the united states
genetic algorithms
learn to play the guitar
...
So, I start from this one as the basis for my experiment.
# File with a list of mooc titles
f = open('/home/antonio/moocs-list.txt', 'r')

# Empty list to work with the mooc titles (my documents)
docs = []
num_documents = 0

# Add all the mooc titles to the collection
for line in f:
    docs.append(line)
    num_documents += 1

f.close()
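A small aside: the more idiomatic way to read a file in Python is the with statement, which closes the file automatically even if something goes wrong in the middle. Something like this should be equivalent:

# same result, but the file is closed automatically
with open('/home/antonio/moocs-list.txt', 'r') as f:
    docs = [line for line in f]
num_documents = len(docs)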
I also want to remove all so-called “stopwords” from the list.
# Create a list of stopwords
stoplist = set('for a of the and to in'.split())

# Purge all stopwords from the corpus
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in docs]

# Build the dictionary (the mapping between words and integer ids)
docs = [[token for token in text] for text in texts]
dictionary = corpora.Dictionary(docs)

# store the dictionary to file for future reference (I love python's handling of files!)
dictionary.save('/home/antonio/moocs-list-dict')

# Verify what is in there!
print(dictionary.num_docs)
print(num_documents)

# This is the main object, the corpus
corpus = [dictionary.doc2bow(text) for text in docs]

# store to disk, for later use
corpora.MmCorpus.serialize('/home/antonio/moocs-list-serial', corpus)
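To get a feeling for what the dictionary holds, you can also print the word-to-id mapping (the ids below are just for illustration; they depend on the order in which gensim encounters the words):

# peek at the mapping between tokens and their integer ids
print(dictionary.token2id)
# something like: {'machine': 0, 'learning': 1, 'techniques': 2, ...}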
Now I would like to see what the matching vectors look like!
new_doc = "macro economics"
new_vec = dictionary.doc2bow(new_doc.lower().split())

# both words have been found and this is the resulting vector...
print(new_vec)

new_doc2 = "elvis presley"
new_vec2 = dictionary.doc2bow(new_doc2.lower().split())

# no words will be found, so the vector is empty...
print(new_vec2)
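Each pair in the printed vector is (word id, count). If, like me, you find the raw ids hard to read, the dictionary can translate them back:

# map the ids in the sparse vector back to the actual words
print([(dictionary[word_id], count) for word_id, count in new_vec])
# something like: [('macro', 1), ('economics', 1)]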
This is not enough. I now want to match a query against the whole corpus, using gensim’s similarity tools:
# Let's play with tfidf, we transform corpus...
tfidf = models.TfidfModel(corpus)

# building up the index; num_features must be the size of the
# vocabulary (len(dictionary)), not the number of documents
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# save the index for later use
index.save('/home/antonio/moocs-list.index')

# the query (I use the first of the two vectors created before)...
vec_tfidf = tfidf[new_vec]

sims = index[vec_tfidf]
print(list(enumerate(sims)))
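The sims array holds one similarity score per document, in corpus order. To actually see the best matches it helps to sort them; a small sketch (remember that at this point docs holds token lists, so I rejoin them for display):

# sort by similarity, best matches first
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, score in ranked[:5]:
    print(score, ' '.join(docs[doc_position]))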
I have to add that this experiment was not fully successful on my computer at first. The second-to-last line caused a crash in my Python kernel; the most likely culprit is num_features, which in the tutorials I was mixing and matching from is sometimes set to the number of documents rather than the size of the vocabulary (the version above uses len(dictionary), which should be the correct value). More on this later on…