This is the first of a series of articles dedicated to Python libraries for scientific use.
The first library I would like to introduce is closely related to my previous post about TF-IDF. In the vast world of Python there is a library, gensim, which contains, among other things, a set of tools to implement a TF-IDF model.
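As a quick refresher, TF-IDF weighs a term by how often it appears in a document, discounted by how many documents contain it. Here is a minimal sketch of the classic weighting in plain Python (gensim's default, if I read the docs right, uses a base-2 logarithm and then normalizes the resulting vectors):

from math import log

def tfidf_weight(tf, df, num_docs):
    # term frequency times log of the inverse document frequency;
    # a term appearing in every document gets weight 0
    return tf * log(float(num_docs) / df, 2)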
Let’s start by installing the library (a plain pip install gensim should do it) and importing something from it:
from gensim import corpora, models, similarities
This imports the most important parts of gensim. As I do not actually know much about Python and am exploring it as a tool to learn Data Science, I would also like to do this experiment while reading from a file. I have a file containing a list of MOOC titles, more or less like this (I won’t post the whole file):
....
Machine learning techniques
yoga: body and mind
database knowledge
cognitive neuroscience
tv
Macro Economics
MCMC simulation
hadoop beginner
innovations in data science
metallurgy
applications data mining in health
business intelligence prediction
agile methodologies
quantum computing
master data management
Microeconomic
data structures and algorithms
history of the united states
genetic algorithms
learn to play the guitar
...
So, I start from this one as the basis for my experiment.
# File with a list of mooc titles
f = open('/home/antonio/moocs-list.txt', 'r')

# Empty list to work with the mooc titles (my documents)
docs = []
num_documents = 0

# Add all the mooc titles to the collection
for line in f:
    docs.append(line)
    num_documents += 1

f.close()
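A small aside: the more idiomatic way to read a file in Python is the with statement, which closes the file automatically even if something goes wrong in the middle. Something like this should be equivalent:

# same result, but the file is closed automatically
with open('/home/antonio/moocs-list.txt', 'r') as f:
    docs = [line for line in f]
num_documents = len(docs)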
I also want to remove all so-called “stopwords” from the list.
# Create a list of stopwords
stoplist = set('for a of the and to in'.split())

# Purge all stopwords from the corpus
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in docs]

# Build the dictionary (the mapping between words and integer ids)
docs = [[token for token in text] for text in texts]
dictionary = corpora.Dictionary(docs)

# store the dictionary to file for future reference (I love python's handling of files!)
dictionary.save('/home/antonio/moocs-list-dict')

# Verify what is in there!
print(dictionary.num_docs)
print(num_documents)

# This is the main object, the corpus
corpus = [dictionary.doc2bow(text) for text in docs]

# store to disk, for later use
corpora.MmCorpus.serialize('/home/antonio/moocs-list-serial', corpus)
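To get a feeling for what the dictionary holds, you can also print the word-to-id mapping (the ids below are just for illustration; they depend on the order in which gensim encounters the words):

# peek at the mapping between tokens and their integer ids
print(dictionary.token2id)
# something like: {'machine': 0, 'learning': 1, 'techniques': 2, ...}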
Now I would like to see what the matching vectors look like!
new_doc = "macro economics"
new_vec = dictionary.doc2bow(new_doc.lower().split())

# both words have been found and this is the resulting vector...
print(new_vec)

new_doc2 = "elvis presley"
new_vec2 = dictionary.doc2bow(new_doc2.lower().split())

# no words will be found, so the vector is empty...
print(new_vec2)
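Each pair in the printed vector is (word id, count). If, like me, you find the raw ids hard to read, the dictionary can translate them back:

# map the ids in the sparse vector back to the actual words
print([(dictionary[word_id], count) for word_id, count in new_vec])
# something like: [('macro', 1), ('economics', 1)]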
This is not enough. I now want to match a query against the whole corpus, using gensim’s similarity tools:
# Let's play with tfidf, we transform corpus...
tfidf = models.TfidfModel(corpus)

# building up the index; num_features must be the size of the
# vocabulary (len(dictionary)), not the number of documents
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# save the index for later use
index.save('/home/antonio/moocs-list.index')

# the query (I use the first of the two vectors created before)...
vec_tfidf = tfidf[new_vec]

sims = index[vec_tfidf]
print(list(enumerate(sims)))
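The sims array holds one similarity score per document, in corpus order. To actually see the best matches it helps to sort them; a small sketch (remember that at this point docs holds token lists, so I rejoin them for display):

# sort by similarity, best matches first
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, score in ranked[:5]:
    print(score, ' '.join(docs[doc_position]))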
I have to add that this experiment was not fully successful on my computer at first. The second-to-last line caused a crash in my Python kernel; the most likely culprit is num_features, which in the tutorials I was mixing and matching from is sometimes set to the number of documents rather than the size of the vocabulary (the version above uses len(dictionary), which should be the correct value). More on this later on…