TF-IDF: Playing with Python’s Gensim


This is the first in a series of articles dedicated to Python’s libraries for scientific use.

The first library I would like to introduce is closely related to my previous post about TF-IDF. In the vast world of Python there is a library, gensim, which contains, among other things, a set of tools for implementing a TF-IDF model.

Let’s start by installing the library and importing something from it:

This should cover the most important parts of gensim. As I do not actually know much about Python and am exploring it as a tool to learn data science, I would also like to run this experiment while reading the input from a file. I have a file containing a list of MOOC titles, more or less like this (I won’t post the whole file):

....
Machine learning techniques
yoga: body and mind
database knowledge
cognitive neuroscience
tv
Macro Economics
MCMC simulation
hadoop beginner
innovations in data science
metallurgy
applications data mining in health
business intelligence prediction
agile methodologies
quantum computing
master data management
Microeconomic
data structures and algorithms
history of the united states
genetic algorithms
learn to play the guitar
...

I use this file as the basis for my experiment.
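Reading the titles from a file could look roughly like this. The helper name `read_titles` is hypothetical, and the demo writes a throwaway temporary file with a few of the titles above so the snippet runs on its own.

```python
import os
import tempfile

def read_titles(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

# Demo: a temporary file stands in for the real list of titles.
sample = "Machine learning techniques\nyoga: body and mind\n\nMacro Economics\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)

titles = read_titles(tmp.name)
os.remove(tmp.name)
print(titles)
```

With the real file, only the call to `read_titles` is needed, pointed at wherever the list is stored.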

I also want to remove all so-called “stopwords” from the list.
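A small sketch of the stopword step in plain Python. The stop list here is a tiny illustrative one that I made up for the example, and the `tokenize` helper is hypothetical; a real list would be much longer.

```python
import re

# A tiny, hand-written stop list for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

def tokenize(title):
    """Lowercase a title, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [word for word in words if word not in STOPWORDS]

titles = [
    "yoga: body and mind",
    "history of the united states",
    "learn to play the guitar",
]
texts = [tokenize(title) for title in titles]
print(texts)
```

The regular expression also strips punctuation such as the colon in "yoga:", so that "yoga" and "yoga:" do not end up as two different tokens.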

Now I would like to see what the matching vectors look like!

This is not enough. I now want to match

I have to add that this experiment is not fully successful on my computer. The second-to-last line crashes my Python kernel, but I think it should be more or less correct, as I was mixing and matching from the library’s various tutorials. More on this later…
