This article is about scraping a web site in order to generate a wordcloud from its text. The wordcloud visualization of text is becoming more and more popular, and you now find it even in mainstream TV shows, where some presenters have had the idea of collecting the frequency of terms in tweets about football, for instance, or during an electoral campaign, to understand and present to the general public what the theme of the day was among the fans. Here the purpose is slightly different: I am trying to scrape my own blog (this one) to understand how I am doing in my quest to learn as much as possible about Data Science, and to compare the resulting image with what I have in mind. The concept, however, works with any web site.
Before I start with the technical content, I have to say that, as usual, I will be working in Python, and that the article is based on some very nice pieces of work by other people, from whom I have borrowed with both hands:
- Chapter 9 of the book Data Science From Scratch by Joel Grus, which I reviewed earlier, and in which the principles of web scraping and the BeautifulSoup library are explored and explained.
- Sebastian Raschka's article on getting a wordcloud out of Twitter.
- Andreas Mueller’s article and code.
- The amazing guys at Stack Overflow and Ask Ubuntu who came up with the solution to my PIL library blues (I was unable to install it until I read that one). This library really needs some love.
Step 1: Scrape some web page
Here is how we get some useful bare text from any web site. Before you do this, as Joel Grus mentions in his book, you have to check that you are allowed to do so and under what terms/APIs:
    from mechanize import Browser
    from bs4 import BeautifulSoup

    mech = Browser()
    url = "http://antonio-ferraro.eu.pn"
    page = mech.open(url)
    html = page.read()

    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, "html.parser")

    # Here I get all the paragraphs
    matches = soup.select("p")

    # Here I get all the titles
    matches2 = soup.select("title")
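The permission check mentioned above can be partly automated, since many sites publish a robots.txt file saying which paths crawlers may fetch. The following is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs and the 'my-scraper' agent name are made up for illustration, and a robots.txt check does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here
parser.parse(robots_txt.splitlines())

# 'my-scraper' is an invented user-agent name
allowed_home = parser.can_fetch('my-scraper', 'http://example.com/')
allowed_private = parser.can_fetch('my-scraper', 'http://example.com/private/page.html')

print(allowed_home, allowed_private)
```

Against a live site, calling parser.set_url() with the robots.txt address and then parser.read() fetches the real file instead of parsing a string.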
In order to get all the posts, I temporarily increased the number of posts displayed on the page, to be sure of getting enough content. At this point, I save the resulting text in a file. This allows me to come back at a later stage and try a different visualization. I am working in IPython notebooks, so I split the program into cells, just as I am doing in this post.
    # Save the scraped text as UTF-8 so it can be reused later.
    # One entry per line keeps the last word of a paragraph apart from the next one.
    with open('workfile', 'w', encoding='utf-8') as f:
        # Paragraphs
        for match in matches:
            f.write(match.getText() + '\n')
        # Titles
        for match in matches2:
            f.write(match.getText() + '\n')
The problem here was dealing with Unicode text. I am not going to give you the details, but it is really something to consider when working in Python. This and the PIL library blues I mentioned above made me like Python a touch less. 😉 Only joking!
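As a rough illustration of the kind of issue involved (the sample string is invented, and this is not necessarily the exact error hit here): text has to be encoded to bytes before it reaches a file, and a narrow encoding such as ASCII cannot represent every character that turns up in scraped pages.

```python
# Invented sample containing a curly apostrophe and an accented letter
sample = 'Mueller\u2019s caf\u00e9'

# UTF-8 can represent any character, using more than one byte where needed
utf8_bytes = sample.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# ASCII cannot represent the two special characters, so encoding fails
try:
    sample.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(sample), len(utf8_bytes), round_trip == sample, ascii_ok)
```

Naming the encoding explicitly when opening the file avoids depending on the platform default.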
At this point (in reality, to tell the whole story, this was the next evening), let us do something with the collected text.
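A wordcloud is essentially a picture of word frequencies, so a quick count is a useful sanity check before drawing anything. Here is a minimal sketch, assuming a tiny hand-made stop-word list and a sample sentence instead of the real contents of 'workfile'. Both are illustrative, the helper name word_frequencies is hypothetical, and the wordcloud library's own tokenisation may differ.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; wordcloud's STOPWORDS set is much longer
stopwords = {'the', 'a', 'of', 'and', 'is', 'in', 'to'}


def word_frequencies(text, stopwords):
    """Count lower-cased words, skipping the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Invented sample text standing in for the scraped blog posts
sample = "Python is the language of data science and the science of data"
freq = word_frequencies(sample, stopwords)

# The most frequent words are the ones a cloud would draw largest
print(freq.most_common(2))
```

The wordcloud in the next cell relies on counts of this kind to decide how large each word is drawn.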
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt

    # Read the saved text back in as one string
    with open('workfile', 'r', encoding='utf-8') as f:
        words = f.read()

    # Make the word cloud (I use a font that I like)
    wordcloud = WordCloud(
        font_path='CabinSketch-Bold.ttf',
        stopwords=STOPWORDS,
        background_color='black',
        max_words=500,
        width=1800,
        height=1400
    ).generate(words)

    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig('./cloud2.png', dpi=300)
    plt.show()
This generates the following wordcloud (I show it in small size):
OK, but I have seen wordclouds in all sorts of shapes. How do we do this? Easy, we create a mask!
    # IPython magic: show plots inside the notebook
    %matplotlib inline

    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image
    from wordcloud import WordCloud, STOPWORDS

    # scipy.misc.imread is gone from recent SciPy releases;
    # Pillow plus NumPy reads the image into an array instead
    ellipse_mask = np.array(Image.open("ellipse.png"))

    # With a mask, pure white pixels are left empty and the cloud
    # takes the mask's shape, so width and height are ignored
    wordcloud = WordCloud(
        font_path='CabinSketch-Bold.ttf',
        stopwords=STOPWORDS,
        background_color='white',
        mask=ellipse_mask,
        max_words=500,
        width=1800,
        height=1400
    ).generate(words)

    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig('./cloud3.png', dpi=300)
    plt.show()
Here comes my elliptical wordcloud!
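The ellipse.png file used as the mask is not shown in this post. One possible way to produce such a mask is sketched below. The function names make_ellipse_mask and save_mask are hypothetical helpers, and the sketch assumes that wordcloud leaves pure white (255) pixels empty, which is how its mask parameter is documented to behave.

```python
def make_ellipse_mask(width, height):
    """Return rows of pixel values: 0 inside the ellipse, 255 outside.

    Pure white (255) is assumed to mean "masked out", so words would be
    drawn only where the value is 0.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # centre of the image
    rx, ry = width / 2.0, height / 2.0              # horizontal and vertical radii
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Standard ellipse equation: points with value <= 1 lie inside
            inside = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
            row.append(0 if inside else 255)
        rows.append(row)
    return rows


def save_mask(rows, path):
    """Write the mask as a greyscale PNG (needs NumPy and Pillow)."""
    import numpy as np
    from PIL import Image
    Image.fromarray(np.array(rows, dtype=np.uint8)).save(path)


# Small example: the centre is drawable, the corners are masked out
small = make_ellipse_mask(9, 5)
print(small[2][4], small[0][0])
```

Calling save_mask(make_ellipse_mask(1800, 1400), 'ellipse.png') would write a file of the same size as the width and height used in the cells above. Any image editor that can draw a black ellipse on a white background does the same job.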