Web scraping in different languages: Stopwords

Dante Alighieri and the wordcloud from his “Divine Comedy”

What if you want to scrape a web page in a language other than English and generate a wordcloud from it, as I explained in my previous post? You will find that something is missing: the STOPWORDS used in that example are English and will not help you. You therefore need to supply a language-specific set. In Python, you can do this by installing the stop-words package, which supports many Western languages. Then you use it like this:

from stop_words import get_stop_words
stop_words = get_stop_words('it')
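
To make clear what such a list is used for, here is a minimal sketch of stop-word filtering. It uses only the standard library, and SAMPLE_STOP_WORDS is a tiny hand-written stand-in (an assumption for illustration), not the real output of get_stop_words('it'), which is much longer. The helper name remove_stop_words is also hypothetical.

```python
import re

# Hypothetical stand-in for get_stop_words('it'); the real list is much longer.
SAMPLE_STOP_WORDS = ['nel', 'del', 'di', 'mi', 'per', 'una', 'nostra']


def remove_stop_words(text, stop_words):
    """Return the lowercased words of `text` that are not stop words."""
    blocked = set(stop_words)  # set membership tests are fast
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in blocked]


line = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
content_words = remove_stop_words(line, SAMPLE_STOP_WORDS)
print(content_words)
```

Only the words that carry meaning survive the filter, which is what you want the wordcloud to show.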

However, there is something different to consider here. The stop_words list you get contains common words that can supposedly be removed without losing any of the concepts in a document written in the current form of the language. But what if the text dates back about 700 years, as with Dante Alighieri? Or about 400 years, as with William Shakespeare? One possible solution is to look at the text and decide for yourself which additional stop words are needed. Then you just add them to your original list like this:

stop_words = stop_words + [
    u'quel', u'quando', u'tanto', u'de', u'poi', u'qual', u'pi', u'tal',
    u'prima', u'co', u'g', u's', u'n', u'gi', u'cos', u'son', u'canto',
]

Here “canto” refers to a section of the poem, much like a chapter, and it is by far the most common word. That is why I removed it, but this choice is entirely subjective. Notice that in this version of the cloud generation I am passing the wordcloud generator a list and not a set as in the previous article. Also notice that all strings are Unicode literals (the u prefix); in Python 3 every string is already Unicode, so the prefix is optional there.
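
One way to spot candidates such as “canto” is to count the most frequent words that the standard list does not already cover. The sketch below is only an illustration under assumptions: candidate_stop_words is a hypothetical helper, and the sample text is an invented toy string rather than a quotation from the poem.

```python
from collections import Counter
import re


def candidate_stop_words(text, stop_words, top_n=5):
    """Return the top_n most frequent words not already in stop_words."""
    known = set(stop_words)
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(token for token in tokens if token not in known)
    return counts.most_common(top_n)


# Invented toy text, not a quotation; it only mimics repeated headings.
sample_text = "canto primo selva oscura canto secondo giorno canto terzo selva"
sample_stop_words = ['primo', 'secondo', 'terzo']

candidates = candidate_stop_words(sample_text, sample_stop_words, top_n=2)
print(candidates)
```

The counts only suggest candidates. Whether a frequent word is noise or content remains a judgment call.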

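from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# scipy.misc.imread was removed from recent SciPy releases,
# so the mask image is loaded with Pillow and converted to an array.
dante_mask = np.array(Image.open("Dante.png"))

# `words` holds the scraped text (see the previous post);
# generate() expects a single string.
wordcloud = WordCloud(
    font_path='CabinSketch-Bold.ttf',
    stopwords=stop_words,
    background_color='black',
    mask=dante_mask,
    max_words=500,
    width=670,    # note: width and height are ignored when a mask is given
    height=1000
).generate(words)

# Show the cloud without axes and save it at high resolution.
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./dante_cloud.png', dpi=300)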
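
According to the wordcloud documentation, pure white pixels in the mask are treated as masked out, and words are drawn on all other pixels. The toy sketch below illustrates that rule with plain nested lists instead of a real image array. The drawable_fraction helper and the 4x4 mask are assumptions made for illustration, and the check is only a quick way to notice an inverted mask.

```python
WHITE = 255  # pixel value treated as "masked out"


def drawable_fraction(mask_rows):
    """Return the share of pixels that are not pure white."""
    pixels = [value for row in mask_rows for value in row]
    if not pixels:
        raise ValueError("mask is empty")
    drawable = sum(1 for value in pixels if value != WHITE)
    return drawable / len(pixels)


# Toy 4x4 grayscale mask: a dark 2x2 shape on a white background.
toy_mask = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]

fraction = drawable_fraction(toy_mask)
print(fraction)
```

If the fraction is close to 1 for a silhouette image, the shape and background are probably swapped, and the words would fill the background instead of the profile.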

Finally, I manually cut the wordcloud along the profile of Dante and made the cut-out area transparent.

Data Science day by day

Blogging while building a new skillset
