Web scraping in different languages: Stopwords

 Dante Alighieri and the wordcloud from his “Divine Comedy”

What if you want to scrape a web page in a language other than English and generate the wordcloud from the page, as I explained in my previous post? As you will find out, something is missing: the STOPWORDS used in that example are English words and will not help you. You therefore need to supply a language-specific set. In Python, you do this by installing the stop-words package, which supports many Western languages. Then you use it like this:
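A minimal sketch of that step, assuming the package was installed with `pip install stop-words` and that Italian is the language you need. The short fallback list in the `except` branch is only a stand-in so the snippet still runs when the package is missing; it is not part of the package.

```python
# Load a language-specific stop-word list.
try:
    # Provided by the stop-words package (pip install stop-words).
    from stop_words import get_stop_words

    stop_words = get_stop_words("italian")
except ImportError:
    # Assumption: a tiny hand-written stand-in, only so the sketch runs
    # without the package. The real list is much longer.
    stop_words = ["di", "e", "il", "la", "che", "in", "un", "per"]

# The result is a plain Python list of strings.
print(len(stop_words), "stop words loaded")
print(stop_words[:5])
```

The list can then be passed to the wordcloud generator in place of the English STOPWORDS.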

However, there is something else to consider here. The stop_words list you get back contains common words whose removal supposedly strips no meaningful terms from a document written in the current form of the language. But what if the text dates back, say, some 700 years, as in the case of Dante Alighieri? Or even 400 years, think of William Shakespeare? Well, one possible solution is to look at the text and decide for yourself which additional stop-words are needed. Then you just add them to your original list like this:
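A rough sketch of the idea, written for Python 3. Apart from “canto”, the extra words below are illustrative guesses rather than a vetted list, the short `stop_words` list is a hand-picked stand-in for the one returned by the stop-words package, and `remove_stop_words` is a hypothetical helper that only shows the effect of the combined list.

```python
import re

# Stand-in for the list returned by the stop-words package; a short
# hand-picked sample so the sketch runs on its own.
stop_words = ["di", "del", "nel", "mi", "per", "una", "che", "la", "e", "i"]

# Extra stop-words chosen by reading the text. "canto" comes from the
# chapter headings; the others are hypothetical examples.
extra_stop_words = ["canto", "sì", "ch", "onde"]

# Keep a list (not a set) and skip words that are already present.
stop_words = stop_words + [w for w in extra_stop_words if w not in stop_words]


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into words, and drop the stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    blocked = set(stop_words)  # set lookup is faster than scanning the list
    return [t for t in tokens if t not in blocked]


sample = "Canto I. Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
kept = remove_stop_words(sample, stop_words)
print(kept)
```

In Python 3 every `str` is already a Unicode string, so no `u"..."` prefix is needed. Assuming the wordcloud package from the previous post, the combined list would go to the generator through its `stopwords` argument.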

Here “canto” means chapter, and it is by far the most common word, which is why I removed it, but this is totally subjective. Notice that in this version of the cloud generation I am passing the wordcloud generator a list and not a set, as I did in the previous article. Also notice that all the strings are Unicode strings.

Here I manually cut the wordcloud around the profile of Dante and made the cut-out transparent.