This post follows a few older ones I have published on Word Clouds, all of which were based on Python code. I like word clouds, not for the information they can communicate, but just for their aesthetics. Until now I was a bit disappointed, because in R I could not get the same results that I could easily pull out with a few lines of Python. Finally I found the excellent wordcloud2 library by Lchiffon, which brings R on par with Python, at least in my view. Here I am also exploring the tm package, to get a bit familiar with it.
So, let us see. I am going to pick a document and make a Word Cloud out of it. For this particular post, I am picking the US Declaration of Independence. I have copied the main text into a file called “independence.txt”, which I store in a directory called “text”. This is my only assumption. So, let us load it, clean it and generate a Term Document Matrix out of it.
First, we load the libraries, set the options we need and define the path to the text file(s) we want to include:
```r
# Init required libraries
libs <- c("tm", "wordcloud2")
lapply(libs, require, character.only = TRUE)

# Set options
options(stringsAsFactors = FALSE)

# The working directory ...
pathname <- "./text"
```
Then I write a couple of functions that could also be useful on other occasions: one to clean a corpus and one to generate a Term Document Matrix. The previous article on the tutorial by Tim D’Auria was the inspiration for doing things this way.
```r
# Clean corpus
cleanCorpus <- function(corpus) {
  # tolower is a base R function, so wrap it in content_transformer()
  # to keep the documents as tm text documents
  corpus.tmp <- tm_map(corpus, content_transformer(tolower))
  # Remove stop words while apostrophes are still in place ("don't", "isn't")
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, removePunctuation)
  # Collapse the gaps left behind by the removed words
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}

# Build TDM
generateTDM <- function(path) {
  # Read every file in the directory into a corpus
  s.cor <- VCorpus(DirSource(directory = path, encoding = "UTF-8"))
  s.cor.cl <- cleanCorpus(s.cor)
  TermDocumentMatrix(s.cor.cl)
}
```
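To get a feeling for what the cleaning and counting steps do, here is a minimal sketch in base R only, with no tm involved. The sample sentence and the tiny stop-word list are made up for illustration; the list returned by `stopwords("english")` is much longer, and tm's tokenization differs in its details.

```r
# Toy illustration in base R; the real pipeline above uses tm.
# Sample text written in the style of the document (an assumption).
text <- "He has refused his Assent to Laws. He has forbidden his Governors to pass Laws."

# Hypothetical mini stop-word list, only for this example
stop_words <- c("he", "has", "his", "to")

clean <- tolower(text)                          # lowercase everything
clean <- gsub("[[:punct:]]", "", clean)         # drop punctuation
tokens <- strsplit(clean, "[[:space:]]+")[[1]]  # split on whitespace
tokens <- tokens[!tokens %in% stop_words]       # drop stop words

# Count how often each remaining term occurs
term_counts <- table(tokens)
print(term_counts)
```

A Term Document Matrix is essentially this kind of count table, with one column per document in the corpus.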
Finally, I obtain a TDM and, after a bit of manipulation to shape the data into a format that wordcloud2 accepts, I get two visualizations:
```r
# Get the Term Document Matrix
TDM <- generateTDM(pathname)

# Do I need a matrix or a DF? We will see...
TDMasMatrix <- as.matrix(TDM)
TDMasDF <- data.frame(TDMasMatrix)
TDMasDF$words <- rownames(TDMasDF)

# Column headers (assumes a single document, hence a single count column)
colnames(TDMasDF) <- c("freq", "word")

# Sorting (not really needed)
TDMasDF <- TDMasDF[order(-TDMasDF$freq), ]

# Column order is important for wordcloud2
TDMasDF <- TDMasDF[, c("word", "freq")]

# The results!
# wordcloud2(TDMasDF, size = 0.4, figPath = "./liberty.png", backgroundColor = "black", fontFamily = "Loma")
# The one above uses a custom shape but it does not keep the x/y proportions
wordcloud2(TDMasDF, size = 0.3, shape = "star", backgroundColor = "black", fontFamily = "Loma")
letterCloud(TDMasDF, word = "USA", size = 0.3, fontFamily = "Loma", backgroundColor = "black")
```
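If the reshaping step looks a bit opaque, the following sketch replays it on a toy one-column matrix in base R. The terms and counts are invented for illustration; the point is only to show the data frame ending up with `word` first and `freq` second, which is the column order the code above relies on.

```r
# Toy one-document term matrix: terms as row names, one column of counts.
# Terms and counts are made up for this example.
toy_matrix <- matrix(c(3, 7, 1), ncol = 1,
                     dimnames = list(c("laws", "people", "king"), "doc1"))

toy_df <- data.frame(toy_matrix)       # one column named "doc1"
toy_df$word <- rownames(toy_df)        # copy the terms into a real column
colnames(toy_df) <- c("freq", "word")  # rename: counts first, terms second

# Sort by descending frequency and put the columns in word/freq order
toy_df <- toy_df[order(-toy_df$freq), c("word", "freq")]
rownames(toy_df) <- NULL
print(toy_df)
```

The same steps applied to the real matrix give the data frame that is passed to `wordcloud2()` and `letterCloud()`.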
The first call, to wordcloud2, draws a classic word cloud. I set the shape to a star after playing a bit with the parameters (shape, backgroundColor, size and fontFamily) to get a pleasant result:
The second is a call to the interesting letterCloud function, which creates a word cloud in the shape of a letter or a word. More interestingly, these are active visualizations that respond to mouse hover events by showing the frequency of the term under the mouse (not the ones embedded in this post, which are just screenshots).
So, did you like it? Then go on and have fun with your own Word Clouds!