NLP: Language Detection in R – Data Science day by day

I have played a bit with two language detection libraries in R, without going too deeply into the details of how they work. These are:

textcat
cldr

The second package does not seem to be actively maintained, as the last update (version 1.1.0) is now over three years old. It can, however, still be obtained and installed on current R versions using the following commands:

#install from archive
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile<-"cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs=pkgFile, type="source", repos=NULL)
unlink(pkgFile)
# or if you have devtools installed:
# devtools::install_version("cldr",version="1.1.0")

textcat is actively maintained and can be installed in the usual way with install.packages("textcat"). It ships language profiles for 75 languages, and some of the algorithms at its core have been used inside SpamAssassin. Its behaviour can be tuned through the choice of language profiles and various options, among them the selection of a distance function.
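As a small illustration of those options, the sketch below runs textcat on two of the sample sentences with two different distance functions. It assumes the textcat package is installed; the method names "CT" and "cosine" and the TC_char_profiles profile set are taken from the package documentation as I remember it, so check ?textcat_xdist for the list your version supports. The name sample_text is just a placeholder for this example.

```r
# Two short sample sentences (placeholder name, used only in this sketch)
sample_text <- c("I live in the countryside",
                 "Das ist ein deutscher satz.")

# Only run if textcat is available, so the snippet never fails on a bare R install
if (requireNamespace("textcat", quietly = TRUE)) {
  # ASSUMPTION: "CT" and "cosine" are valid values for `method`
  methods <- c("CT", "cosine")
  results <- data.frame(text = sample_text, stringsAsFactors = FALSE)
  for (m in methods) {
    # One column per distance function, one row per sentence
    results[[m]] <- unname(
      textcat::textcat(sample_text, p = textcat::TC_char_profiles, method = m)
    )
  }
  print(results)
}
```

Comparing the columns side by side is a quick way to see whether the choice of distance function changes the detected language for a given sentence.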

As far as I understand, both packages detect the language from n-gram samples; cldr also takes hints from the document encoding. cldr derives from the language detection code in Google Chrome (CLD, the Compact Language Detector), which was ported to other languages such as Python once Chrome's source code was opened. The R port is still available on GitHub but, as I mentioned above, it seems dead, which is a pity. CLD2 can detect up to 80 languages and requires input in UTF-8.
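To give a rough feeling for the n-gram idea, here is a minimal base R sketch of a rank-based ("out-of-place") comparison of character n-gram profiles. It is not the implementation of either package: the function names, the tiny training strings, and the profile size are all made up for illustration, and real profiles are built from far larger corpora.

```r
# All character n-grams of a string, for each n-gram length in `n`
char_ngrams <- function(text, n = 1:3) {
  chars <- strsplit(tolower(text), "")[[1]]
  grams <- character(0)
  for (k in n) {
    if (length(chars) < k) next
    starts <- seq_len(length(chars) - k + 1)
    grams <- c(grams, vapply(starts, function(i) {
      paste(chars[i:(i + k - 1)], collapse = "")
    }, character(1)))
  }
  grams
}

# A profile is the list of n-grams ordered from most to least frequent
ngram_profile <- function(text, n = 1:3, size = 300) {
  counts <- sort(table(char_ngrams(text, n)), decreasing = TRUE)
  head(names(counts), size)
}

# Out-of-place distance: sum of rank differences between the two profiles;
# n-grams missing from the language profile get the maximum penalty
out_of_place <- function(doc_profile, lang_profile) {
  pos <- match(doc_profile, lang_profile)
  max_penalty <- length(lang_profile)
  penalty <- ifelse(is.na(pos), max_penalty,
                    abs(pos - seq_along(doc_profile)))
  as.numeric(sum(penalty))
}

# Pick the language whose profile is closest to the document profile
detect_language <- function(text, profiles) {
  doc <- ngram_profile(text)
  d <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1))
  names(which.min(d))
}

# Toy training data (hypothetical, far too small for real use)
training <- c(
  english = "this is an english sentence and i live in the countryside with my family where the weather is often nice",
  german  = "das ist ein deutscher satz und ich wohne auf dem land mit meiner familie wo das wetter oft schoen ist"
)
profiles <- lapply(training, ngram_profile)

detect_language("Das ist ein deutscher satz.", profiles)
```

With very short inputs the document profile contains only a handful of n-grams, which hints at why tweets and chat messages are a hard case for this family of methods.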

My interest is to find out how good these packages are at detecting the language of very short sentences, such as tweets or chat messages. I have therefore prepared a short list of sentences:

documents <- c("Rechtdoor gaan, dan naar rechts.",
"Kemal Kılıçdaroğlu Doğan TV Center'da",
"I live in the countryside",
"Questa frase non è scritta in Napoletano.",
"Das ist ein deutscher satz.",
"La vie est magnifique",
"El jugador está predispuesto a que será un partido complicado.",
"Καιρό έχουμε να τα πούμε!",
"Jar kan ikke snakke Norsk")

I passed these directly to the two packages, starting with their simplest form of language detection call.

library(cldr)
library(textcat)

# cldr
detectLanguage(documents)
# textcat
textcat(documents)

The results of the two calls are as follows:

cldr returns a data frame with one row per document (a sentence, in this case) and a total of 13 variables
textcat returns a character vector containing the name of the detected language for each document submitted
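Since the two return types differ, a small helper makes them easier to compare. The sketch below is only an illustration: compare_detections is a hypothetical function, and the column name detectedLanguage is my assumption about the cldr data frame, so check names(detectLanguage(documents)) before relying on it. Labels are lower-cased in case the two packages capitalise language names differently.

```r
# Put both results side by side, one row per document.
# ASSUMPTION: the cldr data frame stores its best guess in `detectedLanguage`.
compare_detections <- function(docs, cld_result, textcat_result,
                               cld_column = "detectedLanguage") {
  stopifnot(cld_column %in% names(cld_result),
            nrow(cld_result) == length(docs),
            length(textcat_result) == length(docs))
  cld_lang <- tolower(as.character(cld_result[[cld_column]]))
  tc_lang  <- tolower(as.character(textcat_result))
  data.frame(
    text    = docs,
    cldr    = cld_lang,
    textcat = tc_lang,
    # TRUE only when both packages returned a label and the labels match
    agree   = !is.na(cld_lang) & !is.na(tc_lang) & cld_lang == tc_lang,
    stringsAsFactors = FALSE
  )
}

# Run only when both packages and the `documents` vector are available
if (exists("documents") &&
    requireNamespace("cldr", quietly = TRUE) &&
    requireNamespace("textcat", quietly = TRUE)) {
  print(compare_detections(documents,
                           cldr::detectLanguage(documents),
                           textcat::textcat(documents)))
}
```

The agree column only flags identical labels, so near-misses between closely related languages still need a manual look.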

Here is the table with my results:

Apart from textcat having an issue with Turkish, which I tried unsuccessfully to resolve using more complex options (selection of distance algorithms), the two libraries behave in substantially the same way on these sample sentences. This changes when running more tests with shorter sentences: there cldr behaves slightly better, which is why I think it is a pity that this package is no longer maintained.
