NLP: Language Detection in R

I have played a bit with two language detection libraries in R, without going too deep into the details of how they work. These are textcat and cldr.

The second package, cldr, does not seem to be actively maintained: its last update (version 1.1.0) is now over three years old. It can, however, still be obtained and installed on current R versions.
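One way to install a package that is no longer on CRAN is to download its source tarball and install it locally. The sketch below assumes that the 1.1.0 tarball still sits in the CRAN archive under the usual `src/contrib/Archive` layout (the URL is an assumption, so check it before relying on it) and that the tools needed to build packages from source are available. The helper names `cldr_archive_url` and `install_archived_cldr` are hypothetical.

```r
# Hypothetical helper: build the archive URL for a given cldr version.
# The URL layout is an assumption based on the usual CRAN archive structure.
cldr_archive_url <- function(version = "1.1.0") {
  sprintf(
    "https://cran.r-project.org/src/contrib/Archive/cldr/cldr_%s.tar.gz",
    version
  )
}

# Hypothetical helper: download the tarball and install it from source.
install_archived_cldr <- function(version = "1.1.0") {
  pkg_file <- file.path(tempdir(), sprintf("cldr_%s.tar.gz", version))
  download.file(url = cldr_archive_url(version), destfile = pkg_file)
  install.packages(pkgs = pkg_file, repos = NULL, type = "source")
  unlink(pkg_file)  # remove the downloaded tarball afterwards
  invisible(TRUE)
}

# Run once, interactively:
# install_archived_cldr()
```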

Textcat is actively maintained, and can be installed in the usual way with install.packages("textcat"). It has language profiles for 75 languages, and some of the algorithms at its core have been used inside SpamAssassin. Its behaviour can be tuned through the choice of language profiles and through various options, including the selection of a distance function.
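As a minimal sketch of these options, the snippet below calls textcat once with its defaults and once with a different profile set and distance method. It assumes textcat is installed; `TC_byte_profiles` and the `"Dice"` method are taken from the package documentation as I remember it, and the sample text is made up for illustration.

```r
# Assumes the package is installed: install.packages("textcat")
sample_text <- "This is a short English sentence used only as an illustration."

if (requireNamespace("textcat", quietly = TRUE)) {
  # Default call: the package's default profiles and distance method
  default_guess <- textcat::textcat(sample_text)

  # Same text, with an explicit profile set and a different distance method
  byte_guess <- textcat::textcat(
    sample_text,
    p = textcat::TC_byte_profiles,
    method = "Dice"
  )

  print(c(default = default_guess, bytes_dice = byte_guess))
}
```

Running both calls on the same input is a quick way to see whether the choice of profiles or distance changes the answer for a given text.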

As far as I have understood, both packages are based on algorithms that detect the language from n-gram samples; cldr also takes hints from the document encoding. cldr derives from the language detector in the Google Chrome source code, which was ported to other languages such as Python once Chrome's source code was open sourced. The R port is still available on GitHub but, as I mentioned above, it seems dead, which is a pity. CLD2 can detect up to 80 languages and requires input in UTF-8.
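To give an idea of what n-gram based detection means, here is a toy sketch in base R of a rank-order ("out-of-place") comparison of character n-gram profiles, in the spirit of the Cavnar and Trenkle approach. It is not the code of either package: the function names, the fixed penalty of 300 and the tiny training texts are all assumptions made for illustration.

```r
# Rank the most frequent character n-grams (n = 1..n_max) in a text.
ngram_profile <- function(text, n_max = 3, top = 300) {
  chars <- strsplit(paste0(" ", tolower(text), " "), "")[[1]]
  grams <- character(0)
  for (n in seq_len(n_max)) {
    if (length(chars) < n) break
    starts <- seq_len(length(chars) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(chars[i:(i + n - 1)], collapse = ""),
      character(1)
    ))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  names(counts)[seq_len(min(top, length(counts)))]
}

# Out-of-place distance: sum of rank differences, with a fixed penalty
# for n-grams that do not appear in the reference profile.
out_of_place <- function(doc_profile, ref_profile, penalty = 300) {
  pos <- match(doc_profile, ref_profile)
  sum(ifelse(is.na(pos), penalty, abs(pos - seq_along(doc_profile))))
}

# Toy training texts (hypothetical, far too small for real use)
training <- c(
  english = "the quick brown fox jumps over the lazy dog and then the dog runs away from the fox",
  italian = "la volpe veloce salta sopra il cane pigro e poi il cane scappa via dalla volpe",
  german  = "der schnelle braune fuchs springt und der hund rennt dann weg von dem fuchs"
)
profiles <- lapply(training, ngram_profile)

# Pick the language whose profile is closest to the profile of the text.
guess_language <- function(text) {
  doc <- ngram_profile(text)
  distances <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1))
  names(which.min(distances))
}

guess_language("the dog and the fox")
guess_language("il cane e la volpe")
```

The shorter the input, the fewer n-grams there are to rank, which is one reason very short texts are hard for this family of methods.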

My interest is in finding out how good these packages are at detecting the language of very short sentences, such as tweets or chat messages. I have therefore prepared a short list of sentences:

I have then given these directly to the two packages, starting with the simplest form of each package's language detection call.
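A minimal sketch of what such calls can look like is shown here. The sentences are hypothetical placeholders, not the ones behind the results in this post, and both packages are assumed to be installed; `detectLanguage` is the cldr entry point as I remember it from the package documentation.

```r
# Hypothetical sample sentences, not the ones used for the results in this post
sentences <- c(
  "This is a short sentence in English.",
  "Questa frase in italiano contiene poche parole.",
  "Dies ist ein kurzer Satz auf Deutsch."
)

# textcat: simplest call, one guess per sentence
if (requireNamespace("textcat", quietly = TRUE)) {
  textcat_result <- textcat::textcat(sentences)
  print(textcat_result)
}

# cldr: simplest call, one row of results per sentence
if (requireNamespace("cldr", quietly = TRUE)) {
  cldr_result <- cldr::detectLanguage(sentences)
  str(cldr_result)
}
```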

The results of the two calls are as follows:

  • cldr returns a data frame containing a total of 13 variables for each document (a sentence, in this case)
  • textcat returns a character vector containing the name of the language detected for each document submitted

Here is the table with my results:

Apart from textcat having an issue with Turkish, which I tried unsuccessfully to resolve using more complex options (selecting different distance algorithms), the two libraries behave in substantially the same way on these sample sentences. This changes when running more tests with shorter sentences: here cldr behaves slightly better, which is why I think it is a pity that this package is no longer maintained.