As I have been playing with R quite a bit and have become rusty with Python, I wanted to see how good the language detection libraries in this environment are, and also how they compare with their R counterparts.
After a quick search I found langdetect, a library derived directly from Google's language detection code. From the home page of the Python library you can reach the project page, and this appears to be a different codebase from the one the R library CLDR is based on. In fact, the Python library seems to be alive and well maintained. It claims to detect 55 languages out of the box: a simple call to the function detect returns the two-letter ISO code of the detected language, while a call to detect_langs returns a list of languages with their probabilities. In my tests the list contained a single item, while in the examples on the web site you get more.
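To make the difference between the two calls concrete, here is a minimal sketch, assuming langdetect has been installed (e.g. with pip); the DetectorFactory.seed line uses the library's documented way of making its otherwise non-deterministic results repeatable:

# Minimal sketch of the two langdetect entry points
from langdetect import detect, detect_langs, DetectorFactory

# langdetect is non-deterministic by default; fixing the seed makes runs repeatable
DetectorFactory.seed = 0

text = u"Questa frase non è scritta in Napoletano."
print(detect(text))        # a two-letter ISO code, e.g. 'it'
print(detect_langs(text))  # a list of language:probability pairs, e.g. [it:0.99999...]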
Even in Python there are choices to be made, and the alternative library I found is langid. It claims to be a standalone library capable of detecting 97 languages, it has a demo feature, and you can call langid.classify("your text") to get the most likely language together with its score. The higher the score, the more probable the language (the scores I have observed are negative numbers). Let us see a bit of code and compare these two libraries quickly:
# The language detection libraries
import langdetect
import langid

# The test sentences
documents = [u"Rechtdoor gaan, dan naar rechts.",
             u"Kemal Kılıçdaroğlu Doğan TV Center'da",
             u"I live in the countryside",
             u"Questa frase non è scritta in Napoletano.",
             u"Das ist ein deutscher satz.",
             u"La vie est magnifique",
             u"El jugador está predispuesto a que será un partido complicado.",
             u"Καιρό έχουμε να τα πούμε!",
             u"Jar kan ikke snakke Norsk",
             u"Bom dia"]

# The language detection test #1: LANGDETECT
for line in documents:
    print langdetect.detect_langs(line)

# The language detection test #2: LANGID
for line in documents:
    print langid.classify(line)
Here I have added the little bigram "Bom dia" in Portuguese to see whether a very short sentence could make any difference, as up to that point the two libraries had performed identically. These are the results.
Results from langdetect:
[nl:0.999996527104]
[tr:0.999997909896]
[en:0.999996595419]
[it:0.999994788709]
[de:0.999998669095]
[fr:0.999996192664]
[es:0.999996729694]
[el:0.999999999413]
[no:0.999994212347]
[pt:0.999995690285]
Results from langid:
('nl', -73.47373533248901)
('tr', -74.21270608901978)
('en', -28.477052211761475)
('it', -64.93842697143555)
('de', -154.69498538970947)
('fr', -47.38988637924194)
('es', -346.22576999664307)
('el', -282.3087646961212)
('nb', -34.800633907318115)
('en', 9.061840057373047)
What can I say? On the bigram thrown in just to see whether one of the two libraries would fail, langid misidentified the language, labelling the Portuguese "Bom dia" as English. However, both libraries have proven extremely easy to use and reliable in most cases, I would say even better than the libraries I have tested for R. I have also considered Python's NLTK (Natural Language Toolkit), but the scope and power of that library go well beyond my quest for a quick language detection solution, so it did not compare favourably with the other two in terms of readiness for immediate use. It is, however, probably the library to go for when tackling more complex or specific tasks.
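As a side note on langid's scores: if you prefer confidence values in the 0-1 range instead of the raw (often negative) scores shown above, langid.py also ships a LanguageIdentifier helper that can normalise them. A minimal sketch, assuming langid is installed via pip:

# Sketch: rescaling langid's raw scores into normalised 0-1 confidence values
# using the LanguageIdentifier helper bundled with langid.py
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print(identifier.classify(u"Bom dia"))  # returns a (language, probability) tuple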