Language Detection in Python

  1. Use Libraries and API for Language Detection in Python
  2. Use Language Models for Language Detection in Python
  3. Use Intersecting Sets for Language Detection in Python
  4. Conclusion

As humans, we recognize only a handful of languages, which is not enough when we deal with datasets that mix several of them: we first have to identify the language used in each text or document before processing it further. Adopting a language detection method helps in exactly such situations.

Python has several language detection libraries, and we can pick whichever suits us best. These libraries recognize a language from the characters, expressions, and commonly used words in the content.

We can also build language detection models using Natural Language Processing or Machine Learning together with Python libraries. For instance, when Chrome detects that a web page’s content is not in English, it pops up a box with a button to translate it.

Behind that scenario, Chrome uses a model to predict the language of the text on the web page.

Use Libraries and API for Language Detection in Python

The first way to detect languages in Python is with a library or an API. Let’s look at the libraries most commonly used for language detection in Python.

langdetect

langdetect is a port of Google’s language-detection library. It does not ship with Python’s standard utility modules, so it has to be installed separately.

This API is useful in text processing and linguistics and supports 55 languages.

Python 2.7 or 3.4+ is required to use this API. We can install langdetect as shown below.

$ pip install langdetect

We can use langdetect to detect languages after importing the detect function. The code then prints the detected language of each given sentence.

Here we have provided three sentences as examples, and it displays their languages as English (en), Italian (it), and Chinese (zh-cn), respectively.

Code:

from langdetect import detect

print(detect('Hello World!'))
print(detect('Ciao mondo!'))
print(detect('你好世界!'))

Output:

en
it
zh-cn

langid

langid is another library for detecting languages with minimal dependencies. It is a standalone language identification tool that can detect 97 languages.

To install, we have to type the below command in the terminal.

$ pip install langid

Using the code below, we can detect the language with the langid library. While looping, it classifies each of the three sentences and prints a tuple of the detected language code and a score: English (en), Italian (it), and Chinese (zh), respectively.

Code:

import langid

T = ['Hello World!', 'Ciao mondo!', '你好世界!' ]

for i in T:
    print(langid.classify(i))

Output:

Each printed result is a tuple of the detected language code and its score.

textblob

textblob is another API whose language detector relied on Google Translate to work on textual data. It is built on top of NLTK (Natural Language Toolkit) and the pattern module, two well-established Python packages.

Besides language detection, this simple API performs sentiment analysis, noun phrase extraction, part-of-speech tagging, classification, and more.

To use this API, Python 2.7 or 3.5+ is required, along with an internet connection.

We have to install the package with the pip command.

$ pip install textblob

After that, we can detect the language by importing the TextBlob class. Here we have assigned three sentences in different languages to the list named T.

While looping through a for loop, it detects the language of each of the three sentences and prints it out.

Code:

from textblob import TextBlob

T = ['Hello World!', 'Bonjour le monde!', '你好世界!' ]

for i in T:
    lang = TextBlob(i)
    # detect_language() was deprecated and later removed from textblob,
    # so this call raises an error in recent versions.
    print(lang.detect_language())

As TextBlob’s detect_language() method is deprecated, the above code displays an error instead of the detected languages. Using this method is therefore not recommended; we can use the Google Translate API instead.


In addition to the above APIs and libraries, we have googletrans, fastText, spaCy, polyglot, pycld2, chardet, guess_language, and many more. We can use them too, depending on the use case.

Among them, polyglot and fastText are the best libraries for long text, with high accuracy. Also, polyglot and pycld2 can detect multiple languages within a single text.

googletrans is a free Python library that uses the Google Translate API. It can auto-detect languages and is fast and reliable.

fastText is a text classifier that can recognize 176 languages and provides quick and accurate output. fastText is the language detection library used by Facebook.

Apart from using libraries or APIs, we can detect languages by using language models or intersecting sets.

Use Language Models for Language Detection in Python

A language model gives the probability of a sequence of words (or characters). For detection, we train one model per language, N models for N languages, score the text with each, and pick the language whose model gives the highest score.

Such language models enable us to detect the language of a text even if it contains a diverse set of languages.
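The approach can be sketched with simple character-bigram models built from tiny per-language samples; the sample texts and smoothing choice below are illustrative only, and a real system would train on large corpora:

```python
import math
from collections import Counter

# Tiny per-language training samples (illustrative only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "it": "il gatto dorme sul divano e il cane corre nel giardino verde",
    "fr": "le chat dort sur le canape et le chien court dans le jardin",
}

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

# One character-bigram count table (model) per language.
MODELS = {lang: Counter(bigrams(text)) for lang, text in SAMPLES.items()}

def score(text, model):
    """Log-probability of the text under a bigram model,
    with add-one smoothing for unseen bigrams."""
    total = sum(model.values())
    vocab = len(model)
    return sum(
        math.log((model[bg] + 1) / (total + vocab)) for bg in bigrams(text)
    )

def detect_language(text):
    # The language whose model scores the text highest wins.
    return max(MODELS, key=lambda lang: score(text.lower(), MODELS[lang]))

print(detect_language("the dog runs"))  # → en
```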

Use Intersecting Sets for Language Detection in Python

Another way to detect languages is by using intersecting sets. Here, we prepare N sets containing the most frequent words of each language and intersect the words of the text with each set.

The detected language is then the one whose set has the largest intersection with the text.
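This idea fits in a few lines of plain Python; the small stop-word sets below are illustrative stand-ins for the few hundred most frequent words per language that a real system would use:

```python
# Small illustrative frequent-word sets per language.
FREQUENT_WORDS = {
    "en": {"the", "and", "is", "in", "it", "of", "to", "that"},
    "it": {"il", "e", "che", "di", "la", "un", "per", "non"},
    "fr": {"le", "et", "que", "de", "la", "un", "pour", "pas"},
}

def detect_language(text):
    words = set(text.lower().split())
    # Pick the language whose frequent-word set overlaps the text most.
    return max(FREQUENT_WORDS, key=lambda lang: len(words & FREQUENT_WORDS[lang]))

print(detect_language("the cat is in the house"))  # → en
```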

Conclusion

Overall, the systematic way to detect languages in Python is with libraries and APIs. They differ in accuracy, language coverage, speed, and memory consumption.

We can choose suitable libraries and build models per the use case.

When a model depends on only one language, text in other languages can be treated as noise. Language detection is a step in data cleaning, so detecting languages helps us obtain noise-free data.

Migel Hewage Nimesha

Nimesha has been a full-stack software engineer for more than five years. He loves technology, as it has the power to solve many of our problems within minutes. He has been contributing to various projects over the last 5+ years, working across all three of the classic tiers (DB, middle tier, and client). Recently, he has started working with DevOps technologies such as Azure administration, Kubernetes, Terraform automation, and Bash scripting.