- Use Libraries and API for Language Detection in Python
- Use Language Models for Language Detection in Python
- Use Intersecting Sets for Language Detection in Python
As humans, we know only a handful of languages, which is not enough when we work with mixed-language datasets: we have to identify the language of each text or document before we can process it further. An automated language detection method helps in such situations.
Python has several language detection libraries, and we can pick whichever suits us best. These libraries recognize a language from the characters, expressions, and commonly used words in the content.
We can also build our own models with Natural Language Processing or Machine Learning instead of relying on a ready-made library. For instance, when Chrome detects that a web page’s content is not in English, it pops up a box with a button to translate it.
Behind that scenario, Chrome uses a model to predict the language of the text on the page.
Use Libraries and API for Language Detection in Python
The first way to detect languages in Python is to use a library or an API. Let’s look at the most commonly used libraries for language detection in Python.
langdetect is a Python port of Google’s language detection library. It does not come with the standard library, so it has to be installed separately.
This API is useful in text processing and linguistics and supports 55 languages.
Python 2.7 or 3.4+ is required to use this API. We can install langdetect as below.
$ pip install langdetect
We can use langdetect to detect languages after importing its detect function. The code then prints the detected language of each given sentence.
Here we have provided three sentences as examples, and the output labels them as English, pt (the code for Portuguese), and Chinese, respectively. The second sentence is a slightly misspelled Italian greeting, and inputs this short are easy to misclassify.
from langdetect import detect

print(detect("Hello World!"))
print(detect("Ciao mondoe!"))
print(detect("你好世界!"))
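The langdetect algorithm is randomized, so short or ambiguous inputs can give different answers from one run to the next. The following is a minimal sketch, not part of the original example, that fixes the seed through DetectorFactory.seed and uses detect_langs to list the candidate languages with their probabilities. The helper name ranked_languages is hypothetical.

```python
def ranked_languages(text, seed=0):
    """Return [(language_code, probability), ...] for text, best guess first.

    ranked_languages is a hypothetical helper name, not part of langdetect.
    """
    # Imported lazily so this module can be loaded before langdetect is installed.
    from langdetect import DetectorFactory, detect_langs

    # Fixing the seed makes repeated calls on the same text give the same answer.
    DetectorFactory.seed = seed

    # detect_langs returns candidate languages, each with .lang and .prob.
    return [(candidate.lang, candidate.prob) for candidate in detect_langs(text)]
```

Calling ranked_languages("Ciao mondoe!") shows how confident the library is about each candidate, which helps when deciding whether to trust the result for a short sentence.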
langid is another library for identifying languages with minimal dependencies. It is a standalone language identification tool that can detect 97 languages.
To install it, we run the command below in the terminal.
$ pip install langid
Using the code below, we can detect the language with the langid library. While looping over the list, it classifies each of the three sentences and prints a tuple of the language code and a score. The output labels the sentences as English, gl (the code for Galician), and Chinese, respectively.
import langid

T = ["Hello World!", "Ciao mondoe!", "你好世界!"]
for i in T:
    print(langid.classify(i))
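When we already know which languages can appear in the data, restricting langid to those candidates can reduce misclassifications on short text. The following is a minimal sketch under that assumption, using the LanguageIdentifier class from langid with normalized probabilities. The helper name classify_among is hypothetical.

```python
def classify_among(text, candidates):
    """Return (language_code, probability) for text, limited to candidates.

    classify_among is a hypothetical helper name, not part of langid.
    """
    # Imported lazily so this module can be loaded before langid is installed.
    from langid.langid import LanguageIdentifier, model

    # norm_probs=True rescales the raw score to a probability between 0 and 1.
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

    # Only the listed language codes are considered during classification.
    identifier.set_languages(candidates)
    return identifier.classify(text)
```

For example, classify_among("Ciao mondoe!", ["en", "it", "zh"]) only chooses among English, Italian, and Chinese. The sketch rebuilds the identifier on every call for simplicity; in real code it would be created once and reused.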
textblob is another API that uses Google Translate’s language detector to work on textual data. It builds on the NLTK (Natural Language Toolkit) and pattern modules, both heavyweights in the Python ecosystem.
Besides detecting the language, this simple API also does sentiment analysis, noun phrase extraction, part-of-speech tagging, classification, and more.
To use this API, we need Python 2.7 or 3.5 and above, and an internet connection.
We have to install the package with the pip command.
$ pip install textblob
After that, we can detect the language by importing TextBlob from the module. Here we have assigned three sentences in different languages to the list named T.
While looping through it with a for loop, the code detects the language of each sentence and prints it out.
from textblob import TextBlob

T = ["Hello World!", "Bonjour le monde!", "你好世界!"]
for i in T:
    lang = TextBlob(i)
    print(lang.detect_language())
Because the detect_language method of textblob is deprecated, the above code displays an error instead of the detected language. So, using this method is not recommended; we can use the Google Translate API instead.
In addition to the above APIs and libraries, there are others such as guess language and many more. Depending on the use case, we can use them too.
FastText is well suited to long text and offers high accuracy. Also, pycld can detect multiple languages in a single text.
googletrans is a free, unofficial Python library built on the Google Translate web service, and it advertises unlimited requests. It can auto-detect languages and is fast.
FastText is a text classifier that can recognize 176 languages and provides quick and accurate outputs. It is the language detection library used by Facebook.
Apart from using libraries or APIs, we can detect languages by using language models or intersecting sets.
Use Language Models for Language Detection in Python
Here, a language model gives the probability of a sequence of words. We can build N language models, one for each language, score the text with each of them, and pick the language whose model gives the highest score.
These language models enable us to detect the language of the text even if it contains a diverse set of languages.
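The idea can be sketched with one small model per language. The example below is an illustration under several assumptions: it works on character bigrams rather than words (word models need far larger corpora), it uses add-one smoothing, and the tiny training sentences in CORPORA are made up for the demonstration. All names in it are hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, corpus, alphabet_size):
        text = " " + corpus.lower() + " "
        # How often each (previous, current) character pair occurs.
        self.pair_counts = Counter(zip(text, text[1:]))
        # How often each character occurs as the previous character.
        self.context_counts = Counter(text[:-1])
        self.alphabet_size = alphabet_size

    def log_prob(self, text):
        """Return the log-probability of text under this model."""
        text = " " + text.lower() + " "
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            numerator = self.pair_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


# Toy training text, invented for this sketch; real models need much more data.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "the world looks like when we say hello to it",
    "fr": "le renard brun saute par dessus le chien paresseux et voici "
          "le monde quand nous lui disons bonjour",
    "it": "la volpe marrone salta sopra il cane pigro e questo è il mondo "
          "quando gli diciamo ciao",
}

# Shared alphabet size (plus one slot for unseen characters) keeps scores comparable.
ALPHABET = {ch for corpus in CORPORA.values() for ch in corpus.lower()}
MODELS = {lang: CharBigramModel(corpus, len(ALPHABET) + 1)
          for lang, corpus in CORPORA.items()}


def detect_by_language_model(text):
    """Score text with every model and return the best language and all scores."""
    scores = {lang: model.log_prob(text) for lang, model in MODELS.items()}
    return max(scores, key=scores.get), scores


print(detect_by_language_model("the dog jumps over the fox")[0])
```

With training text this small the sketch only works reliably on sentences close to the training data; the structure (N models, highest score wins) is the point.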
Use Intersecting Sets for Language Detection in Python
The next way to detect languages is to use intersecting sets. Here we prepare N sets, one per language, each containing the most frequent words of that language, and intersect the words of the text with each set.
The detected language is the one whose set has the most words in common with the text.
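A minimal sketch of this approach is shown below. The word lists in FREQUENT_WORDS are short, hand-picked samples chosen for illustration, not real frequency lists, and the function name detect_by_overlap is hypothetical.

```python
import re

# Small hand-picked samples of very common words; real lists would be longer.
FREQUENT_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "de", "et", "un", "une", "est", "que", "dans"},
    "it": {"il", "la", "di", "che", "e", "un", "una", "è", "per", "non"},
}


def detect_by_overlap(text):
    """Return (best_language_or_None, overlap_counts) for text."""
    # Split the text into a set of lowercase words.
    words = set(re.findall(r"\w+", text.lower()))
    # Count how many of those words appear in each language's set.
    overlaps = {lang: len(words & vocab) for lang, vocab in FREQUENT_WORDS.items()}
    best = max(overlaps, key=overlaps.get)
    # No shared words at all means the language cannot be decided.
    return (best if overlaps[best] > 0 else None), overlaps


print(detect_by_overlap("The cat is in the garden with that dog"))
print(detect_by_overlap("Le chat est dans le jardin et la maison"))
```

Words shared between languages, such as "la" in French and Italian, count for both, so the method works best on sentences long enough to contain several frequent words.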
Overall, the most systematic way to detect languages in Python is to use libraries and APIs, but they differ in accuracy, language coverage, speed, and memory consumption.
We can choose suitable libraries or build our own models according to the use case.
When a model depends on only one language, text in other languages can be considered noise. Language detection is therefore a step in data cleaning: by detecting languages, we can filter that noise out of the data.
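As a rough illustration of that cleaning step, the sketch below splits a list of texts into those in the target language and the rest. The names split_by_language and toy_detector are hypothetical, and toy_detector is only a stand-in that separates Chinese from everything else; in practice any detector from this article could be passed in instead.

```python
def toy_detector(text):
    """Stand-in detector: 'zh' if the text has CJK ideographs, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def split_by_language(records, detector, target="en"):
    """Split records into (kept, noise) based on the detected language."""
    kept, noise = [], []
    for text in records:
        # Keep the text only when the detector reports the target language.
        (kept if detector(text) == target else noise).append(text)
    return kept, noise


records = ["Hello World!", "你好世界!", "Language detection helps clean data."]
kept, noise = split_by_language(records, toy_detector)
print(kept)
print(noise)
```

Because the detector is passed in as a function, the filtering logic stays the same when the stand-in is swapped for a real library.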