How to Convert Unicode Characters to ASCII String in Python

Rayven Esplanada Feb 02, 2024
  1. Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python
  2. Use the unidecode Library to Convert Unicode to ASCII String in Python
  3. Conclusion

Unicode is the universal character standard that covers virtually all of the world's writing systems. ASCII defines only 128 characters, each of which fits in a single byte, whereas Unicode has room for more than a million code points; in the UTF-8 encoding, a single character takes between 1 and 4 bytes.
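
As a quick illustration of that size difference, the sketch below prints the code point and the UTF-8 byte length of a few characters. The sample characters are picked only for illustration.

```python
# Compare code points and UTF-8 byte lengths for a few sample characters.
samples = ["a", "ä", "わ", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: U+{ord(ch):04X}, {len(encoded)} byte(s) in UTF-8")

# Collected lengths, one entry per sample character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Only the first sample is an ASCII character; the others need two or more bytes in UTF-8.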

This tutorial demonstrates how to convert Unicode characters into an ASCII string. The goal is either to remove the characters that ASCII does not support or to replace them with their closest ASCII characters.

Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python

The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make looking up, filtering, and transforming these characters significantly easier.

Normalizing Unicode

unicodedata has a function called normalize() that accepts two parameters: the name of the normalization form to apply and the string to normalize.

There are four Unicode normalization forms: NFC, NFKC, NFD, and NFKD. The official Python documentation for unicodedata explains each of them in depth.

The NFKD normalized form will be used throughout this tutorial.

Syntax:

unicodedata.normalize(form, unistr)

Parameters:

  • form: This specifies the Unicode normalization form to apply to the input string.
  • unistr: The input Unicode string that we want to normalize according to the chosen normalization form.
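
To make the difference between the four forms more concrete, here is a minimal sketch that normalizes one short sample string with each form and prints the resulting code points. The sample string (a ligature followed by an accented letter) is an assumption chosen for illustration.

```python
import unicodedata

# "\ufb01" is the single-character ligature "ﬁ"; "\u00e9" is the precomposed "é".
sample = "\ufb01\u00e9"

lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, sample)
    lengths[form] = len(normalized)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form}: {code_points}")
```

Only the "K" (compatibility) forms split the ligature into f and i, and only the "D" (decomposed) forms separate é into e plus a combining accent. NFKD does both, which is why it suits ASCII conversion.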

Now, let’s declare a string with multiple Unicode characters.

Code Example:

import unicodedata

stringVal = "Här är ett exempel på en svensk mening att ge dig."

print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore"))

In the code, we start by importing the unicodedata module, which allows us to work with Unicode characters. We then define a string called stringVal with the value "Här är ett exempel på en svensk mening att ge dig.", which contains several characters with diacritics.

We then use the unicodedata.normalize() function with the "NFKD" (Normalization Form KD) parameter to normalize the stringVal. This normalization form decomposes characters with diacritics into their base characters and diacritic marks.

The result of the normalization is encoded using the "ascii" codec, and we specify "ignore" as the error handler. This means that any character that cannot be converted to ASCII will be ignored.

Output:

b'Har ar ett exempel pa en svensk mening att ge dig.'

The output is a bytes object (indicated by the b prefix). NFKD split each accented letter into a base letter and a separate combining mark, and the "ignore" handler then dropped the marks because they cannot be encoded as ASCII.

In this case, the characters ä and å are reduced to a, and the result is b'Har ar ett exempel pa en svensk mening att ge dig.'. The bytes object can be decoded to obtain a plain ASCII string if needed.
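
To see what the normalization step actually produces, the following sketch inspects the decomposed form of a single ä before it is encoded. The variable names are illustrative only.

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "ä")

# Pair each code point's name with its combining class:
# 0 means a base character, non-zero means a combining mark.
parts = [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in decomposed]
for name, combining_class in parts:
    print(name, combining_class)

# Encoding with "ignore" keeps the base letter and drops the mark.
ascii_only = decomposed.encode("ascii", "ignore")
print(ascii_only)
```

The base letter survives the ASCII encoding, while the combining diaeresis is discarded.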

To get rid of the b prefix and the single quotes around the string, call decode() after encode() to convert the bytes back into a regular string.

import unicodedata

stringVal = "Här är ett exempel på en svensk mening att ge dig."

print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore").decode())

Output:

Har ar ett exempel pa en svensk mening att ge dig.
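
If the conversion is needed in several places, the three chained calls can be wrapped in a small function. The helper below is a hypothetical convenience wrapper, not part of unicodedata.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: decompose with NFKD, then drop anything non-ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


result = to_ascii("Här är ett exempel på en svensk mening att ge dig.")
print(result)
```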

Handling Untranslatable Characters

Let’s try another example, this time passing "replace" as the second argument to encode(). For this example, we use a string that contains characters without ASCII counterparts.

Code Example:

import unicodedata

stringVal = "áæãåāœčćęßßßわた"

print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "replace").decode())

In the code, we begin by importing the unicodedata module. Then, we define a Unicode string called stringVal, which contains a mix of characters, "áæãåāœčćęßßßわた".

Next, we utilize the unicodedata.normalize() function with the "NFKD" parameter to normalize the stringVal. This normalization form decomposes characters into their base characters and diacritic marks, preparing them for conversion to ASCII.

The normalized string is then encoded using the "ascii" codec with the "replace" error handler. When characters in the string do not have direct ASCII representations, the "replace" handler replaces them with a question mark (?) symbol.

Output:

a??a?a?a??c?c?e??????

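The output is a string in which every code point that cannot be encoded as ASCII is replaced with a question mark. This includes the combining marks produced by the decomposition, which is why each accented letter is followed by a ?, while characters that have no decomposition (æ, œ, ß, and the Japanese characters) each turn into a single ?.

Using "replace" keeps a visible placeholder wherever something could not be converted, so lost characters are easy to spot in the output.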
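
"replace" and "ignore" are not the only error handlers that encode() accepts. As a rough comparison, the sketch below runs a single unencodable character (ß, chosen for illustration) through several of Python's built-in handlers.

```python
text = "ß"

# The default handler, "strict", raises instead of substituting.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encode the same character with several built-in error handlers.
results = {
    handler: text.encode("ascii", handler).decode("ascii")
    for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")
}
for handler, value in results.items():
    print(f"{handler}: {value!r}")
```

The last two handlers keep enough information to recover the original character later, which can be useful when the output is only a temporary ASCII-safe form.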

To remove the question marks, we use "ignore" instead of "replace" on the same stringVal:

import unicodedata

stringVal = "áæãåāœčćęßßßわた"

print(unicodedata.normalize("NFKD", stringVal).encode("ascii", "ignore").decode())

Output:

aaaacce

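As seen in the output, every character that would have become a question mark (?) is dropped because "ignore" is used instead of "replace", resulting in the output string aaaacce. Note that æ, œ, ß, and the Japanese characters disappear entirely, since they have no base ASCII letter to fall back on.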
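
One possible way to keep such letters is to substitute them before normalizing. The sketch below assumes a small hand-written fallback table; both the table and the function name are hypothetical and deliberately incomplete.

```python
import unicodedata

# Hypothetical fallback table for letters that NFKD cannot decompose;
# extend it for the characters your own data contains.
FALLBACKS = str.maketrans({"æ": "ae", "œ": "oe", "ß": "ss"})


def to_ascii_with_fallbacks(text: str) -> str:
    # Apply the manual substitutions first, then strip the remaining accents.
    substituted = text.translate(FALLBACKS)
    normalized = unicodedata.normalize("NFKD", substituted)
    return normalized.encode("ascii", "ignore").decode("ascii")


print(to_ascii_with_fallbacks("Straße"))
print(to_ascii_with_fallbacks("œuvre"))
```

Characters that are neither in the table nor decomposable are still dropped, so this approach only helps when the set of problem characters is small and known in advance.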

Use the unidecode Library to Convert Unicode to ASCII String in Python

unidecode is a third-party library that converts Unicode text to its closest ASCII representation. It is not part of the standard library, so we first need to install it with the pip install unidecode command.

Once the library is installed, we can import its unidecode() function and pass it the text we want to convert.

Basic Syntax:

from unidecode import unidecode

ascii_text = unidecode(unicode_text)

Parameter:

  • unicode_text: This is the Unicode text that we want to convert to its closest ASCII representation. We pass it as an argument to the function, which returns the corresponding ASCII string.

Code Example:

from unidecode import unidecode

stringVal = "Här är ett exempel på en svensk mening att ge dig."

ascii_str = unidecode(stringVal)
print(ascii_str)

In this code, we import the unidecode function from the unidecode library. Then, we pass our Unicode string, stringVal, to the unidecode function, which returns an ASCII representation of the string.

Finally, we print the ascii_str, which contains the ASCII representation of the original Unicode string.

Output:

Har ar ett exempel pa en svensk mening att ge dig.

The unidecode library has transformed the original Unicode string "Här är ett exempel på en svensk mening att ge dig." into an ASCII representation while preserving the closest phonetic representation. In this case, it replaced characters like ä and å with their closest ASCII equivalents.
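
Because unidecode is an optional dependency, code that has to run where it may not be installed can fall back to the unicodedata approach. The helper below is a hypothetical sketch of that idea, not an API offered by either library.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party; may not be installed
except ImportError:
    unidecode = None


def to_ascii_prefer_unidecode(text: str) -> str:
    """Hypothetical helper: use unidecode if available, else NFKD + ignore."""
    if unidecode is not None:
        return unidecode(text)
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


converted = to_ascii_prefer_unidecode("Här är ett exempel")
print(converted)
```

The two branches agree for simple accented letters like the ones in this sentence, but they can differ for characters that have no decomposition, so the fallback should be treated as a degraded mode rather than an exact substitute.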

Conclusion

This article explored two methods for converting Unicode characters to ASCII strings in Python. It started with the standard-library unicodedata module, which decomposes characters so that their accents can be stripped, but which removes or replaces any character that has no ASCII base letter.

Then, it introduced the unidecode library, a convenient third-party tool that aims to keep a close phonetic representation while converting Unicode to ASCII. The choice of method depends on our specific requirements: whether an extra dependency is acceptable, and how characters without a direct ASCII equivalent should be handled.
