Convert Unicode Characters to ASCII String in Python

Unicode is the universal character encoding standard, covering characters from virtually every writing system. ASCII defines only 128 characters, each of which fits in a single byte. Unicode defines over a million code points, and an encoding such as UTF-8 uses one to four bytes per character to represent them.

This tutorial demonstrates how to convert Unicode characters into an ASCII string. The goal is to either remove the characters that aren’t supported in ASCII or replace the Unicode characters with their corresponding ASCII character.

Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python

The Python standard library module unicodedata provides access to the Unicode Character Database, along with utility functions that make it easier to look up, inspect, and filter these characters.

unicodedata has a function called normalize() that accepts two parameters: the name of the normalization form and the string to normalize.

Unicode defines four normalization forms: NFC, NFKC, NFD, and NFKD. The official Python documentation for unicodedata explains each one in depth. In brief, the D forms decompose characters into base characters plus combining marks, the C forms recompose them afterwards, and the K forms additionally replace compatibility characters (such as ligatures) with their plain equivalents. This tutorial uses NFKD throughout.
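To make the difference between the four forms concrete, here is a minimal sketch that normalizes the same two-character string with each form. The sample string (a precomposed é followed by the ﬁ ligature) and the variable names are chosen purely for illustration.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9), then the "ﬁ" ligature (U+FB01)
sample = "\u00e9\ufb01"

forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    forms[form] = unicodedata.normalize(form, sample)
    # ascii() escapes non-ASCII code points so the differences are visible
    print(form, len(forms[form]), ascii(forms[form]))
```

Only the K forms expand the ligature into the two letters f and i, and only the D forms split é into e plus a combining accent, which is why NFKD is the most useful starting point for an ASCII conversion.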

Let’s declare a string containing several non-ASCII characters.

import unicodedata

stringVal = u'Här är ett exempel på en svensk mening att ge dig.'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore'))

After calling normalize(), chain a call to the string method encode(), which converts the normalized string to ASCII bytes.

The u prefix before the string literal is optional in Python 3, where every str is already a Unicode string. It is accepted only for compatibility with Python 2 code, where it was needed to create a Unicode string.

The first parameter of encode() names the target encoding, and the second specifies how to handle characters that cannot be encoded. Here the second parameter is ignore, which silently drops any such character.

Output:

b'Har ar ett exempel pa en svensk mening att ge dig.'

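Notice that the non-ASCII characters from the original string (ä and å) now appear as a plain a. NFKD splits each of them into the base letter a followed by a combining mark, and ignore then drops the mark because it has no ASCII representation.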
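The decomposition step can be inspected directly. The following sketch lists the code points that NFKD produces for ä and then removes the combining mark with unicodedata.combining(), which returns a non-zero class for combining characters. The variable names are illustrative only.

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "ä")

# Show each code point, its official name, and its combining class
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# Keep only the characters that are not combining marks
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)
```

Filtering on combining() removes accents without touching other non-ASCII characters, so it is an alternative to encode() when the result does not have to be strictly ASCII.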

The b prefix on the printed result shows that encode() returns a bytes object rather than a str. To get a regular string back, without the prefix and the surrounding quotes in the printed output, chain a call to decode() after encode().

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())

Output:

Har ar ett exempel pa en svensk mening att ge dig.

Let’s try another example using replace as the second parameter of encode().

For this example, let’s use a string that includes characters with no ASCII counterpart.

import unicodedata

stringVal = u'áæãåāœčćęßßßわた'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'replace').decode())

None of the characters in this example string exist in ASCII, though some of them decompose into an ASCII base letter plus a combining mark.

Output:

a??a?a?a??c?c?e??????

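The replace option substitutes a question mark ? for every code point that cannot be encoded. That includes the combining marks produced by NFKD, which is why each accented letter appears as its base letter followed by ?. If we were to use ignore on the same string: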

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())

The output will be:

aaaacce
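Besides ignore and replace, encode() accepts other standard error handlers. This sketch runs a short sample word through several of them without normalizing first, so the raw behavior of each handler is visible. The sample word and variable names are illustrative.

```python
text = "på"

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = text.encode("ascii", handler)
    print(handler, results[handler])

# With no second argument, encode() uses "strict" and raises on the first
# character that cannot be encoded.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True
print("strict raised:", strict_failed)
```

The xmlcharrefreplace and backslashreplace handlers keep the information about the original character in an ASCII-safe escape, which can be preferable when the text must be recoverable later.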

In summary, to convert Unicode characters into ASCII characters, use the normalize() function from the unicodedata module followed by the built-in encode() method of strings. Characters that cannot be encoded as ASCII can be either ignored or replaced: the ignore option removes them, and the replace option substitutes a question mark for each one.
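The steps can be wrapped in a small helper. This is a sketch rather than a definitive implementation: the function name to_ascii and the FALLBACKS table are hypothetical, and the substitutions chosen for ß, æ, and œ are an assumption made for illustration, since these letters have no decomposition under NFKD and would otherwise be dropped.

```python
import unicodedata

# Hypothetical fallback table: these substitutions are an illustrative
# choice, not something defined by unicodedata or the Unicode standard.
FALLBACKS = str.maketrans({"ß": "ss", "æ": "ae", "œ": "oe"})


def to_ascii(text, errors="ignore"):
    """Return an ASCII-only str built from text."""
    # Apply the manual substitutions first, then decompose and encode
    text = text.translate(FALLBACKS)
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", errors).decode("ascii")


print(to_ascii("Här är ett exempel på en svensk mening att ge dig."))
print(to_ascii("áæãåāœčćęßßßわた"))
```

Characters outside the table that have no ASCII base letter, such as わ and た, are still handled by the errors argument, so passing errors="replace" turns them into question marks instead of removing them.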
