Convert Unicode Characters to ASCII String in Python

Unicode is the global standard for encoding characters from all languages. Unlike ASCII, which defines only 128 characters and fits each one in a single byte, Unicode text encoded as UTF-8 uses up to 4 bytes per character, which lets it represent characters from virtually every writing system.
This tutorial demonstrates how to convert Unicode characters into an ASCII string. The goal is to either remove the characters that aren’t supported in ASCII or replace them with their closest ASCII equivalents.
Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python
The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make accessing, filtering, and looking up these characters significantly easier.
unicodedata has a function called normalize() that accepts two parameters: the normalization form to apply and the string to normalize.
There are 4 Unicode normalization forms: NFC, NFKC, NFD, and NFKD. The official documentation for unicodedata explains each form in depth. The NFKD form will be used throughout this tutorial.
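As a rough illustration of how the four forms differ, the sketch below normalizes a short sample string and prints the resulting code points. The sample (a precomposed é followed by the ﬁ ligature) and the variable names are chosen here for illustration only.

```python
import unicodedata

# Precomposed 'é' (U+00E9) followed by the 'ﬁ' ligature (U+FB01)
sample = '\u00e9\ufb01'

# Normalize the same string with each of the four forms
results = {form: unicodedata.normalize(form, sample)
           for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}

for form, normalized in results.items():
    # Print code points so that invisible combining marks become visible
    code_points = ' '.join(f'U+{ord(ch):04X}' for ch in normalized)
    print(form, len(normalized), code_points)
```

Only NFKD both splits é into a base letter plus a combining accent and expands the ligature into the plain letters f and i, which is what makes it convenient as a first step toward ASCII.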
Let’s declare a string that contains several non-ASCII characters.
import unicodedata
stringVal = u'Här är ett exempel på en svensk mening att ge dig.'
print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore'))
After calling the normalize() function, chain a call to the string method encode(), which does the conversion from Unicode to ASCII.
The u prefix before the string literal marks it as a Unicode string. It was required in Python 2; in Python 3 every string is already Unicode, so the prefix is optional and has no effect.
The first parameter of encode() specifies the target encoding, and the second parameter specifies what should be done if a character cannot be encoded. In this case, the second parameter is ignore, which drops any character that cannot be encoded.
Output:
b'Har ar ett exempel pa en svensk mening att ge dig.'
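Notice that the non-ASCII characters in the original string (ä and å) now appear as the plain letter a. NFKD splits each of them into the base letter a followed by a combining mark, and encode() with ignore then drops the mark because it has no ASCII representation.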
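To see where the plain a comes from, the normalized string can be inspected one code point at a time. This is a minimal sketch; the variable names are illustrative.

```python
import unicodedata

# '\u00e4' is the precomposed character ä
decomposed = unicodedata.normalize('NFKD', '\u00e4')

# Look up the official Unicode name of each resulting code point
names = [unicodedata.name(ch) for ch in decomposed]
print(names)

# Only the base letter is ASCII, so only it survives encoding with 'ignore'
ascii_bytes = decomposed.encode('ascii', 'ignore')
print(ascii_bytes)
```

The list contains LATIN SMALL LETTER A followed by COMBINING DIAERESIS, and the encoded result is b'a'.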
The b prefix at the beginning of the output denotes a bytes literal, because encode() returns a bytes object rather than a string. To get a regular string back, without the prefix and the surrounding quotes in the printed output, chain a call to decode() after encode().
print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())
Output:
Har ar ett exempel pa en svensk mening att ge dig.
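The difference between the two results can be checked by looking at their types. The sketch below uses a shortened version of the sentence and passes 'ascii' to decode() explicitly; calling decode() with no argument defaults to UTF-8, which gives the same result here because ASCII bytes are also valid UTF-8.

```python
import unicodedata

string_val = 'Här är ett exempel'

# encode() returns a bytes object
encoded = unicodedata.normalize('NFKD', string_val).encode('ascii', 'ignore')

# decode() turns the bytes back into a str
decoded = encoded.decode('ascii')

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)
```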
Let’s try another example using replace as the second parameter in the encode() function.
For this example, let’s try out a string having characters that do not have ASCII counterparts.
import unicodedata
stringVal = u'áæãåāœčćęßßßわた'
print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'replace').decode())
None of the characters in this example string exist in ASCII, but some of them decompose into an ASCII base letter plus a combining mark.
Output:
a??a?a?a??c?c?e??????
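The replace option substitutes a question mark (?) for every code point that cannot be encoded in ASCII. That covers characters with no decomposition, such as æ, ß, and わ, as well as the combining marks that NFKD splits off from letters like á, which is why a question mark follows each a, c, and e in the output. If we were to use ignore on the same string: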
print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())
The output will be:
aaaacce
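Which characters keep a base letter and which disappear entirely can be checked with unicodedata.decomposition(), which returns an empty string when a character has no decomposition mapping. The following is a small sketch over the same characters; the variable names are illustrative, and the literal is passed through NFC first so that every accented letter is a single code point regardless of how the source file stores it.

```python
import unicodedata

# Same characters as the example above, forced into precomposed (NFC) form
string_val = unicodedata.normalize('NFC', 'áæãåāœčćęßわた')

for ch in string_val:
    # An empty mapping means NFKD leaves the character unchanged
    mapping = unicodedata.decomposition(ch)
    print(ch, mapping if mapping else 'no decomposition')

# Characters that NFKD cannot break down into an ASCII base letter
undecomposed = [ch for ch in string_val if not unicodedata.decomposition(ch)]
print(undecomposed)
```

The characters in undecomposed (æ, œ, ß, わ, and た) are the ones that vanish completely with ignore, while the others leave their base letter behind.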
In summary, to convert Unicode characters into ASCII characters, use the normalize() function from the unicodedata module and the built-in encode() method for strings. You can either ignore or replace the characters that cannot be represented in ASCII. The ignore option removes them, and the replace option substitutes a question mark for each one.
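The steps covered in this tutorial can be wrapped in a small helper. The function name to_ascii and its errors parameter are hypothetical, introduced here only as a sketch and not taken from any library. The last call shows that the strict handler, which is the default for encode(), raises an exception instead of dropping or replacing characters.

```python
import unicodedata


def to_ascii(text, errors='ignore'):
    """Hypothetical helper: NFKD-normalize text, then reduce it to ASCII."""
    normalized = unicodedata.normalize('NFKD', text)
    # errors is passed straight to str.encode(): 'ignore', 'replace', or 'strict'
    return normalized.encode('ascii', errors).decode('ascii')


print(to_ascii('på svenska'))                # combining mark is dropped
print(to_ascii('Straße', errors='replace'))  # ß becomes a question mark

try:
    to_ascii('på', errors='strict')
except UnicodeEncodeError:
    print('strict mode raises UnicodeEncodeError')
```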