Convert Unicode Characters to ASCII String in Python

Unicode is the universal character standard that assigns a code point to the characters of nearly every writing system. ASCII defines only 128 characters, each of which fits in a single byte, whereas Unicode encodings such as UTF-8 use up to 4 bytes per character, which lets them represent far more characters from any language.
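As a rough illustration of the size difference, the sketch below encodes a few sample characters (chosen here for illustration, not taken from the tutorial) as UTF-8 and reports how many bytes each one needs. It assumes Python 3.7 or later, since it relies on str.isascii().

```python
# Sample characters, written as escapes so the exact code points are unambiguous:
# "a", a-with-diaeresis, hiragana "wa", and an emoji.
samples = ["a", "\u00e4", "\u308f", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # isascii() is True only for code points in the ASCII range (0 to 127).
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), ASCII: {ch.isascii()}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Only the first character is in the ASCII range; the others need two, three, and four bytes in UTF-8.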

This tutorial demonstrates how to convert Unicode characters into an ASCII string. The goal is to either remove the characters that aren’t supported in ASCII or replace the Unicode characters with their corresponding ASCII character.

Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python

The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make it significantly easier to access, filter, and look up these characters.

unicodedata has a function called normalize() that accepts two parameters: the name of the normalization form and the string to normalize.

There are 4 Unicode normalization forms: NFC, NFKC, NFD, and NFKD. The official documentation explains each form in depth. This tutorial uses NFKD throughout, because it decomposes accented characters into a base letter followed by separate combining marks.
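A quick way to see how the four forms differ is to normalize the same short string with each of them and list the resulting code points. The two-character sample string below is an assumption made for this sketch: a precomposed ä followed by the "fi" ligature.

```python
import unicodedata

# "\u00e4" is a precomposed a-with-diaeresis; "\ufb01" is the "fi" ligature.
text = "\u00e4\ufb01"
forms = ("NFC", "NFD", "NFKC", "NFKD")

for form in forms:
    normalized = unicodedata.normalize(form, text)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form}: {code_points}")

# Number of code points each form produces for the same input.
lengths = {form: len(unicodedata.normalize(form, text)) for form in forms}
```

The D forms split ä into a base letter plus a combining mark, and only the K forms break the ligature into plain f and i, which is why NFKD leaves the most ASCII behind.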

Let’s declare a string that contains several non-ASCII characters.

import unicodedata

stringVal = u'Här är ett exempel på en svensk mening att ge dig.'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore'))

After calling the normalize() function, chain a call to the string method encode(), which converts the normalized text to ASCII bytes.

The u prefix before the string literal marks it as a Unicode string. In Python 3 every string is already Unicode, so the prefix is optional and has no effect; it only matters in Python 2.

The first parameter of encode() specifies the target encoding, and the second specifies what should be done when a character cannot be encoded. In this case, the second parameter is ignore, which drops any character that can’t be converted.

Output:

b'Har ar ett exempel pa en svensk mening att ge dig.'

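Notice that the non-ASCII characters from the original string (ä and å) have been replaced with their ASCII counterpart (a). NFKD splits each of them into the base letter a followed by a combining mark, and ignore then drops the mark.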
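To see where the plain a comes from, the normalized string can be inspected one code point at a time. This sketch uses unicodedata.name() and unicodedata.combining() on a single ä; the variable names are illustrative only.

```python
import unicodedata

# Normalize a single precomposed a-with-diaeresis (U+00E4).
decomposed = unicodedata.normalize("NFKD", "\u00e4")

for ch in decomposed:
    # combining() returns 0 for base characters and a nonzero class for marks.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} combining={unicodedata.combining(ch)}")

names = [unicodedata.name(ch) for ch in decomposed]

# The combining mark is not ASCII, so "ignore" drops it and keeps the base letter.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
```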

The b prefix at the beginning of the output denotes a bytes object, since encode() returns bytes rather than a string. To get rid of the prefix and the surrounding single quotes, chain a call to decode() after encode() to convert the result back into a string.

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())

Output:

Har ar ett exempel pa en svensk mening att ge dig.

Let’s try another example, this time using replace as the second parameter of encode().

For this example, let’s use a string containing characters that have no ASCII counterparts.

import unicodedata

stringVal = u'áæãåāœčćęßßßわた'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'replace').decode())

None of the characters in this example string exist in ASCII, but some of them decompose into an ASCII base letter plus a combining mark.

Output:

a??a?a?a??c?c?e??????

The replace option substitutes a question mark (?) for every character that cannot be encoded, including the combining marks left over from decomposed letters such as á. If we were to use ignore on the same string:

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode())

The output will be:

aaaacce
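ignore and replace are two of several error handlers that encode() accepts. The sketch below skips normalization so that each handler’s effect is visible in isolation, and runs a short sample word (an assumption for this example) through four of them. It also shows that the default handler, strict, raises an exception instead of returning a result.

```python
text = "H\u00e4r"  # the word "Här" with a precomposed a-with-diaeresis

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what to emit for the one non-ASCII character.
    results[handler] = text.encode("ascii", handler).decode("ascii")
    print(f"{handler}: {results[handler]}")

# With no handler given, encode() uses "strict" and raises UnicodeEncodeError.
strict_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    strict_error = exc
    print(f"strict: {exc.reason}")
```

xmlcharrefreplace and backslashreplace keep the information about the original character in escaped form, which can be useful when the text must stay recoverable.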

In summary, to convert Unicode characters into ASCII characters, use the normalize() function from the unicodedata module together with the built-in encode() method of strings. You can either ignore or replace characters that have no ASCII counterpart: ignore removes them, and replace substitutes a question mark for each one.
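The steps above can be wrapped in a small helper. The function name to_ascii and its errors parameter are hypothetical, introduced only for this sketch; the body is the same normalize, encode, decode chain used throughout the tutorial.

```python
import unicodedata


def to_ascii(text, errors="ignore"):
    """Hypothetical helper: NFKD-normalize text, then reduce it to ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    # encode() yields bytes; decode() turns them back into a regular string.
    return normalized.encode("ascii", errors).decode("ascii")


# Escapes stand in for the Swedish letters a-with-diaeresis and a-with-ring.
sample = "H\u00e4r \u00e4r ett exempel p\u00e5 en svensk mening"
print(to_ascii(sample))

# U+00DF (sharp s) has no decomposition, so "replace" turns it into "?".
print(to_ascii("\u00df", errors="replace"))
```

Defaulting errors to ignore mirrors the first example; passing replace instead makes it visible where characters were lost.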

Rayven Esplanada
