Remove HTML Tags From a String in Python

  1. Use Regex to Remove HTML Tags From a String in Python
  2. Use BeautifulSoup to Remove HTML Tags From a String in Python
  3. Use xml.etree.ElementTree to Remove HTML Tags From a String in Python

This guide covers three ways to remove HTML tags from a string in Python: a regular expression, BeautifulSoup, and the standard library's xml.etree.ElementTree module.

Use Regex to Remove HTML Tags From a String in Python

HTML tags are always enclosed in angle brackets (< and >). We import the built-in re (regular expression) module and use its compile() function to build a pattern that matches any tag in the input string.

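Here, the pattern <.*?> matches a <, followed by zero or more characters, followed by the nearest >. The ? makes the quantifier non-greedy, so it matches as few characters as possible and stops at the end of each individual tag.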
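
To see why the non-greedy form matters, the short sketch below compares it with the greedy pattern <.*> on the same input. It uses re.findall(), which is not part of the example in this guide, purely to list what each pattern matches.

```python
import re

string = '<h1>Delftstack</h1>'

# Greedy: .* runs to the last > in the string, so the whole input is one match
greedy_matches = re.findall('<.*>', string)

# Non-greedy: .*? stops at the first >, so each tag is matched separately
lazy_matches = re.findall('<.*?>', string)

print(greedy_matches)  # ['<h1>Delftstack</h1>']
print(lazy_matches)    # ['<h1>', '</h1>']
```

With the greedy pattern, substituting the match would delete the text between the tags as well, which is why the non-greedy version is the one to use for tag removal.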

The sub() function replaces every match of a pattern with another string. Here, it replaces each matched tag with an empty string.

Example Code:

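#Python 3.x
import re

string = '<h1>Delftstack</h1>'
print('String before cleaning:', string)
# Non-greedy pattern that matches a single tag such as <h1> or </h1>
to_clean = re.compile('<.*?>')
# Replace every matched tag with an empty string
cleantext = re.sub(to_clean, '', string)
print('String after cleaning:', cleantext)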

Output:

String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
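
The same idea can be wrapped in a small reusable function. The sketch below is one possible variation, not the only way to do it: the name strip_tags_regex is hypothetical, the re.DOTALL flag is added on the assumption that a tag may span several lines, and html.unescape() is an optional extra step that converts entities such as &amp; back to plain characters.

```python
import html
import re

# re.DOTALL lets . match newlines, so tags that span lines are still matched
TAG_PATTERN = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_regex(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    without_tags = TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)

sample = '<p class="intro"\n   id="first">Fish &amp; chips</p>'
print(strip_tags_regex(sample))  # Fish & chips
```

A regular expression does not understand HTML structure, so content such as comments or script bodies that contain > characters can still trip it up. For arbitrary pages, a real parser is the safer choice.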

Use BeautifulSoup to Remove HTML Tags From a String in Python

BeautifulSoup is a Python library for extracting data from HTML and XML documents. It relies on an underlying parser to do the parsing; lxml is a commonly recommended choice because of its speed.

We need to install both before proceeding, using the following commands:

#Python 3.x
pip install beautifulsoup4
pip install lxml

In the following code, we import BeautifulSoup from the bs4 package and parse the given HTML string. We then read the text content of the parsed document through its text attribute.

Example Code:

#Python 3.x
from bs4 import BeautifulSoup

string = '<h1>Delftstack</h1>'
print('String before cleaning:', string)
# Parse the markup with the lxml parser and keep only the text content
cleantext = BeautifulSoup(string, "lxml").text
print('String after cleaning:', cleantext)

Output:

String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
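
For markup with several nested elements, the text pieces can run together when read through text. The following is a minimal sketch, assuming the get_text() method with its separator and strip arguments and Python's built-in html.parser backend (so lxml is not required for this snippet). The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

string = '<div><h1>Delftstack</h1><p>Python <b>tips</b></p></div>'
soup = BeautifulSoup(string, 'html.parser')

# .text joins the text nodes exactly as they appear, with nothing in between
print(soup.text)  # DelftstackPython tips

# get_text() can insert a separator and strip whitespace around each piece
cleantext = soup.get_text(separator=' ', strip=True)
print(cleantext)  # Delftstack Python tips
```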

Use xml.etree.ElementTree to Remove HTML Tags From a String in Python

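ElementTree is a standard-library module that parses and navigates XML. Its fromstring() function parses XML directly from a string and returns an Element, which is the root element of the parsed tree. Because it is a strict XML parser, this approach only works when the HTML is also well-formed XML, with every tag closed and a single root element.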

The itertext() method creates a text iterator that loops over this element and all its sub-elements in document order, yielding the inner text of each. The join() method then concatenates those text pieces, using the string it is called on (here, an empty string) as the separator, and returns a string that is free of HTML tags.

Example Code:

#Python 3.x
import xml.etree.ElementTree as ET
string = '<h1>Delftstack</h1>'
print('String before cleaning:', string)
tree = ET.fromstring(string)
print('String after cleaning:',''.join(tree.itertext()))

Output:

String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
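
Since fromstring() is a strict XML parser, typical HTML that is not well-formed XML raises xml.etree.ElementTree.ParseError. The sketch below shows one way to guard against that. The function name strip_tags_xml and the choice to return None on failure are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def strip_tags_xml(text):
    """Return the inner text of well-formed markup, or None if parsing fails."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Raised for unclosed tags such as <br>, undefined entities, and so on
        return None
    return ''.join(root.itertext())

print(strip_tags_xml('<div><h1>Delft</h1><p>stack</p></div>'))  # Delftstack
print(strip_tags_xml('<p>Line one<br>Line two</p>'))            # None
```

When None comes back, the input is better handled by an HTML-aware parser such as BeautifulSoup.
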
Author: Fariba Laiq

