Parse HTML With Python

  1. Use the BeautifulSoup Library to Parse HTML Code With Python
  2. Use the pyquery Library to Parse HTML Code With Python
  3. Use the lxml Library to Parse HTML Code With Python
  4. Use the justext Library to Parse HTML Code With Python
  5. Use the ehp Library to Parse HTML Code With Python

Python is a general-purpose programming language with a large ecosystem of libraries, and several of them can parse HTML.

This article explains how to parse HTML code with five Python libraries: BeautifulSoup, pyquery, lxml, justext, and ehp. Each library is introduced with a short example.

Use the BeautifulSoup Library to Parse HTML Code With Python

Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree from which data can be extracted, which makes it especially useful for web scraping.

To use the library, install it with the pip3 install beautifulsoup4 command.

In the example below, the text content of the first div element whose class is container, including the text of its nested div elements, is printed on the screen.

from bs4 import BeautifulSoup

code = '''<html>
<head></head>
<body attr1='val1'>
    <div class='container'>
        <div>Text 1</div>
        <div>Text 2</div>
    </div>
</body>
</html>
'''

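# Name the parser explicitly; omitting it triggers a warning and the result can vary by environment.
parsed_html = BeautifulSoup(code, 'html.parser')
# find() returns the first matching element; .text joins the text of all its descendants.
print(parsed_html.body.find('div', attrs={'class':'container'}).text)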
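To read each nested div separately instead of the combined text, one option is find_all or a CSS selector through select. The following is a minimal sketch that reuses the markup from the example above; the variable names (soup, container, inner_texts, selected_texts) are illustrative, and html.parser is the standard-library parser backend.

```python
from bs4 import BeautifulSoup

code = """<html>
<head></head>
<body attr1='val1'>
    <div class='container'>
        <div>Text 1</div>
        <div>Text 2</div>
    </div>
</body>
</html>
"""

# Use the standard-library parser backend so no extra parser package is needed.
soup = BeautifulSoup(code, "html.parser")

# Locate the outer div by class, then collect only its direct child divs.
container = soup.find("div", class_="container")
inner_texts = [
    div.get_text(strip=True)
    for div in container.find_all("div", recursive=False)
]

# The same lookup expressed as a CSS selector.
selected_texts = [
    div.get_text(strip=True) for div in soup.select("div.container > div")
]

print(inner_texts)
print(selected_texts)

# Attributes are read with dictionary-style access on the tag.
print(soup.body["attr1"])
```

Passing recursive=False limits find_all to direct children, which matters once the nested divs contain further div elements of their own.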

Use the pyquery Library to Parse HTML Code With Python

pyquery is a Python library for running jQuery-style queries on XML and HTML documents. It is built on lxml, so manipulation is fast while the syntax stays close to jQuery.

To use the library, install it with the pip3 install pyquery command.

In the example below, the text content of the div element whose class is container, including the text of its nested div elements, is printed on the screen.

from pyquery import PyQuery
code = '''<html>
<head></head>
<body attr1='val1'>
    <div class='container'>
        <div>Text 1</div>
        <div>Text 2</div>
    </div>
</body>
</html>
'''
pq = PyQuery(code)
tag = pq('div.container')
print(tag.text())

Use the lxml Library to Parse HTML Code With Python

lxml is a Python library for parsing XML and HTML files. It integrates the C libraries libxml2 and libxslt into Python.

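The lxml library is especially useful for web scraping. To use the library, install it with the pip3 install lxml command. The cssselect method used below also requires the separate cssselect package, which is installed with the pip3 install cssselect command.

In the example below, the text content and the href attribute of every a element on the page are printed on the screen. The page is fetched over the network, so the output depends on the live site.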

from lxml.html import parse
code = parse('http://www.google.com').getroot()
for link in code.cssselect('a'):
    print(f"{link.text_content()} {link.get('href')}")
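When the HTML is already available as a string, it can be parsed without a network request. The following is a minimal sketch using lxml.html.fromstring and an XPath query in place of cssselect; the sample markup and its URLs are made up for illustration.

```python
from lxml.html import fromstring

# Hypothetical markup standing in for a downloaded page.
code = """<html>
<body>
    <a href="https://example.com/first">First link</a>
    <a href="/second">Second <b>link</b></a>
</body>
</html>
"""

# fromstring() parses a string and returns the root element.
root = fromstring(code)

# xpath() needs no extra package; "//a" matches every a element in the document.
links = [(a.text_content(), a.get("href")) for a in root.xpath("//a")]

for text, href in links:
    print(f"{text} {href}")
```

text_content() joins the text of nested elements, so the second link is reported as "Second link" even though part of it sits inside a b element.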

Use the justext Library to Parse HTML Code With Python

jusText is a Python library to remove non-text content, such as navigation links, headers, and footers, from HTML code. It preserves mainly text containing full sentences.

To use the library, install it with the pip3 install justext command. The example also uses the requests library to download the page.

In the example below, the paragraphs of the website that are not classified as boilerplate are printed on the screen.

import requests
import justext

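# Fetch the page; the timeout keeps the script from hanging on a slow server.
code = requests.get("http://planet.python.org/", timeout=10)
code.raise_for_status()
# justext returns one object per paragraph, each classified as boilerplate or not.
content = justext.justext(code.content, justext.get_stoplist("English"))
for line in content:
    if not line.is_boilerplate:
        print(line.text)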

Use the ehp Library to Parse HTML Code With Python

Easy HTML Parser is a Python library to parse HTML and XML documents. The ehp library is especially useful for web scraping.

To use the library, install it with the pip3 install ehp command.

In the example below, the text content of the div elements whose class is container will be printed on the screen.

from ehp import *

code = '''<html>
<head></head>
<body attr1='val1'>
    <div class='container'>
        <div>Text 1</div>
        <div>Text 2</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(code)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())
Author: Yahya Irmak

Yahya Irmak has experience in full stack technologies such as Java, Spring Boot, JavaScript, CSS, HTML.
