Python Extract Table From Webpage

Python Extract Table From Webpage

  1. Python Extract Table From Webpage
  2. Python Extract Table From Webpage Using Pandas
  3. Python Extract Table From Webpage Using lxml

The main aim of this article is to demonstrate how tables can be extracted from a webpage using Pandas and lxml in Python.

Python Extract Table From Webpage

Data holds significant importance in this modern era, where much information is processed, stored, and extracted daily at a high-frequency rate. Regarding that, our project may demand that we extract data from a certain online location, repository, or webpage.

This can be a possibility in numerous use cases. Perhaps it is required to access a public record-keeping site to extract, process, and store data for numerous reasons or extract data from our repository.

There are better ways to do this, but for the sake of simplicity, let’s assume that this is the required way.

To tackle this situation, we must devise a solution that establishes the connection to the webpage, reads through the page, finds any tables (if they are present), correctly extracts them, and stores them in a suitable format. This is so that it can be processed in our program.

The solution to this problem can be approached in multiple ways, two of which are mentioned below:

Python Extract Table From Webpage Using Pandas

Before proceeding further, make sure you have the following modules/packages installed:

  1. lxml
  2. html5lib
  3. BeautifulSoup4

You can install the mentioned packages with pip using the following command:

pip install lxml html5lib BeautifulSoup4

This gives the following output:

Collecting lxml
  Downloading lxml-4.9.1-cp310-cp310-win_amd64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 37.9 kB/s eta 0:00:00
Requirement already satisfied: html5lib in c:\program files\python310\lib\site-packages (1.1)
Collecting BeautifulSoup4
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 kB 29.0 kB/s eta 0:00:00
Requirement already satisfied: webencodings in c:\program files\python310\lib\site-packages (from html5lib) (0.5.1)
Requirement already satisfied: six>=1.9 in c:\program files\python310\lib\site-packages (from html5lib) (1.16.0)
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.2.post1-py3-none-any.whl (37 kB)
Collecting requests
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 420.8 kB/s eta 0:00:00
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.4/140.4 kB 1.0 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.4-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.5/61.5 kB 121.6 kB/s eta 0:00:00
Collecting charset-normalizer<3,>=2
  Downloading charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2022.9.24-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.1/161.1 kB 1.4 MB/s eta 0:00:00
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, soupsieve, lxml, BeautifulSoup4
Successfully installed BeautifulSoup4-4.11.1 lxml-4.9.1 soupsieve-2.3.2.post1 certifi-2022.9.24 charset-normalizer-2.1.1 idna-3.4 requests-2.28.1 urllib3-1.26.12

After the necessary modules have been installed, it’s time to move on to the implementation part.

Consider the following code:

import requests
import pandas as pd

url = 'https://www.ffiec.gov/census/report.aspx?year=2020&county=009&state=09&report=demographic'

html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)

This gives the following output:

    Tract Code Tract Income Level Distressed or Under-served Tract  ...  Minority Population Owner Occupied Units 1- to 4- Family Units
1       1202.0           Moderate                                No  ...                 3040                  918                  2010
2       1251.0             Middle                                No  ...                  551                 1400                  1555
3       1252.0           Moderate                                No  ...                 2088                 1139                  1992
4       1253.0           Moderate                                No  ...                 2443                  728                  1814
..         ...                ...                               ...  ...                  ...                  ...                   ...
95      1714.0           Moderate                                No  ...                 1278                  141                   638
96      1715.0           Moderate                                No  ...                 2241                  396                  1274
97      1716.0             Middle                                No  ...                 1466                 1378                  1803
98      1717.0             Middle                                No  ...                  820                 1456                  1647
99      1751.0             Middle                                No  ...                  851                  669                  1240

[100 rows x 12 columns]

It is easy to extract tables from any webpage using the Pandas library. The read_html method of the Pandas library can be used to read and extract data from webpages and then convert them to dataframes to assist in the smooth processing of data, as they become Dataframe objects.

The extracted tables can also be saved to a CSV file using the to_csv method, which saves the Dataframe object to a CSV file.

Python Extract Table From Webpage Using lxml

Consider the following code:

from lxml import etree
import urllib.request

site = "https://www.ffiec.gov/census/report.aspx?year=2020&county=009&state=09&report=demographic"

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}


request = urllib.request.Request(site, headers=hdr)
web = urllib.request.urlopen(request)
s = web.read()
html = etree.HTML(s)


## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside first 'tr'
headers = [i.text for i in tr_nodes[0].xpath("th")]

## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

for head in headers:
    print(head, end=" ")

print("")

for content in td_content:
    print(content)

This gives the output below:

Tract Code Tract Income Level  Distressed or Under   Tract Median Family Income % 2020 FFIEC Est. MSA/MD non-MSA/MD Median Family Income 2020 Est. Tract Median Family Income 2015 Tract Median Family Income Tract Population Tract Minority % Minority Population Owner Occupied Units 1- to 4- Family Units
['1201.00', 'Middle', 'No', '93.64', '$91,800', '$85,962', '$75,611', '6013', '26.44', '1590', '1862', '2248']
['1202.00', 'Moderate', 'No', '68.12', '$91,800', '$62,534', '$55,000', '6783', '44.82', '3040', '918', '2010']
['1251.00', 'Middle', 'No', '109.80', '$91,800', '$100,796', '$88,654', '4477', '12.31', '551', '1400', '1555']
['1252.00', 'Moderate', 'No', '62.55', '$91,800', '$57,421', '$50,506', '5912', '35.32', '2088', '1139', '1992']
['1253.00', 'Moderate', 'No', '57.28', '$91,800', '$52,583', '$46,250', '5164', '47.31', '2443', '728', '1814']
.
.
.
.
.

It is also possible to use lxml and urllib to extract tables from a webpage and process them. As evident from the code, we need to provide a custom header to the request; otherwise, a 403: Forbidden error is received.

After the request is successful, a search for the table is done (specifically for this site), after which we manually extract the headers and content (rows) from the table.

Although this option is a bit more lengthy and complex than the Pandas module, it is useful when more control and freedom are required regarding what to extract and what not.

Salman Mehmood avatar Salman Mehmood avatar

Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.

LinkedIn