How to Convert XML to CSV Using Python

Vaibhav Vaibhav Feb 02, 2024
  1. Convert XML to CSV Using Python xml Module
  2. Convert XML to CSV Using Python xmltodict Library
  3. Convert XML to CSV Using Python lxml and CSV Libraries
How to Convert XML to CSV Using Python

XML or Extensible Markup Language is a markup language that stores data in a document in both human-readable and machine-readable format. It has a file extension, .xml.

CSV or Comma Separated Values stores data in a document separated by a comma. It has a file extension, .csv. In this article, we will learn how to convert XML data to CSV data using Python.

Convert XML to CSV Using Python xml Module

To convert XML to CSV, we can use the in-built xml module in Python.

This module has yet another module, ElementTree, that we can use to represent an XML document as a tree. It has a method parse() that accepts an XML document as a parameter.

It parses the whole XML document and stores it in the form of a tree. The tree has a getroot() method that returns the root element of the tree.

Using a for loop, we can iterate over the tree and access the data using the tag names (name, rollnumber, and age).

To understand the conversion, we need some sample XML data. You can use the following XML data and store it in a file named input.xml.

The following code snippet will use this data, and you should too so that we are on the same page.

<students>
    <student>
        <name>Rick Grimes</name>
        <rollnumber>1</rollnumber>
        <age>15</age>
    </student>
    <student>
        <name>Lori Grimes</name>
        <rollnumber>2</rollnumber>
        <age>16</age>
    </student>
    <student>
        <name>Carl Grimes</name>
        <rollnumber>3</rollnumber>
        <age>14</age>
    </student>
    <student>
        <name>Judith Grimes</name>
        <rollnumber>4</rollnumber>
        <age>13</age>
    </student>
</students>

Below is the Python code that performs the conversion from XML to CSV:

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the XML file
tree = ET.parse("students.xml")
root = tree.getroot()

# Extracting the data and headers dynamically
data = []
headers = []

for student in root:
    student_data = []
    for detail in student:
        if detail.tag not in headers:
            headers.append(detail.tag)
        student_data.append(detail.text)
    data.append(student_data)

# Creating a DataFrame
df = pd.DataFrame(data, columns=headers)

# Writing the DataFrame to a CSV file
df.to_csv("students.csv", index=False)

CSV Content:

name,rollnumber,age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13

Detailed Explanation

1. Importing Libraries

import xml.etree.ElementTree as ET
import pandas as pd

We begin by importing the ElementTree library, which is included in Python’s standard library and is used for parsing and creating XML data. We also import the Pandas library, a powerful tool for data manipulation and analysis.

2. Parsing the XML File

tree = ET.parse("students.xml")
root = tree.getroot()

The ET.parse() function is used to parse the XML file. It reads the file, parses the XML data, and returns an ElementTree object. We then obtain the root element of the XML document using the getroot() method.

3. Extracting Data and Headers Dynamically

data = []
headers = []
for student in root:
    student_data = []
    for detail in student:
        if not headers:
            headers.append(detail.tag)
        student_data.append(detail.text)
    data.append(student_data)

We initialize two empty lists, data and headers. We then iterate over each student element in the root of the XML document. For each student, we initialize an empty list named student_data to hold the data for that particular student.

We then iterate over each child element of the student element, extracting the tag names to form the headers and the text content to form the data. We check if the headers list is empty; if it is, we append the tag names. This ensures that we only populate the headers list once. The text content of each child element is appended to the student_data list, which is then appended to the data list to form a nested list of all students’ data.

4. Creating a DataFrame

df = pd.DataFrame(data, columns=headers)

We use Pandas to convert the extracted data into a DataFrame. The pd.DataFrame() function is called with the data list, and the columns parameter is set to the headers list. This organizes the data into a tabular format, making it easy to manipulate and analyze.

5. Writing the DataFrame to a CSV File

df.to_csv("students.csv", index=False)

Finally, we use the to_csv() method of the DataFrame object to write the data to a CSV file named ‘students.csv’. The index=False parameter ensures that the index is not written into the CSV file.

Convert XML to CSV Using Python xmltodict Library

xmltodict is a Python library that provides a convenient way to work with XML data in a more Pythonic and dictionary-like manner. It allows you to parse XML documents and convert them into easily navigable Python dictionaries. This makes it simpler to extract and manipulate data from XML files without the need for complex parsing or traversal code.

With xmltodict, you can convert XML data into a nested dictionary structure, where XML elements become dictionary keys, and their corresponding values are either nested dictionaries or lists, depending on the XML structure. This library streamlines the process of extracting specific data from XML files and simplifies working with XML-based APIs.

Before we proceed, ensure that you have both xmltodict and pandas installed. If not, you can install them via pip:

pip install xmltodict
pip install pandas

We will use the same XML file as above and will parse this XML data, convert it to a dictionary using xmltodict, transform it into a pandas DataFrame, and finally write it to a CSV file.

Python Code

Here is the Python code that performs the conversion:

import xmltodict
import pandas as pd
import json

# Load and parse the XML file
with open("students.xml", "r") as file:
    xml_content = file.read()

# Convert XML to dictionary
data_dict = xmltodict.parse(xml_content)

# Convert dictionary to JSON
json_data = json.dumps(data_dict["students"]["student"])

# Convert JSON to DataFrame
df = pd.read_json(json_data)

# Write DataFrame to CSV
df.to_csv("students.csv", index=False)

Explanation

  1. Loading and Parsing the XML File:

    with open("students.xml", "r") as file:
        xml_content = file.read()
    
  • Here, we open the ‘students.xml’ file in read mode (‘r’) and read its content into the xml_content variable. This XML file likely contains information about students.
  1. Converting XML to Dictionary:

    data_dict = xmltodict.parse(xml_content)
    
  • The xmltodict.parse() function is used to parse the XML content stored in xml_content and convert it into a Python dictionary. This dictionary represents the structured data from the XML file.
  1. Converting Dictionary to JSON:

    json_data = json.dumps(data_dict["students"]["student"])
    
  • We extract the ‘student’ data from the parsed dictionary using data_dict['students']['student'], which represents a list of student records.
  • Then, we use json.dumps() to convert this extracted data into a JSON formatted string, stored in the json_data variable.
  1. Converting JSON to DataFrame:

    df = pd.read_json(json_data)
    
  • With the JSON-formatted data in json_data, we utilize pd.read_json() to convert it into a pandas DataFrame. This DataFrame will now hold the student data in a tabular format, making it easier to work with.
  1. Writing DataFrame to CSV:

    df.to_csv("students.csv", index=False)
    
  • Finally, we take advantage of the DataFrame’s to_csv() method to export the student data into a CSV file named ‘students.csv’. The index=False parameter ensures that the DataFrame’s index is not included in the CSV file.

Output

The resulting CSV file, students.csv, will contain the following data:

name,rollnumber,age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13

Convert XML to CSV Using Python lxml and CSV Libraries

Next, we will delve into the process of converting an XML file to a CSV file using the lxml and csv libraries in Python.

The lxml library is a Python library that provides a way to work with XML and HTML documents. It is a high-performance, production-quality library that is easy to use and implements the ElementTree API, a Pythonic binding for the C libraries libxml2 and libxslt. It is feature-rich and includes support for XPath, XSLT, Relax NG, and more.

Ensure that you have the lxml library installed. If not, you can install it via pip:

pip install lxml

We will parse the same XML data using lxml, extract the required data, and then use the csv library to write this data into a CSV file.

Python Code

Here is the Python code that performs the conversion:

from lxml import etree
import csv

# Load and parse the XML file
tree = etree.parse("students.xml")
root = tree.getroot()

# Open a CSV file in write mode
with open("students.csv", mode="w", newline="") as file:
    writer = csv.writer(file)

    # Write the header dynamically
    headers = [element.tag for element in root.find(".//student")]
    writer.writerow(headers)

    # Extract data from XML and write to CSV dynamically
    for student in root.findall("student"):
        row_data = [element.text for element in student]
        writer.writerow(row_data)

Explanation

1. Loading and Parsing the XML File

tree = etree.parse("students.xml")
root = tree.getroot()

In these lines, the etree.parse() function from the lxml library is used to parse the XML file named ‘students.xml’. The parse() function reads the file, parses the XML data, and returns an ElementTree object. The getroot() method is then called on this object to get the root element of the XML document. The root element is the parent element that holds all other elements in the XML file.

2. Opening the CSV File in Write Mode

with open("students.csv", mode="w", newline="") as file:
    writer = csv.writer(file)

Here, a CSV file named ‘students.csv’ is opened in write mode. The with statement ensures that the file is properly closed after its suite finishes. The csv.writer() function is used to create a writer object for writing into the CSV file.

3. Dynamically Writing the Headers

headers = [element.tag for element in root.find(".//student/*")]
writer.writerow(headers)

This part is crucial as it ensures the code’s flexibility to handle any XML structure. The headers for the CSV file are dynamically extracted from the XML file. The find() method, combined with an XPath expression, is used to locate the first ‘student’ element in the XML file, and a list comprehension retrieves the tag names of all its child elements. These tag names become the headers of the CSV file, ensuring that no hardcoding of header names is necessary.

4. Extracting Data and Writing Rows Dynamically

for student in root.findall("student"):
    row_data = [element.text for element in student]
    writer.writerow(row_data)

In this segment, a for loop iterates over all ‘student’ elements found in the root of the XML document. For each ‘student’ element, another list comprehension is used to extract the text content of all its child elements. This data is stored in the row_data list, which is then written into the CSV file as a new row using the writerow() method of the writer object.

Output

The resulting CSV file, students.csv, will contain the following data:

Name,Roll Number,Age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13
Vaibhav Vaibhav avatar Vaibhav Vaibhav avatar

Vaibhav is an artificial intelligence and cloud computing stan. He likes to build end-to-end full-stack web and mobile applications. Besides computer science and technology, he loves playing cricket and badminton, going on bike rides, and doodling.

Related Article - Python XML

Related Article - Python CSV