How to Create an XML Parser in Python

Vaibhhav Khetarpal Feb 02, 2024
  1. Use the ElementTree API to Parse an XML Document in Python
  2. Use the minidom Module to Parse an XML Document in Python
  3. Use the Beautiful Soup Library to Parse an XML Document in Python
  4. Use the xmltodict Library to Parse an XML Document in Python
  5. Use the lxml Library to Parse an XML Document in Python
  6. Use the untangle Module to Parse an XML Document in Python
  7. Use the declxml Library to Parse an XML Document in Python
How to Create an XML Parser in Python

XML is an abbreviation for eXtensible Markup Language and is a self-descriptive language utilized to store and transport data. Python provides a medium for parsing and modification of an XML document.

This tutorial focuses on and demonstrates different methods to parse an XML document in Python.

Use the ElementTree API to Parse an XML Document in Python

The xml.etree.ElementTree module is utilized to generate an efficient yet simple API to parse the XML document and create XML data.

The following code uses the xml.etree.ElementTree module to parse an XML document in Python.

# >= Python 3.3 code
import xml.etree.ElementTree as ET

file1 = """<foo>
           <bar>
               <type foobar="Hello"/>
               <type foobar="God"/>
          </bar>
       </foo>"""
tree = ET.fromstring(file1)
x = tree.findall("bar/type")
for item in x:
    print(item.get("foobar"))

Output:

Hello
God

Here, we pass the XML data as a string within triple quotes. We can also import an actual XML document with the help of the parse() function of the ElementTree module.

The cElementTree module was the C implementation of the ElementTree API, with the only difference being that cElementTree is optimized. With that being said, it can parse about 15-20 times faster than the ElementTree module and uses a very low amount of memory.

However, in Python 3.3 and above, the cElementTree module has been deprecated, and the ElementTree module uses a faster implementation.

Use the minidom Module to Parse an XML Document in Python

The xml.dom.minidom can be defined as a basic implementation of the Document Object Model (DOM) interface. All the DOM applications usually begin with the parsing of an XML object. Therefore, this method is the quickest method to parse an XML document in Python.

The following code uses the parse() function from the minidom module to parse an XML document in Python.

XML File (sample1.xml):

<data>
    <strings>
        <string name="Hello"></string>
        <string name="God"></string>
    </strings>
</data>

Python Code:

from xml.dom import minidom

xmldoc = minidom.parse("sample1.xml")
stringlist = xmldoc.getElementsByTagName("string")
print(len(stringlist))
print(stringlist[0].attributes["name"].value)
for x in stringlist:
    print(x.attributes["name"].value)

Output:

2
Hello
God

This module also allows the XML to be passed as a string, similar to the ElementTree API. However, it uses the parseString() function to achieve this.

Both the xml.etree.ElementTree and xml.dom.minidom modules are said to be not safe against maliciously constructed data.

Use the Beautiful Soup Library to Parse an XML Document in Python

The Beautiful Soup library is designed for web scraping projects and pulling the data out from XML and HTML files. Beautiful Soup is really fast and can parse anything that it encounters.

This library even does the tree traversal process for the program and parses the XML document. Additionally, Beautiful Soup is also used to prettify the given source code.

The Beautiful Soup library needs to be manually installed and then imported to the Python code for this method. This library can be installed using the pip command. The Beautiful Soup 4 library, which is the latest version, works on Python 2.7 and above.

The following code uses the Beautiful Soup library to parse an XML document in Python.

from bs4 import BeautifulSoup

file1 = """<foo>
   <bar>
      <type foobar="Hello"/>
      <type foobar="God"/>
   </bar>
</foo>"""

a = BeautifulSoup(file1)
print(a.foo.bar.type["foobar"])
print(a.foo.bar.findAll("type"))

Output:

u'Hello'
[<type foobar="Hello"></type>, <type foobar="God"></type>]

Beautiful Soup is faster than any other tools used for parsing, but it might be hard to understand and implement this method sometimes.

Use the xmltodict Library to Parse an XML Document in Python

The xmltodict library helps in making the process on XML files similar to that of JSON. It can also be used in the case when we want to parse an XML file. The xmltodict module can be utilized in this case by parsing an XML file to an ordered dictionary.

The xmltodict library needs to be manually installed and then imported into the Python code that contains the XML file. The installation of xmltodict is pretty basic and can be done using the standard pip command.

The following code uses the xmltodict library to parse an XML document in Python.

import xmltodict

file1 = """<foo>
             <bar>
                 <type foobar="Hello"/>
                 <type foobar="God"/>
             </bar>
        </foo> """
result = xmltodict.parse(file1)
print(result)

Output:

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'Hello')]), OrderedDict([(u'@foobar', u'God')])])]))]))])

Use the lxml Library to Parse an XML Document in Python

The lxml library is able to provide a simple yet very powerful API in Python used to parse XML and HTML files. It combines the ElementTree API with libxml2/libxslt.

In simpler words, the lxml library further extends the old ElementTree library to offer support for much newer things like XML Schema, XPath, and XSLT.

Here, we will use the lxml.objectify library. The following code uses the lxml library to parse an XML document in Python.

from collections import defaultdict
from lxml import objectify

file1 = """<foo>
                <bar>
                    <type foobar="1"/>
                    <type foobar="2"/>
                </bar>
            </foo>"""

c = defaultdict(int)

root = objectify.fromstring(file1)

for item in root.bar.type:
    c[item.attrib.get("foobar")] += 1

print(dict(c))

Output:

{'1': 1, '2': 1}

Here, in this program, the c variable is used to store the count of each item available in a dictionary.

Use the untangle Module to Parse an XML Document in Python

The untangle module is an easy-to-implement module that focuses on converting XML into a Python Object. It can also be easily installed using the pip command. This module works with Python 2.7 and above.

The following code uses the untangle module to parse an XML document in Python.

XML File (sample1.xml):

<foo>
   <bar>
      <type foobar="Hello"/>
   </bar>
</foo>

Python code:

import untangle

x = untangle.parse("/path_to_xml_file/sample1.xml")
print(x.foo.bar.type["foobar"])

Output:

Hello

Use the declxml Library to Parse an XML Document in Python

The declxml library, an abbreviation for Declarative XML Processing, is utilized to provide a simple API to serialize and parsing XML documents. This library aims to reduce the programmer’s workload and replace the need to go through big and long chunks of code of the parsing logic requisite when using other popular APIs, such as minidom or ElementTree.

The declxml module can be installed easily in the system by using the pip or the pipenv command. The following code uses the declxml library to parse an XML document in Python.

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="3"/>
      <type foobar="5"/>
   </bar>
</foo>
"""

processor = xml.dictionary(
    "foo", [xml.dictionary("bar", [xml.array(xml.integer("type", attribute="foobar"))])]
)

xml.parse_from_string(processor, xml_string)

Output:

{'bar': {'foobar': [1, 3, 5]}}

In this method, we use processors for declaratively characterizing the structure of the given XML document and for mapping between XML and Python data structures.

Vaibhhav Khetarpal avatar Vaibhhav Khetarpal avatar

Vaibhhav is an IT professional who has a strong-hold in Python programming and various projects under his belt. He has an eagerness to discover new things and is a quick learner.

LinkedIn

Related Article - Python XML