How to read XML file in Python

In this article, we will learn various ways to read XML files in Python, using built-in modules and third-party libraries along with some custom examples. We will first look at what XML stands for and what it is used for, and then go through the parsing modules that can read XML documents in Python.

Introduction to XML

XML stands for Extensible Markup Language. It is commonly used to store and exchange small to medium amounts of structured data. Because the format is plain text and self-describing, programmers can write applications that read data produced by other applications. Reading the information in an XML file and analyzing its logical structure is known as parsing. Reading an XML file is therefore the same as parsing the XML document.

In this article, we will look at four ways to read XML documents, each using a different XML module. They are:

1. MiniDOM (Minimal Document Object Model)

2. BeautifulSoup alongside the lxml parser

3. ElementTree

4. Simple API for XML (SAX)

XML File: We are using this XML file, saved as models.xml, in our first example.

<data>
    <models>
        <model name="model1">model1abc</model>
        <model name="model2">model2abc</model>
    </models>
</data>

Read XML File Using MiniDOM

minidom is a Python standard-library module used to read XML files. It provides the parse() function to read an XML file. We must import minidom before using this function in the application. The syntax of the function is given below.

Syntax

xml.dom.minidom.parse(filename_or_file[, parser[, bufsize]])

This function returns a Document object that represents the parsed XML.
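As a minimal sketch of what parse() hands back, the snippet below uses minidom.parseString(), the string counterpart of parse(), so that it runs without a file on disk. The XML_TEXT constant is an assumption made for illustration; it is an inline copy of the sample file shown earlier.

```python
from xml.dom import minidom

# Inline copy of the sample document (an assumption for this sketch),
# so no models.xml file is needed to run it.
XML_TEXT = """<data>
    <models>
        <model name="model1">model1abc</model>
        <model name="model2">model2abc</model>
    </models>
</data>"""

# parseString() parses XML held in a string; like parse(), it returns a Document.
doc = minidom.parseString(XML_TEXT)

# documentElement is the root element of the document.
root_tag = doc.documentElement.tagName

# getElementsByTagName() searches the whole tree, not just direct children.
model_count = len(doc.getElementsByTagName("model"))

print(type(doc).__name__)
print(root_tag)
print(model_count)
```

With a real file, the same attributes and methods would be available on the object returned by minidom.parse('models.xml').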

Example Read XML File in Python

Since each node is treated as an object, we can access the attributes and text of an element through the properties of that object. In the example below, we access the attributes and text of selected nodes.

from xml.dom import minidom

# parse an xml file by name
file = minidom.parse('models.xml')

# use getElementsByTagName() to get all elements with the tag 'model'
models = file.getElementsByTagName('model')

# one specific item attribute
print('model #2 attribute:')
print(models[1].attributes['name'].value)

# all item attributes
print('\nAll attributes:')
for elem in models:
  print(elem.attributes['name'].value)

# one specific item's data
print('\nmodel #2 data:')
print(models[1].firstChild.data)
print(models[1].childNodes[0].data)

# all items data
print('\nAll model data:')
for elem in models:
  print(elem.firstChild.data)


model #2 attribute:
model2

All attributes:
model1
model2

model #2 data:
model2abc
model2abc

All model data:
model1abc
model2abc
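The example above reads text through firstChild.data, which works here because every model element holds exactly one text node. As a hedged variation, the sketch below collects each model into a dictionary and joins all text-node children, so an empty element yields an empty string instead of an error. The helper name models_to_dict and the inline XML_TEXT are assumptions made for illustration.

```python
from xml.dom import minidom

# Inline copy of the sample document (an assumption for this sketch).
XML_TEXT = """<data>
    <models>
        <model name="model1">model1abc</model>
        <model name="model2">model2abc</model>
    </models>
</data>"""


def models_to_dict(document):
    """Map each <model> element's name attribute to its text content."""
    result = {}
    for elem in document.getElementsByTagName("model"):
        name = elem.getAttribute("name")
        # Join every text-node child; an element may have several, or none.
        text = "".join(
            node.data
            for node in elem.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        result[name] = text
    return result


doc = minidom.parseString(XML_TEXT)
models = models_to_dict(doc)
print(models)
```

getAttribute() returns an empty string when the attribute is missing, which may be more convenient than the KeyError raised by attributes['name'].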

Read XML File Using BeautifulSoup alongside the lxml parser

In this example, we will use a Python library named Beautiful Soup. Beautiful Soup supports the HTML parser included in Python's standard library (html.parser) as well as third-party parsers such as lxml, which it uses to parse XML. Use the following commands to install Beautiful Soup and the lxml parser if they are not already installed.

# for Beautiful Soup
pip install beautifulsoup4

# for the lxml parser
pip install lxml

After successful installation, these libraries can be used in Python code.

We will read the following XML file, saved as models.xml, with Python code.

<model>
  <child name="Acer" qty="12">Acer is a laptop</child>
  <unique>Add model number here</unique>
  <child name="Onida" qty="10">Onida is an oven</child>
  <child name="Acer" qty="7">Exclusive</child>
  <unique>Add price here</unique>
  <data>Add content here
     <family>Add company name here</family>
     <size>Add number of employees here</size>
  </data>
</model>

Example Read XML File in Python

Let's read the above file using the Beautiful Soup library in a Python script.

from bs4 import BeautifulSoup 

# Reading the contents of the xml file into a variable named data
with open('models.xml', 'r') as f:
    data = f.read() 

# Passing the stored data inside the beautifulsoup parser 
bs_data = BeautifulSoup(data, 'xml') 

# Finding all instances of the `unique` tag
b_unique = bs_data.find_all('unique') 
print(b_unique) 

# Using find() to get the first `child` tag whose name attribute is 'Acer'
b_name = bs_data.find('child', {'name':'Acer'}) 
print(b_name) 

# Extracting the data stored in a specific attribute of the `child` tag 
value = b_name.get('qty') 
print(value)


[<unique>Add model number here</unique>, <unique>Add price here</unique>]
<child name="Acer" qty="12">Acer is a laptop</child>
12

Read XML File Using ElementTree

The ElementTree module provides several tools for manipulating XML files. It is part of Python's standard library, so no installation is required. Because XML is a hierarchical data format, it is natural to represent it as a tree. ElementTree represents the whole XML document as a single tree.

Example Read XML File in Python

To read an XML file, we first import the xml.etree.ElementTree module. We then pass the filename of the XML file to ET.parse() to parse it, and call getroot() to get the root (parent) element, which we print. The attributes of the first child of the root are available as root[0].attrib. Finally, we print the text of the first child of the root's sixth child (index 5, the data element). This example reads the same XML file as the Beautiful Soup example above.

# importing element tree
import xml.etree.ElementTree as ET 

# Pass the path of the xml document 
tree = ET.parse('models.xml') 

# get the parent tag 
root = tree.getroot() 

# print the root (parent) tag along with its memory location 
print(root) 

# print the attributes of the first child of the root
print(root[0].attrib) 

# print the text of the first child of the root's sixth child (index 5)
print(root[5][0].text) 


<Element 'model' at 0x0000024B14D334F8>
{'name': 'Acer', 'qty': '12'}
Add company name here
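Positional indexes such as root[5][0] break as soon as the order of elements changes. As a hedged alternative, the sketch below looks elements up by tag name with findall() and find(), which are part of the ElementTree API. It uses ET.fromstring() on an inline copy of the sample file (the XML_TEXT constant is an assumption for illustration) so that it runs without a file on disk.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document (an assumption for this sketch).
XML_TEXT = """<model>
  <child name="Acer" qty="12">Acer is a laptop</child>
  <unique>Add model number here</unique>
  <child name="Onida" qty="10">Onida is an oven</child>
  <child name="Acer" qty="7">Exclusive</child>
  <unique>Add price here</unique>
  <data>Add content here
     <family>Add company name here</family>
     <size>Add number of employees here</size>
  </data>
</model>"""

# fromstring() parses XML from a string and returns the root element directly.
root = ET.fromstring(XML_TEXT)

# findall() returns the direct children of root with the given tag.
# Attribute values are always strings, so qty is converted explicitly.
quantities = [(c.get("name"), int(c.get("qty"))) for c in root.findall("child")]

# find() accepts a simple path and returns the first matching element.
family = root.find("data/family").text

print(quantities)
print(family)
```

With a file on disk, root would come from ET.parse('models.xml').getroot() and the lookups would stay the same.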

Read XML File Using Simple API for XML (SAX)

In this method, we first register callbacks for the events that occur during parsing, and the parser then proceeds through the document, calling them as it goes. This can be useful when documents are large or memory is limited: the parser processes the file as it reads it from disk, and the entire file is never stored in memory. Reading XML using this method requires creating a handler by subclassing xml.sax.ContentHandler.

Note: The xml.sax module is part of the standard library in Python 3. The example below uses Python 3 syntax, so please check your version before running it.

  • ContentHandler - handles the tags and attributes of XML. The ContentHandler is called at the beginning and at the end of every element.
  • startDocument and endDocument - called at the start and the end of the XML file.
  • startElement and endElement - if the parser isn't in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called.
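To make the order of these callbacks concrete, here is a minimal sketch that records every event in a list instead of printing it. The EventLogger class name and the tiny inline document are assumptions made for illustration; xml.sax.parseString() is used so that no file is needed.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        # attributes behaves like a read-only mapping; copy it into a dict.
        self.events.append(("start", tag, dict(attributes.items())))

    def endElement(self, tag):
        self.events.append(("end", tag))


handler = EventLogger()

# parseString() parses in-memory XML; namespace mode is off by default,
# so startElement/endElement are called rather than the NS variants.
xml.sax.parseString(
    b'<collection shelf="New Arrivals"><model number="ST001"/></collection>',
    handler,
)

for event in handler.events:
    print(event)
```

The recorded list starts with startDocument, nests the start and end events of model inside those of collection, and finishes with endDocument.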

XML file

<collection shelf="New Arrivals">
<model number="ST001">
   <price>35000</price>
   <qty>12</qty>
   <company>Samsung</company>
</model>
<model number="RW345">
   <price>46500</price>
   <qty>14</qty>
   <company>Onida</company>
</model>
<model number="EX366">
   <price>30000</price>
   <qty>8</qty>
   <company>Lenovo</company>
</model>
<model number="FU699">
   <price>45000</price>
   <qty>12</qty>
   <company>Acer</company>
</model>
</collection>

Python Code Example

import xml.sax

class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        # initialise the base ContentHandler as well
        super().__init__()
        self.CurrentData = ""
        self.price = ""
        self.qty = ""
        self.company = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "model":
            print("*****Model*****")
            title = attributes["number"]
            print("Model number:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "price":
            print("Price:", self.price)
        elif self.CurrentData == "qty":
            print("Quantity:", self.qty)
        elif self.CurrentData == "company":
            print("Company:", self.company)
        self.CurrentData = ""

    # Called with the character data found inside an element
    def characters(self, content):
        if self.CurrentData == "price":
            self.price = content
        elif self.CurrentData == "qty":
            self.qty = content
        elif self.CurrentData == "company":
            self.company = content

# create an XMLReader
parser = xml.sax.make_parser()

# turn off namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# override the default ContentHandler
handler = XMLHandler()
parser.setContentHandler(handler)
parser.parse("models.xml")


*****Model*****
Model number: ST001
Price: 35000
Quantity: 12
Company: Samsung
*****Model*****
Model number: RW345
Price: 46500
Quantity: 14
Company: Onida
*****Model*****
Model number: EX366
Price: 30000
Quantity: 8
Company: Lenovo
*****Model*****
Model number: FU699
Price: 45000
Quantity: 12
Company: Acer
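The handler above prints values as it goes and keeps only the most recent chunk passed to characters(). A SAX parser is allowed to deliver the text of one element in several chunks, so a more defensive variation buffers the chunks and joins them when the element ends. The sketch below does that and collects each model into a dictionary; the ModelCollector class name and the shortened inline document are assumptions made for illustration.

```python
import xml.sax

# Shortened inline copy of the sample document (an assumption for this sketch).
XML_BYTES = b"""<collection shelf="New Arrivals">
<model number="ST001">
   <price>35000</price>
   <qty>12</qty>
   <company>Samsung</company>
</model>
<model number="RW345">
   <price>46500</price>
   <qty>14</qty>
   <company>Onida</company>
</model>
</collection>"""


class ModelCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers each <model> into a dict."""

    FIELDS = ("price", "qty", "company")

    def __init__(self):
        super().__init__()
        self.models = []      # finished records
        self._current = None  # record being built
        self._buffer = []     # text chunks of the element being read

    def startElement(self, tag, attributes):
        if tag == "model":
            self._current = {"number": attributes["number"]}
        self._buffer = []

    def characters(self, content):
        # May be called several times per element, so accumulate the chunks.
        self._buffer.append(content)

    def endElement(self, tag):
        if tag in self.FIELDS and self._current is not None:
            self._current[tag] = "".join(self._buffer).strip()
        elif tag == "model":
            self.models.append(self._current)
            self._current = None
        self._buffer = []


collector = ModelCollector()
xml.sax.parseString(XML_BYTES, collector)

for model in collector.models:
    print(model)
```

Returning data instead of printing it keeps the streaming benefit of SAX while letting the caller decide what to do with each record.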

Conclusion

In this article, we learned about XML files and different ways to read them in Python using several modules and APIs: minidom, Beautiful Soup with the lxml parser, ElementTree, and the Simple API for XML (SAX). We also wrote some custom parsing code to read the XML files.


