Introduction
XML (Extensible Markup Language) is a markup language used to store structured data. The Pandas data analysis library provides functions to read/write data for most of the file types.
For example, it includes read_csv()
and to_csv()
for interacting with CSV files. However, Pandas does not include any methods to read and write XML files.
In this article, we will take a look at how we can use other modules to read data from an XML file, and load it into a Pandas DataFrame. We'll also take data from a Pandas DataFrame and write it to an XML file.
Reading XML with Pandas
Let's have a look at a few ways to read XML data and put it in a Pandas DataFrame.
For this section, we'll use one set of input data for every script. Save the following XML in a file called properties.xml
:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<bathrooms>
<n35237 type="number">1.0</n35237>
<n32238 type="number">3.0</n32238>
<n44699 type="number">nan</n44699>
</bathrooms>
<price>
<n35237 type="number">7020000.0</n35237>
<n32238 type="number">10000000.0</n32238>
<n44699 type="number">4128000.0</n44699>
</price>
<property_id>
<n35237 type="number">35237.0</n35237>
<n32238 type="number">32238.0</n32238>
<n44699 type="number">44699.0</n44699>
</property_id>
</root>
Reading with xml.etree.ElementTree
The xml.etree.ElementTree
module comes built-in with Python. It provides functionality for parsing and creating XML documents. ElementTree
represents the XML document as a tree. We can move across the document using nodes which are elements and sub-elements of the XML file.
In this approach, we read the file content in a variable and use ET.XML()
to parse the XML document from the string constant. We will loop across each child and sub child maintaining a list of data they contain. Meanwhile, writing child tags for the DataFrame column. Then we write this data into a DataFrame.
Note: When reading data from XML, we have to transpose the DataFrame, as the data list's sub-elements are written in columns. We need to write them as rows in the DataFrame.
Let's look at the code to demonstrate use of xml.etree.ElementTree
:
import xml.etree.ElementTree as ET
import pandas as pd
xml_data = open('properties.xml', 'r').read() # Read file
root = ET.XML(xml_data) # Parse XML
data = []
cols = []
for i, child in enumerate(root):
data.append([subchild.text for subchild in child])
cols.append(child.tag)
df = pd.DataFrame(data).T # Write in DF and transpose it
df.columns = cols # Update column names
print(df)
The above code will produce this output (varies on the input file used):
bathrooms price property_id
0 1.0 7020000.0 35237.0
1 3.0 10000000.0 32238.0
2 nan 4128000.0 44699.0
Reading with lxml
The lxml
library is a Python binding for the C libraries libxml2
and libxslt
. It also extends the native ElementTree
module. As this is a third-party module, you'll need to install it with pip
like this:
$ pip install lxml
Unlike ElementTree
, we don't read the file data and parse it. We can directly use objectify.parse()
and give it the path to the XML file. To get the root element, we will use getroot()
on the parsed XML data.
Now we can loop through the children elements of the root node and write them into a Python list. Like before, we'll create a DataFrame using the data list, and transpose the DataFrame.
Let's look at the code to create a Pandas DataFrame using lxml
:
from lxml import objectify
import pandas as pd
xml_data = objectify.parse('properties.xml') # Parse XML data
root = xml_data.getroot() # Root element
data = []
cols = []
for i in range(len(root.getchildren())):
child = root.getchildren()[i]
data.append([subchild.text for subchild in child.getchildren()])
cols.append(child.tag)
df = pd.DataFrame(data).T # Create DataFrame and transpose it
df.columns = cols # Update column names
print(df)
If we run this on Python interpreter, we see the following output:
bathrooms price property_id
0 1.0 7020000.0 35237.0
1 3.0 10000000.0 32238.0
2 nan 4128000.0 44699.0
Reading with xmltodict
The xmltodict
module converts the XML data into a Python dictionary as the name suggests. Like lxml
, this is a third-party module we need to install with pip
:
$ pip install xmltodict
Like we've done before, we read the XML contents into a variable. We give this data in the parse()
method which returns a dictionary of the XML data. It will be a nested dictionary that has elements and sub-elements of the XML file. We can loop through the elements and write them into a data list that we use to create a DataFrame.
Let's have a look at the code to parse XML data to create a DataFrame using xmltodict
:
import xmltodict
import pandas as pd
xml_data = open('properties.xml', 'r').read() # Read data
xmlDict = xmltodict.parse(xml_data) # Parse XML
cols = xmlDict['root'].keys()
data = []
for i in xmlDict['root']:
child = xmlDict['root'][i]
data.append([child[subchild]['#text'] for subchild in child])
df = pd.DataFrame(data).T # Create DataFrame and transpose it.
df.columns = cols
print(df)
If we run the above code, we can see the output as:
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
bathrooms price property_id
0 1.0 7020000.0 35237.0
1 3.0 10000000.0 32238.0
2 nan 4128000.0 44699.0
Note: The xmltodict
library is not recommended for huge XML files as many developers have observed performance drops. The lxml
library is considered to be the fastest at working with XML, even faster than the included xml.etree.ElementTree
.
Use what's best for your project, and if performance is critical you should run tests with each library.
Writing XML with Pandas
Let's look at various ways to write Pandas DataFrame in an XML file. Every script we use below will create a new file called coordinates.xml with the following contents:
<root>
<A>
<X>1.3</X>
<Y>2.6</Y>
<Z>2.1</Z>
</A>
<B>
<X>1.4</X>
<Y>1.4</Y>
<Z>5.6</Z>
</B>
<C>
<X>5.2</X>
<Y>4.6</Y>
<Z>4.6</Z>
</C>
</root>
Writing with Built-in write() Function
We can use the included write()
function for files to write a DataFrame as an XML file. To accomplish this, we'll keep a list of the XML data, such that each item represents a line in XML. We will then iterate over the DataFrame and write the data with appropriate opening and closing tags of XML in the data list.
Once that's complete, we iterate over the list once more to write the data into the XML file. Here's the code that shows the use of write()
:
import pandas as pd
df = pd.DataFrame([[1.3, 1.4, 5.2],
[2.6, 1.4, 4.6],
[2.1, 5.6, 4.6]],
columns=['A', 'B', 'C'],
index=['X', 'Y', 'Z'])
xml_data = ['<root>']
for column in df.columns:
xml_data.append('<{}>'.format(column)) # Opening element tag
for field in df.index:
# writing sub-elements
xml_data.append('<{0}>{1}</{0}>'.format(field, df[column][field]))
xml_data.append('</{}>'.format(column)) # Closing element tag
xml_data.append('</root>')
with open('coordinates.xml', 'w') as f: # Writing in XML file
for line in xml_data:
f.write(line)
Running this code will produce a file named coordinates.xml in the current directory.
Writing XML Files with xml.etree.ElementTree
The default xml.etree.ElementTree
module can be used to store data as XML and convert it to a string so it can be written to a file.
Our first step is to create the root element. We then iterate over the columns and rows of the DataFrame, adding them as elements and sub-elements in the ElementTree. Then we convert the ElementTree
object's data into a binary string using the tostring()
method. We then write this string data into a file.
As the XML data is a binary string, we decode it into UTF-8 before writing it into the file.
The following code uses xml.etree.ElementTree
to write a DataFrame as an XML file:
import xml.etree.ElementTree as ET
import pandas as pd
df = pd.DataFrame([[1.3, 1.4, 5.2],
[2.6, 1.4, 4.6],
[2.1, 5.6, 4.6]],
columns=['A', 'B', 'C'],
index=['X', 'Y', 'Z'])
header = df.columns
root = ET.Element('root') # Root element
for column in df.columns:
entry = ET.SubElement(root, column) # Adding element
for row in df.index:
schild = row
child = ET.SubElement(entry, schild) # Adding sub-element
child.text = str(df[column][schild])
xml_data = ET.tostring(root) # binary string
with open('coordinates.xml', 'w') as f: # Write in file as utf-8
f.write(xml_data.decode('utf-8'))
Like before, running this script will create a coordinates.xml file with the expected output.
Writing XML Files with lxml
Using lxml
is similar to how we used xml.etree.ElementTree
. We begin by creating an etree
object with the root element of the file we're creating. We then iterate over the DataFrame, adding columns and rows as elements and sub-elements of the tree. Lastly, we use the tostring()
method to get the etree
as a binary string. We write the file after decoding the binary string into UTF-8.
Here is the code to write DataFrame as XML using lxml
:
from lxml import etree as et
import pandas as pd
root = et.Element('root') # Create root element
df = pd.DataFrame([[1.3, 1.4, 5.2],
[2.6, 1.4, 4.6],
[2.1, 5.6, 4.6]],
columns=['A', 'B', 'C'],
index=['X', 'Y', 'Z'])
for column in df.columns:
entry = et.SubElement(root, column) # Writing element
for row in df.index:
schild = row
child = et.SubElement(entry, schild) # Writing sub-elements
child.text = str(df[column][schild])
xml_data = et.tostring(root) # binary string
with open('coordinates.xml', 'w') as f: # Write in XML file as utf-8
f.write(xml_data.decode('utf-8'))
Upon successful completion, you will see the coordinates.xml with the XML coordinates.
Conclusion
This tutorial shows various ways we can read and write XML data with Pandas DataFrames. You can read data with the built-in xml.etree.ElementTree
module, as well as two third-party modules: lxml
and xmltodict
. The lxml
library is the fastest and most efficient for working with XML data.
For writing a Pandas DataFrame to an XML file, we have used conventional file write()
with lists, the xml.etree.ElementTree
module, and lxml
. Given that manipulating XML strings directly to write a file is more prone to human error, xml.etree.ElementTree
and lxml
are the preferable solutions for exporting a DataFrame to XML.