Python for Data Science – Importing XML to Pandas DataFrame

In my previous post, I showed how easy to import data from CSV, JSON, Excel files using Pandas package. Another popular format to exchange data is XML. Unfortunately Pandas package does not have a function to import data from XML so we need to use standard XML package and do some extra work to convert the data to Pandas DataFrames.

Here’s a sample XML file (save it as test.xml):

<?xml version="1.0"?>
<data>
    <customer name="gokhan" >
        <email>gokhan@gmail.com</email>
        <phone>555-1234</phone>
    </customer>
    <customer name="mike" >
        <email>mike@gmail.com</email>
    </customer>
    <customer name="john" >
        <email>john@gmail.com</email>
        <phone>555-4567</phone>
    </customer>
    <customer name="david" >
        <phone>555-6472</phone>
        <address>
            <street>Fifth Avenue</street>
        </address>
    </customer>
</data>

We want to convert his to a dataframe which contains customer name, email, phone and street:

name             email     phone        street
0  gokhan  gokhan@gmail.com  555-1234          None
1    mike    mike@gmail.com      None          None
2    john    john@gmail.com  555-4567          None
3   david              None  555-6472  Fifth Avenue

As you can see, we need to read attribute of an XML tag (customer name), text value of sub elements (address/street), so although we will use a very simple method, it will show you how to parse even complex XML files using Python.

Python provides a comprehensive XML package which provides different APIs to parse XML. Unfortunately none of them are secure against erroneous or maliciously constructed data. Please do not forget to check XML vulnerabilities. Anyway, I prefer the ElementTree XML API which is a simple and lightweight XML processor.

To read and parse an XML file, all we need is to call the parse method:

import xml.etree.cElementTree as et
parsedXML = et.parse( "test.xml" )

The above code will return an ElementTree object, then we can use “iter()” method to generate an iterator (for specific XML elements) or “getroot()” to get the root element for this tree, and then iterate all elements.

for node in parsedXML.getroot():
    name = node.attrib.get('name')
    email = node.find('email')
    phone = node.find('phone')
    street = node.find('address/street')

It’s almost done. We get customer data (name, email, phone and street). To be able to add these data to a DataFrame, we need to define a DataFrame before we iterate elements, then for each customer, we build a Pandas.Series object (an array), and append this Series object to the DataFrame. While doing this, we should also be ready to handle “null” values because some XML paths might be missing on our XML file.

Here’s the full script and the output:

import pandas as p

import xml.etree.cElementTree as et
parsedXML = et.parse( "test.xml" )
dfcols = ['name','email','phone','street']
df = pd.DataFrame(columns=dfcols)
def getvalueofnode( node ):
return node.text if node is not None else None
for node in parsedXML.getroot():
name = node.attrib.get('name')
email = node.find('email')
phone = node.find('phone')
street = node.find('address/street')
df = df.append( pd.Series( 
    [name, getvalueofnode(email), getvalueofnode(phone), getvalueofnode(street)],
    index=dfcols) ,ignore_index=True)
print df
Output:
     name             email     phone        street
0  gokhan  gokhan@gmail.com  555-1234          None
1    mike    mike@gmail.com      None          None
2    john    john@gmail.com  555-4567          None
3   david              None  555-6472  Fifth Avenue

As you can see, I defined a function to handle missing paths, and also gave the same column names while appending the series to the DataFrame (otherwise they would be added as additional columns). Please do not hesitate to ask questions! See you on next blog post!

Oracle

About the Author

Gokhan Atil

About the Author

Gokhan Atil has over 15 years of experience in the IT industry and a strong background in database management (Oracle 8i,9i,10g,11g), software development and UNIX systems. He is an Oracle certified professional for EBS R12, Oracle 10g and 11g. Gokhan specializes in high availability solutions, performance tuning and monitoring tools. Gokhan is a founding member and current vice president of Turkish Oracle User Group (TROUG). He’s also a member of Independent Oracle User Group (IOUG). Gokhan presented at various conferences, and he is one of co-authors of “Expert Oracle Enterprise Manager 12c” book.

About the Author

Gokhan Atil

About the Author

Start the discussion at forums.toadworld.com