2022-07-01

Python ElementTree notes

When developing with the libvirt Python API, parsing and manipulating XML is frequently required. ElementTree, part of the Python standard library and introduced in python-xml-parse, is used to simplify handling of the XML configuration lifecycle.

This post walks through xml.etree.ElementTree in some typical situations and serves as learning notes.

First, start with some basic concepts

The Element type is a flexible container object, designed to store hierarchical data structures in memory. The type can be described as a cross between a list and a dictionary.

Each element has a number of properties associated with it:

a tag which is a string identifying what kind of data this element represents (the element type, in other words).
a number of attributes, stored in a Python dictionary.
a text string.
an optional tail string.
a number of child elements, stored in a Python sequence
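
To make these properties concrete, here is a minimal sketch written for Python 3. The `<p>` snippet is made up purely for illustration and is not part of the sample data used in the rest of this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet, chosen only to show every property at once.
elem = ET.fromstring('<p class="intro">Hello <b>world</b>!</p>')

tag = elem.tag              # the element type
attrib = elem.attrib        # attributes as a dictionary
text = elem.text            # text before the first child
children = list(elem)       # child elements as a sequence
child_text = elem[0].text   # text inside <b>
child_tail = elem[0].tail   # text after </b>, up to the next tag

print(tag, attrib, repr(text), len(children), repr(child_text), repr(child_tail))
```

Note that the trailing "!" belongs to the tail of the child `<b>`, not to the text of `<p>`.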

Use the following XML as sample data:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Load the XML from a file:

Python 2.7.5 (default, Aug  4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test_data.xml')
>>> root = tree.getroot()
>>> root
<Element 'data' at 0x7f58bd8232d0>

Or load the XML from a string:

>>> test_data_str = '''<?xml version="1.0"?>
... <data>
...     <country name="Liechtenstein">
...         <rank>1</rank>
...         <year>2008</year>
...         <gdppc>141100</gdppc>
...         <neighbor name="Austria" direction="E"/>
...         <neighbor name="Switzerland" direction="W"/>
...     </country>
...     <country name="Singapore">
...         <rank>4</rank>
...         <year>2011</year>
...         <gdppc>59900</gdppc>
...         <neighbor name="Malaysia" direction="N"/>
...     </country>
...     <country name="Panama">
...         <rank>68</rank>
...         <year>2011</year>
...         <gdppc>13600</gdppc>
...         <neighbor name="Costa Rica" direction="W"/>
...         <neighbor name="Colombia" direction="E"/>
...     </country>
... </data>'''
>>> ET.fromstring(test_data_str)
<Element 'data' at 0x7f58bd823a10>
>>> root = ET.fromstring(test_data_str)
>>> root
<Element 'data' at 0x7f58bd823f10>
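
The difference between the two loaders can be sketched as follows (Python 3; the shortened XML and the use of `io.StringIO` in place of a real file are assumptions made to keep the example self-contained). `ET.parse` returns an ElementTree, so `getroot()` is needed, while `ET.fromstring` returns the root Element directly.

```python
import io
import xml.etree.ElementTree as ET

xml_text = '<data><country name="Panama"><rank>68</rank></country></data>'

# ET.parse accepts a filename or a file-like object and returns an ElementTree.
tree = ET.parse(io.StringIO(xml_text))
root_from_file = tree.getroot()

# ET.fromstring returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

print(type(tree).__name__, root_from_file.tag, root_from_string.tag)
```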

Since root is an Element, use dir to check what is inside the element we just loaded:

>>> dir(root)
['__class__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_children', 'append', 'attrib', 'clear', 'copy', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']

As we can see, the available operations are listed. Next, consider some typical use cases.

find attributes

Before finding attributes, check which attributes the XML has.

The root node has only a tag and no attributes. Use tag and attrib to confirm this before looking for attributes.

>>> root.tag
'data'
>>> root.attrib
{}

The return values are what we expected. Going deeper, there is a country tag with a name attribute.

Iteration can be used to get the child tags of root.

>>> for child in root:
...     child
...
<Element 'country' at 0x7f58bd823f50>
<Element 'country' at 0x7f58bd825110>
<Element 'country' at 0x7f58bd825250>

Or use an index to access a child element directly:

>>> root[0]
<Element 'country' at 0x7f58bd823f50>
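
Because an Element behaves like a sequence of its children, the usual list operations apply. A small sketch for Python 3, using a shortened copy of the sample data as an assumption:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data>'
    '<country name="Liechtenstein"><rank>1</rank><year>2008</year></country>'
    '<country name="Singapore"><rank>4</rank><year>2011</year></country>'
    '</data>'
)

count = len(root)                    # number of direct children
first_year = root[0][1].text         # second child of the first country
last_name = root[-1].get('name')     # negative indexes work as with lists
names = [child.get('name') for child in root]

print(count, first_year, last_name, names)
```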

When there are many repeated tags, these methods become hard to use, so use iter or findall:

>>> for neighbor in root.iter('neighbor'):
...     print neighbor.attrib
...
{'direction': 'E', 'name': 'Austria'}
{'direction': 'W', 'name': 'Switzerland'}
{'direction': 'N', 'name': 'Malaysia'}
{'direction': 'W', 'name': 'Costa Rica'}
{'direction': 'E', 'name': 'Colombia'}

All tags matching neighbor are listed.

>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print name, rank
...
Liechtenstein 1
Singapore 4
Panama 68

With findall, every tag named country is found, and its rank text and name attribute are listed.
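
One difference between the two is worth spelling out: iter searches the whole subtree, while findall with a plain tag name only looks at direct children. The following sketch (Python 3, with a shortened copy of the sample data as an assumption) shows this, along with the path forms that findall accepts for reaching deeper levels.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data>'
    '<country name="Singapore"><neighbor name="Malaysia"/></country>'
    '<country name="Panama">'
    '<neighbor name="Costa Rica"/><neighbor name="Colombia"/>'
    '</country>'
    '</data>'
)

direct = root.findall('neighbor')                # direct children of <data> only
recursive = list(root.iter('neighbor'))          # every descendant
via_path = root.findall('.//neighbor')           # descendant search via a path
per_country = root.findall('country/neighbor')   # explicit two-level path

print(len(direct), len(recursive), len(via_path), len(per_country))
```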

Change the findall target to test what happens when no tag matches:

>>> for country in root.findall('test'):
...     print country
...

When find is used instead of findall:

>>> for tag in root.find('country'):
...     print tag
...
<Element 'rank' at 0x7f58bd823f90>
<Element 'year' at 0x7f58bd823fd0>
<Element 'gdppc' at 0x7f58bd825050>
<Element 'neighbor' at 0x7f58bd825090>
<Element 'neighbor' at 0x7f58bd8250d0>

Only the first matching element is returned, and iterating over it yields its children.

If find is called with a tag that does not exist, None is returned:

>>> print root.find('test')
None

So in most cases, find and findall seem to meet all the requirements for finding a specific tag.
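
Since find can return None, chaining `.text` onto it fails when the tag is missing. A minimal sketch for Python 3 of two ways around that, using a shortened copy of the sample data as an assumption: compare against None explicitly, or use findtext with a default value.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><country name="Panama"><rank>68</rank></country></data>'
)
country = root.find('country')

missing = country.find('gdppc')    # no such child here, so this is None
# country.find('gdppc').text would raise AttributeError at this point.

if missing is None:                # explicit None check, not truthiness
    gdppc = country.findtext('gdppc', default='n/a')

rank = int(country.findtext('rank', default='0'))

print(missing, gdppc, rank)
```

The explicit `is None` comparison matters because an element without children evaluates as false even when it was found.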

use tag

Get an attribute of a tag:

>>> root.find('country').get('name')
'Liechtenstein'

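Get the text inside a tag. The text of country itself is only the whitespace before its first child:

>>> root.find('country').text
'\n        '

So read the text from the child that actually holds the value:

>>> root.find('country').find('year').text
'2008'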
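
The same lookups can be made a little more defensive. This is only a sketch for Python 3; the `capital` attribute is hypothetical and deliberately absent, to show the default argument of get.

```python
import xml.etree.ElementTree as ET

country = ET.fromstring(
    '<country name="Liechtenstein">\n    <year>2008</year>\n</country>'
)

name = country.get('name')
capital = country.get('capital', 'unknown')   # default when attribute is absent
own_text = (country.text or '').strip()       # only indentation, so empty
year = int(country.find('year').text)         # text is a string, convert as needed

print(name, capital, repr(own_text), year)
```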

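List all children. Note that getchildren() is deprecated since Python 2.7 and was removed in Python 3.9, where list(root) gives the same result:

>>> root.getchildren()
[<Element 'country' at 0x7f58bd825bd0>, <Element 'country' at 0x7f58bd823f10>, <Element 'country' at 0x7f58bd8239d0>]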

insert tag

Create a new element from a string:

>>> new_element_str='''    <country name="China">
...         <rank>2</rank>
...         <year>2022</year>
...         <neighbor name="Japan" direction="E"/>
...     </country>'''
>>> new = ET.fromstring(new_element_str)
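
As an alternative to parsing a string, the same element could be built programmatically. A minimal sketch for Python 3 using ET.Element and ET.SubElement, which avoids having to escape values by hand:

```python
import xml.etree.ElementTree as ET

# Build <country name="China"> with the same children as the string above.
new = ET.Element('country', {'name': 'China'})
ET.SubElement(new, 'rank').text = '2'
ET.SubElement(new, 'year').text = '2022'
ET.SubElement(new, 'neighbor', name='Japan', direction='E')

xml_out = ET.tostring(new, encoding='unicode')
print(xml_out)
```

An element built this way has no indentation whitespace, so the serialized output is a single line.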

Check the original element tree:

>>> ET.tostring(root)
'<data>\n    <country name="Liechtenstein">\n        <rank>1</rank>\n        <year>2008</year>\n        <gdppc>141100</gdppc>\n        <neighbor direction="E" name="Austria" />\n        <neighbor direction="W" name="Switzerland" />\n    </country>\n    <country name="Singapore">\n        <rank>4</rank>\n        <year>2011</year>\n        <gdppc>59900</gdppc>\n        <neighbor direction="N" name="Malaysia" />\n    </country>\n    <country name="Panama">\n        <rank>68</rank>\n        <year>2011</year>\n        <gdppc>13600</gdppc>\n        <neighbor direction="W" name="Costa Rica" />\n        <neighbor direction="E" name="Colombia" />\n    </country>\n</data>'

Insert the new element at index 0:

>>> root.insert(0, new)
>>> ET.tostring(root)
'<data>\n    <country name="China">\n        <rank>2</rank>\n        <year>2022</year>\n        <neighbor direction="E" name="Japan" />\n    </country><country name="Liechtenstein">\n        <rank>1</rank>\n        <year>2008</year>\n        <gdppc>141100</gdppc>\n        <neighbor direction="E" name="Austria" />\n        <neighbor direction="W" name="Switzerland" />\n    </country>\n    <country name="Singapore">\n        <rank>4</rank>\n        <year>2011</year>\n        <gdppc>59900</gdppc>\n        <neighbor direction="N" name="Malaysia" />\n    </country>\n    <country name="Panama">\n        <rank>68</rank>\n        <year>2011</year>\n        <gdppc>13600</gdppc>\n        <neighbor direction="W" name="Costa Rica" />\n        <neighbor direction="E" name="Colombia" />\n    </country>\n</data>'

Confirm that the new element was added:

>>> for country in root.findall('country'):
...     country.get('name')
...
'China'
'Liechtenstein'
'Singapore'
'Panama'
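
In the serialized output above, the inserted country runs straight into the next one (`</country><country ...>`) because a freshly created element has no tail. One possible way to keep the layout tidy, sketched for Python 3 on a shortened copy of the data (an assumption), is to copy the parent's leading whitespace into the new element's tail before inserting. The sketch also shows append, which adds at the end instead of at a given index.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<data>\n    <country name="Singapore"/>\n</data>')
new = ET.Element('country', name='China')

# root.text holds the newline and indentation before the first child;
# reusing it as the tail keeps the following sibling on its own line.
new.tail = root.text
root.insert(0, new)
after_insert = ET.tostring(root, encoding='unicode')

# append adds the element as the last child.
root.append(ET.Element('country', name='Panama'))
names = [country.get('name') for country in root]

print(after_insert)
print(names)
```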