Parsing XML Documents with ElementTree
Parsed XML documents are represented in memory by ElementTree and
Element objects connected into a tree structure based on the way the
nodes in the XML document are nested.
Parsing an Entire Document
When you parse an entire document with parse(), an ElementTree
instance is returned. The tree knows about all of the data in the
input document, and the nodes of the tree can be searched or
manipulated in place. While this flexibility can make working with
the parsed document a little easier, it typically takes more memory
than an event-based parsing approach since the entire document must be
loaded at one time.
The memory footprint of small, simple documents such as this list of
podcasts represented as an OPML outline is not significant:
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
<head>
<title>My Podcasts</title>
<dateCreated>Sun, 07 Mar 2010 15:53:26 GMT</dateCreated>
<dateModified>Sun, 07 Mar 2010 15:53:26 GMT</dateModified>
</head>
<body>
<outline text="Science and Tech">
<outline text="APM: Future Tense" type="rss"
xmlUrl="http://www.publicradio.org/columns/futuretense/podcast.xml"
htmlUrl="http://www.publicradio.org/columns/futuretense/" />
<outline text="Engines Of Our Ingenuity Podcast" type="rss"
xmlUrl="http://www.npr.org/rss/podcast.php?id=510030"
htmlUrl="http://www.uh.edu/engines/engines.htm" />
<outline text="Science &amp; the City" type="rss"
xmlUrl="http://www.nyas.org/Podcasts/Atom.axd"
htmlUrl="http://www.nyas.org/WhatWeDo/SciencetheCity.aspx" />
</outline>
<outline text="Books and Fiction">
<outline text="Podiobooker" type="rss"
xmlUrl="http://feeds.feedburner.com/podiobooks"
htmlUrl="http://www.podiobooks.com/blog" />
<outline text="The Drabblecast" type="rss"
xmlUrl="http://web.me.com/normsherman/Site/Podcast/rss.xml"
htmlUrl="http://web.me.com/normsherman/Site/Podcast/Podcast.html" />
<outline text="tor.com / category / tordotstories" type="rss"
xmlUrl="http://www.tor.com/rss/category/TorDotStories"
htmlUrl="http://www.tor.com/" />
</outline>
<outline text="Computers and Programming">
<outline text="MacBreak Weekly" type="rss"
xmlUrl="http://leo.am/podcasts/mbw"
htmlUrl="http://twit.tv/mbw" />
<outline text="FLOSS Weekly" type="rss"
xmlUrl="http://leo.am/podcasts/floss"
htmlUrl="http://twit.tv" />
<outline text="Core Intuition" type="rss"
xmlUrl="http://www.coreint.org/podcast.xml"
htmlUrl="http://www.coreint.org/" />
</outline>
<outline text="Python">
<outline text="PyCon Podcast" type="rss"
xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"
htmlUrl="http://advocacy.python.org/podcasts/" />
<outline text="A Little Bit of Python" type="rss"
xmlUrl="http://advocacy.python.org/podcasts/littlebit.rss"
htmlUrl="http://advocacy.python.org/podcasts/" />
<outline text="Django Dose Everything Feed" type="rss"
xmlUrl="http://djangodose.com/everything/feed/" />
</outline>
<outline text="Miscelaneous">
<outline text="dhellmann's CastSampler Feed" type="rss"
xmlUrl="http://www.castsampler.com/cast/feed/rss/dhellmann/"
htmlUrl="http://www.castsampler.com/users/dhellmann/" />
</outline>
</body>
</opml>
To parse the file, pass an open file handle to parse(). It will
read the data, parse the XML, and return an ElementTree object.
from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

print tree
$ python ElementTree_parse_opml.py
<xml.etree.ElementTree.ElementTree instance at 0x82f58>
Traversing the Parsed Tree
Now that we have a parsed XML tree, we can iterate over it, visiting
all of the children in order and examining their attributes and
contents.
from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.getiterator():
    print node.tag, node.attrib
Here we print the entire tree, one tag at a time.
$ python ElementTree_dump_opml.py
opml {'version': '1.0'}
head {}
title {}
dateCreated {}
dateModified {}
body {}
outline {'text': 'Science and Tech'}
outline {'xmlUrl': 'http://www.publicradio.org/columns/futuretense/podcast.xml', 'text': 'APM: Future Tense', 'type': 'rss', 'htmlUrl': 'http://www.publicradio.org/columns/futuretense/'}
outline {'xmlUrl': 'http://www.npr.org/rss/podcast.php?id=510030', 'text': 'Engines Of Our Ingenuity Podcast', 'type': 'rss', 'htmlUrl': 'http://www.uh.edu/engines/engines.htm'}
outline {'xmlUrl': 'http://www.nyas.org/Podcasts/Atom.axd', 'text': 'Science & the City', 'type': 'rss', 'htmlUrl': 'http://www.nyas.org/WhatWeDo/SciencetheCity.aspx'}
outline {'text': 'Books and Fiction'}
outline {'xmlUrl': 'http://feeds.feedburner.com/podiobooks', 'text': 'Podiobooker', 'type': 'rss', 'htmlUrl': 'http://www.podiobooks.com/blog'}
outline {'xmlUrl': 'http://web.me.com/normsherman/Site/Podcast/rss.xml', 'text': 'The Drabblecast', 'type': 'rss', 'htmlUrl': 'http://web.me.com/normsherman/Site/Podcast/Podcast.html'}
outline {'xmlUrl': 'http://www.tor.com/rss/category/TorDotStories', 'text': 'tor.com / category / tordotstories', 'type': 'rss', 'htmlUrl': 'http://www.tor.com/'}
outline {'text': 'Computers and Programming'}
outline {'xmlUrl': 'http://leo.am/podcasts/mbw', 'text': 'MacBreak Weekly', 'type': 'rss', 'htmlUrl': 'http://twit.tv/mbw'}
outline {'xmlUrl': 'http://leo.am/podcasts/floss', 'text': 'FLOSS Weekly', 'type': 'rss', 'htmlUrl': 'http://twit.tv'}
outline {'xmlUrl': 'http://www.coreint.org/podcast.xml', 'text': 'Core Intuition', 'type': 'rss', 'htmlUrl': 'http://www.coreint.org/'}
outline {'text': 'Python'}
outline {'xmlUrl': 'http://advocacy.python.org/podcasts/pycon.rss', 'text': 'PyCon Podcast', 'type': 'rss', 'htmlUrl': 'http://advocacy.python.org/podcasts/'}
outline {'xmlUrl': 'http://advocacy.python.org/podcasts/littlebit.rss', 'text': 'A Little Bit of Python', 'type': 'rss', 'htmlUrl': 'http://advocacy.python.org/podcasts/'}
outline {'xmlUrl': 'http://djangodose.com/everything/feed/', 'text': 'Django Dose Everything Feed', 'type': 'rss'}
outline {'text': 'Miscelaneous'}
outline {'xmlUrl': 'http://www.castsampler.com/cast/feed/rss/dhellmann/', 'text': "dhellmann's CastSampler Feed", 'type': 'rss', 'htmlUrl': 'http://www.castsampler.com/users/dhellmann/'}
If we wanted to print only the groups of names and feed URLs for the
podcasts, leaving out all of the data in the header section, we
could iterate over only the outline nodes and print the
text and xmlUrl attributes.
from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.getiterator('outline'):
    name = node.attrib.get('text')
    url = node.attrib.get('xmlUrl')
    if name and url:
        print ' %s :: %s' % (name, url)
    else:
        print name
Because we passed 'outline' to tree.getiterator(), processing is
limited to nodes with the tag 'outline'.
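As an aside for readers on Python 3: getiterator() was removed in Python 3.9, and iter() accepts the same optional tag filter. The following is a minimal sketch of the same loop for Python 3, assuming a small inline stand-in for podcasts.opml (the SAMPLE string and the list_feeds() helper are illustrative names, not part of the original script):

```python
import io
from xml.etree import ElementTree

# Hypothetical two-node stand-in for podcasts.opml.
SAMPLE = """<opml version="1.0">
  <head><title>My Podcasts</title></head>
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss" />
    </outline>
  </body>
</opml>"""


def list_feeds(source):
    """Return one formatted line per outline node, in document order."""
    tree = ElementTree.parse(source)
    lines = []
    # iter('outline') visits only elements whose tag is 'outline'.
    for node in tree.iter('outline'):
        name = node.attrib.get('text')
        url = node.attrib.get('xmlUrl')
        if name and url:
            lines.append('  %s :: %s' % (name, url))
        else:
            lines.append(name)
    return lines


feed_lines = list_feeds(io.StringIO(SAMPLE))
print('\n'.join(feed_lines))
```

The output of the original Python 2 script follows.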
$ python ElementTree_show_feed_urls.py
Science and Tech
APM: Future Tense :: http://www.publicradio.org/columns/futuretense/podcast.xml
Engines Of Our Ingenuity Podcast :: http://www.npr.org/rss/podcast.php?id=510030
Science & the City :: http://www.nyas.org/Podcasts/Atom.axd
Books and Fiction
Podiobooker :: http://feeds.feedburner.com/podiobooks
The Drabblecast :: http://web.me.com/normsherman/Site/Podcast/rss.xml
tor.com / category / tordotstories :: http://www.tor.com/rss/category/TorDotStories
Computers and Programming
MacBreak Weekly :: http://leo.am/podcasts/mbw
FLOSS Weekly :: http://leo.am/podcasts/floss
Core Intuition :: http://www.coreint.org/podcast.xml
Python
PyCon Podcast :: http://advocacy.python.org/podcasts/pycon.rss
A Little Bit of Python :: http://advocacy.python.org/podcasts/littlebit.rss
Django Dose Everything Feed :: http://djangodose.com/everything/feed/
Miscelaneous
dhellmann's CastSampler Feed :: http://www.castsampler.com/cast/feed/rss/dhellmann/
Finding Nodes in a Document
Walking the entire tree this way to search for relevant nodes
can be error prone. In the example above, we had to look at
each outline node to determine if it was a group (nodes with only a
“text” attribute) or podcast (with both “text” and “xmlUrl”). If we
were writing a podcast downloader and needed to produce a simple list
of the podcast feed URLs, without names or groups, we might simplify
the logic using findall() to look for nodes with more descriptive
search characteristics.
A first pass at converting the above example might construct an XPath
argument to look for all outline nodes.
from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//outline'):
    url = node.attrib.get('xmlUrl')
    if url:
        print url
The logic in this version is not substantially different from the
version using getiterator(). We still have to check for the
presence of the URL, but we no longer print the group name when
the URL is not found.
$ python ElementTree_find_feeds_by_tag.py
http://www.publicradio.org/columns/futuretense/podcast.xml
http://www.npr.org/rss/podcast.php?id=510030
http://www.nyas.org/Podcasts/Atom.axd
http://feeds.feedburner.com/podiobooks
http://web.me.com/normsherman/Site/Podcast/rss.xml
http://www.tor.com/rss/category/TorDotStories
http://leo.am/podcasts/mbw
http://leo.am/podcasts/floss
http://www.coreint.org/podcast.xml
http://advocacy.python.org/podcasts/pycon.rss
http://advocacy.python.org/podcasts/littlebit.rss
http://djangodose.com/everything/feed/
http://www.castsampler.com/cast/feed/rss/dhellmann/
Another version can take advantage of the fact that we know the
outline nodes are only nested two levels deep. If we change the
search path to .//outline/outline we will process only the second
level of outline nodes.
from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//outline/outline'):
    url = node.attrib.get('xmlUrl')
    print url
We expect all of the outline nodes nested two levels deep in the input
to have the xmlUrl attribute referring to the podcast feed, so if we
were brave we could skip checking for the attribute before using
it.
$ python ElementTree_find_feeds_by_structure.py
http://www.publicradio.org/columns/futuretense/podcast.xml
http://www.npr.org/rss/podcast.php?id=510030
http://www.nyas.org/Podcasts/Atom.axd
http://feeds.feedburner.com/podiobooks
http://web.me.com/normsherman/Site/Podcast/rss.xml
http://www.tor.com/rss/category/TorDotStories
http://leo.am/podcasts/mbw
http://leo.am/podcasts/floss
http://www.coreint.org/podcast.xml
http://advocacy.python.org/podcasts/pycon.rss
http://advocacy.python.org/podcasts/littlebit.rss
http://djangodose.com/everything/feed/
http://www.castsampler.com/cast/feed/rss/dhellmann/
This version is limited to our existing structure, though, so if the
outline nodes are ever rearranged into a deeper tree it will stop
working.
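One way to avoid depending on the nesting depth is to describe the nodes by their attributes instead. ElementTree's limited XPath support in Python 2.7 and later includes an [@attrib] predicate, which selects only elements that carry the named attribute. The sketch below is written for Python 3 and assumes a small inline sample in which one feed has been moved into a deeper, hypothetical "Web" group:

```python
from xml.etree import ElementTree

# Hypothetical sample with feeds nested at two different depths.
SAMPLE = """<opml version="1.0">
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss" />
      <outline text="Web">
        <outline text="Django Dose Everything Feed"
                 xmlUrl="http://djangodose.com/everything/feed/" />
      </outline>
    </outline>
  </body>
</opml>"""

root = ElementTree.fromstring(SAMPLE)

# [@xmlUrl] keeps only outline elements that have an xmlUrl attribute,
# so group nodes are skipped no matter how deeply feeds are nested.
urls = [node.get('xmlUrl') for node in root.findall('.//outline[@xmlUrl]')]
for url in urls:
    print(url)
```

With this search path there is no need to test each node for the attribute, and rearranging the groups does not change which nodes are found.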
Parsed Node Attributes
The items returned by findall() and getiterator() are Element
objects, each representing a node in the XML parse tree. Each Element
has attributes for accessing data pulled out of the XML. This can be
illustrated with a somewhat more contrived example input file,
data.xml:
1 | <?xml version="1.0" encoding="UTF-8"?> |
The “attributes” of a node are available in the attrib property,
which acts like a dictionary.
from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('./with_attributes')
print node.tag
for name, value in sorted(node.attrib.items()):
    print ' %-4s = "%s"' % (name, value)
The node on line 5 of the input file has 2 attributes, name and foo.
$ python ElementTree_node_attributes.py
with_attributes
foo = "bar"
name = "value"
The text content of the nodes is available, along with the “tail” text
that comes after the end of a close tag.
from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for path in [ './child', './child_with_tail' ]:
    node = tree.find(path)
    print node.tag
    print ' child node text:', node.text
    print ' and tail text :', node.tail
The child node on line 3 contains embedded text, and the node on
line 4 has text with a tail (including any whitespace).
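The difference between text and tail is easy to mix up, so here is a minimal Python 3 sketch using an inline document modeled on the two child nodes. The SAMPLE string is an assumption for illustration (in particular the <top> root element name is made up), not a copy of data.xml:

```python
from xml.etree import ElementTree

# Hypothetical stand-in for part of data.xml; the root tag name is assumed.
SAMPLE = """<top>
  <child>This child contains text.</child>
  <child_with_tail>This child has regular text.</child_with_tail>And "tail" text.
</top>"""

root = ElementTree.fromstring(SAMPLE)
child = root.find('./child')
with_tail = root.find('./child_with_tail')

# text is the content between the opening and closing tags.
print(repr(child.text))
# tail is whatever follows the closing tag, up to the next tag.
# For the first child that is only the whitespace used for indentation.
print(repr(child.tail))
print(repr(with_tail.tail))
```

Using repr() makes the whitespace-only tail of the first child visible, which explains the apparently empty "tail text" line in the output below.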
$ python ElementTree_node_text.py
child
child node text: This child contains text.
and tail text :
child_with_tail
child node text: This child has regular text.
and tail text : And "tail" text.
Conveniently, XML entity references embedded in the document are
converted to the appropriate characters before values are returned.
from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('entity_expansion')
print node.tag
print ' in attribute:', node.attrib['attribute']
print ' in text :', node.text
The conversion saves you from having to worry about an implementation
detail of representing certain characters in an XML document.
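The same behavior can be seen in isolation with a short Python 3 sketch. The element below is a hypothetical one modeled on the entity_expansion node, using one numeric character reference and one named entity; serializing the node again shows the ampersands being escaped on the way back out:

```python
from xml.etree import ElementTree

# Hypothetical element; both &#38; and &amp; stand for an ampersand.
SAMPLE = ('<entity_expansion attribute="This &#38; That">'
          'That &amp; This</entity_expansion>')

node = ElementTree.fromstring(SAMPLE)

# Parsed values contain the plain character, not the reference.
print(node.attrib['attribute'])
print(node.text)

# Serializing the element escapes the ampersands again.
serialized = ElementTree.tostring(node, encoding='unicode')
print(serialized)
```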
$ python ElementTree_entity_references.py
entity_expansion
in attribute: This & That
in text : That & This
Watching Events While Parsing
The other API useful for processing XML documents is event-based. The
parser generates start events for opening tags and end events
for closing tags. Iterating over the event stream lets you extract
data from the document while parsing it, which is convenient if you
don’t need to manipulate the entire document afterwards and if you
want to avoid holding the entire parsed document in memory.
iterparse() returns an iterable that produces tuples containing
the name of the event and the node triggering the event. Events can
be one of:
- start: A new tag has been encountered. The closing angle bracket of the tag was processed, but not the contents.
- end: The closing angle bracket of a closing tag has been processed. All of the children were already processed.
- start-ns: Start a namespace declaration.
- end-ns: End a namespace declaration.
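Since podcasts.opml declares no namespaces, it never produces the two namespace events. The four event types can be observed together on a tiny in-memory document, as in this Python 3 sketch. The document and the http://example.com/ns URI are made up for illustration. Note that the second item in each tuple is an Element only for start and end events:

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical document with a single namespace declaration.
SAMPLE = b"""<root xmlns:ex="http://example.com/ns">
  <ex:item>text</ex:item>
</root>"""

seen = []
events = ('start', 'end', 'start-ns', 'end-ns')
for event, payload in iterparse(io.BytesIO(SAMPLE), events=events):
    if event == 'start-ns':
        # For start-ns the payload is a (prefix, uri) tuple.
        seen.append((event, payload))
    elif event == 'end-ns':
        # For end-ns the payload is None.
        seen.append((event, None))
    else:
        # For start and end the payload is the Element itself.
        seen.append((event, payload.tag))

for entry in seen:
    print(entry)
```

The example that follows reports the same kind of event stream for the podcast list, indenting each line by nesting depth.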
from xml.etree.ElementTree import iterparse

depth = 0
prefix_width = 8
prefix_dots = '.' * prefix_width
line_template = '{prefix:<0.{prefix_len}}{event:<8}{suffix:<{suffix_len}} {node.tag:<12} {node_id}'

for (event, node) in iterparse('podcasts.opml', ['start', 'end', 'start-ns', 'end-ns']):
    if event == 'end':
        depth -= 1

    prefix_len = depth * 2

    print line_template.format(prefix=prefix_dots,
                               prefix_len=prefix_len,
                               suffix='',
                               suffix_len=(prefix_width - prefix_len),
                               node=node,
                               node_id=id(node),
                               event=event,
                               )

    if event == 'start':
        depth += 1
By default, only end events are generated. To see other events,
pass the list of event names you want to receive to iterparse(),
as in this example:
$ python ElementTree_show_all_events.py
start opml 876256
..start head 876336
....start title 888920
....end title 888920
....start dateCreated 889280
....end dateCreated 889280
....start dateModified 889320
....end dateModified 889320
..end head 876336
..start body 889400
....start outline 889560
......start outline 889600
......end outline 889600
......start outline 889480
......end outline 889480
......start outline 889680
......end outline 889680
....end outline 889560
....start outline 889720
......start outline 889760
......end outline 889760
......start outline 889840
......end outline 889840
......start outline 889920
......end outline 889920
....end outline 889720
....start outline 889880
......start outline 890040
......end outline 890040
......start outline 890120
......end outline 890120
......start outline 890200
......end outline 890200
....end outline 889880
....start outline 890240
......start outline 890360
......end outline 890360
......start outline 890440
......end outline 890440
......start outline 890520
......end outline 890520
....end outline 890240
....start outline 890640
......start outline 890720
......end outline 890720
....end outline 890640
..end body 889400
end opml 876256
The event-style of processing may be more natural for some operations,
such as converting XML input to some other format. For example,
suppose we want to convert the list of podcasts we have been working
with from an XML file to a data file we can load into a spreadsheet or
database application. We don’t need to hold the entire data set in
memory at a time, since we’re simply changing the format.
import csv
from xml.etree.ElementTree import iterparse
import sys

writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)

group_name = ''

for (event, node) in iterparse('podcasts.opml', events=['start']):
    if node.tag != 'outline':
        # Ignore anything not part of the outline
        continue
    if not node.attrib.get('xmlUrl'):
        # Remember the current group
        group_name = node.attrib['text']
    else:
        # Output a podcast entry
        writer.writerow( (group_name, node.attrib['text'],
                          node.attrib['xmlUrl'],
                          node.attrib.get('htmlUrl', ''),
                          )
                         )
This example program converts our podcast list to a CSV file, ready to
be imported into another application.
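One caveat about memory: iterparse() still builds the tree as it goes, so a very large input will grow in memory unless processed elements are released. A common approach is to also listen for end events and call clear() on each element once it is finished. The following Python 3 sketch shows that variation, assuming a small inline sample and writing to an in-memory buffer instead of sys.stdout:

```python
import csv
import io
from xml.etree.ElementTree import iterparse

# Hypothetical two-feed stand-in for podcasts.opml.
SAMPLE = b"""<opml version="1.0">
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"
               htmlUrl="http://advocacy.python.org/podcasts/" />
      <outline text="Django Dose Everything Feed" type="rss"
               xmlUrl="http://djangodose.com/everything/feed/" />
    </outline>
  </body>
</opml>"""

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
group_name = ''

for event, node in iterparse(io.BytesIO(SAMPLE), events=('start', 'end')):
    if node.tag != 'outline':
        # Ignore anything not part of the outline
        continue
    if event == 'start':
        # Attributes are already available when the start event fires.
        if not node.attrib.get('xmlUrl'):
            group_name = node.attrib['text']
        else:
            writer.writerow((group_name, node.attrib['text'],
                             node.attrib['xmlUrl'],
                             node.attrib.get('htmlUrl', '')))
    else:
        # The element is complete; drop its attributes and children.
        node.clear()

csv_text = output.getvalue()
print(csv_text, end='')
```

For a file as small as the podcast list the difference is negligible, which is why the original script does not bother. Its output follows.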
$ python ElementTree_write_podcast_csv.py
"Science and Tech","APM: Future Tense","http://www.publicradio.org/columns/futuretense/podcast.xml","http://www.publicradio.org/columns/futuretense/"
"Science and Tech","Engines Of Our Ingenuity Podcast","http://www.npr.org/rss/podcast.php?id=510030","http://www.uh.edu/engines/engines.htm"
"Science and Tech","Science & the City","http://www.nyas.org/Podcasts/Atom.axd","http://www.nyas.org/WhatWeDo/SciencetheCity.aspx"
"Books and Fiction","Podiobooker","http://feeds.feedburner.com/podiobooks","http://www.podiobooks.com/blog"
"Books and Fiction","The Drabblecast","http://web.me.com/normsherman/Site/Podcast/rss.xml","http://web.me.com/normsherman/Site/Podcast/Podcast.html"
"Books and Fiction","tor.com / category / tordotstories","http://www.tor.com/rss/category/TorDotStories","http://www.tor.com/"
"Computers and Programming","MacBreak Weekly","http://leo.am/podcasts/mbw","http://twit.tv/mbw"
"Computers and Programming","FLOSS Weekly","http://leo.am/podcasts/floss","http://twit.tv"
"Computers and Programming","Core Intuition","http://www.coreint.org/podcast.xml","http://www.coreint.org/"
"Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.rss","http://advocacy.python.org/podcasts/"
"Python","A Little Bit of Python","http://advocacy.python.org/podcasts/littlebit.rss","http://advocacy.python.org/podcasts/"
"Python","Django Dose Everything Feed","http://djangodose.com/everything/feed/",""
"Miscelaneous","dhellmann's CastSampler Feed","http://www.castsampler.com/cast/feed/rss/dhellmann/","http://www.castsampler.com/users/dhellmann/"
Creating Your Own Tree Builder
A potentially more efficient means of handling parse events is to
replace the standard tree builder behavior with your own. The
ElementTree parser uses an XMLTreeBuilder to process the XML and call
methods on a target class to save the results. The usual output is an
ElementTree instance created by the default TreeBuilder class. By
replacing TreeBuilder with your own class, you can receive the events
before the Element nodes are instantiated, saving that portion of the
overhead.
The XML-to-CSV app from the previous section can be translated to a
tree builder.
import csv
from xml.etree.ElementTree import XMLTreeBuilder
import sys

class PodcastListToCSV(object):

    def __init__(self, outputFile):
        self.writer = csv.writer(outputFile, quoting=csv.QUOTE_NONNUMERIC)
        self.group_name = ''
        return

    def start(self, tag, attrib):
        if tag != 'outline':
            # Ignore anything not part of the outline
            return
        if not attrib.get('xmlUrl'):
            # Remember the current group
            self.group_name = attrib['text']
        else:
            # Output a podcast entry
            self.writer.writerow( (self.group_name, attrib['text'],
                                   attrib['xmlUrl'],
                                   attrib.get('htmlUrl', ''),
                                   )
                                  )

    def end(self, tag):
        # Ignore closing tags
        pass

    def data(self, data):
        # Ignore data inside nodes
        pass

    def close(self):
        # Nothing special to do here
        return

target = PodcastListToCSV(sys.stdout)
parser = XMLTreeBuilder(target=target)
with open('podcasts.opml', 'rt') as f:
    for line in f:
        parser.feed(line)
parser.close()
PodcastListToCSV implements the TreeBuilder protocol. Each time a
new XML tag is encountered, start() is called with the tag name
and attributes. When a closing tag is seen end() is called with
the name. In between, data() is called when a node has content
(the tree builder is expected to keep up with the “current” node).
When all of the input is processed, close() is called. It can
return a value, which will be returned to the user of the
XMLTreeBuilder.
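The return value of close() is easy to overlook, so here is a minimal Python 3 sketch of a target that uses it. It relies on XMLParser, the Python 3 parser class, which accepts the same target argument. FeedURLCollector is a hypothetical class written for this illustration, and the input is fed in as three made-up fragments:

```python
from xml.etree.ElementTree import XMLParser


class FeedURLCollector:
    """Hypothetical parser target that gathers xmlUrl attribute values."""

    def __init__(self):
        self.urls = []

    def start(self, tag, attrib):
        # Called for each opening tag with its attribute dictionary.
        if tag == 'outline' and attrib.get('xmlUrl'):
            self.urls.append(attrib['xmlUrl'])

    def end(self, tag):
        # Ignore closing tags
        pass

    def data(self, data):
        # Ignore data inside nodes
        pass

    def close(self):
        # Whatever is returned here is handed back by parser.close().
        return self.urls


parser = XMLParser(target=FeedURLCollector())
# The input can arrive in arbitrary chunks.
parser.feed('<opml><body><outline text="Python">')
parser.feed('<outline text="PyCon Podcast" '
            'xmlUrl="http://advocacy.python.org/podcasts/pycon.rss" />')
parser.feed('</outline></body></opml>')
urls = parser.close()
print(urls)
```

No Element objects are created at any point; the target only ever sees tag names, attribute dictionaries, and text.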
$ python ElementTree_podcast_csv_treebuilder.py
"Science and Tech","APM: Future Tense","http://www.publicradio.org/columns/futuretense/podcast.xml","http://www.publicradio.org/columns/futuretense/"
"Science and Tech","Engines Of Our Ingenuity Podcast","http://www.npr.org/rss/podcast.php?id=510030","http://www.uh.edu/engines/engines.htm"
"Science and Tech","Science & the City","http://www.nyas.org/Podcasts/Atom.axd","http://www.nyas.org/WhatWeDo/SciencetheCity.aspx"
"Books and Fiction","Podiobooker","http://feeds.feedburner.com/podiobooks","http://www.podiobooks.com/blog"
"Books and Fiction","The Drabblecast","http://web.me.com/normsherman/Site/Podcast/rss.xml","http://web.me.com/normsherman/Site/Podcast/Podcast.html"
"Books and Fiction","tor.com / category / tordotstories","http://www.tor.com/rss/category/TorDotStories","http://www.tor.com/"
"Computers and Programming","MacBreak Weekly","http://leo.am/podcasts/mbw","http://twit.tv/mbw"
"Computers and Programming","FLOSS Weekly","http://leo.am/podcasts/floss","http://twit.tv"
"Computers and Programming","Core Intuition","http://www.coreint.org/podcast.xml","http://www.coreint.org/"
"Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.rss","http://advocacy.python.org/podcasts/"
"Python","A Little Bit of Python","http://advocacy.python.org/podcasts/littlebit.rss","http://advocacy.python.org/podcasts/"
"Python","Django Dose Everything Feed","http://djangodose.com/everything/feed/",""
"Miscelaneous","dhellmann's CastSampler Feed","http://www.castsampler.com/cast/feed/rss/dhellmann/","http://www.castsampler.com/users/dhellmann/"
Parsing Strings
To work with smaller bits of XML text, especially string literals as
might be embedded in the source of a program, use
xml.etree.ElementTree.XML and pass a single argument, the string
containing the XML to be parsed.
from xml.etree.ElementTree import XML

parsed = XML('''
<root>
  <group>
    <child id="a">This is child "a".</child>
    <child id="b">This is child "b".</child>
  </group>
  <group>
    <child id="c">This is child "c".</child>
  </group>
</root>
''')

print 'parsed =', parsed

for elem in parsed.getiterator():
    print elem.tag
    if elem.text is not None and elem.text.strip():
        print ' text: "%s"' % elem.text
    if elem.tail is not None and elem.tail.strip():
        print ' tail: "%s"' % elem.tail
    for name, value in sorted(elem.attrib.items()):
        print ' %-4s = "%s"' % (name, value)
    print
Notice that unlike with parse(), the return value is an Element
instance instead of an ElementTree.
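The difference matters when code written for one type is handed the other. This short Python 3 sketch, using a made-up one-child document, shows both return types and how to move between them: wrap an Element in ElementTree() to get a tree, or call getroot() on a tree to get its top-level Element.

```python
import io
from xml.etree import ElementTree

# Hypothetical one-child document for illustration.
TEXT = '<root><group><child id="a">This is child "a".</child></group></root>'

element = ElementTree.XML(TEXT)               # returns an Element
tree = ElementTree.parse(io.StringIO(TEXT))   # returns an ElementTree

print(type(element).__name__)
print(type(tree).__name__)

# Convert in either direction when an API needs the other type.
wrapped = ElementTree.ElementTree(element)
root = tree.getroot()
print(wrapped.getroot() is element, root.tag)
```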
$ python ElementTree_XML.py
parsed = <Element root at d4e40>
root
group
child
text: "This is child "a"."
id = "a"
child
text: "This is child "b"."
id = "b"
group
child
text: "This is child "c"."
id = "c"
For structured XML that uses the “id” attribute to identify unique
nodes of interest, XMLID() is a convenient way to access the parse
results.
from xml.etree.ElementTree import XMLID

tree, id_map = XMLID('''
<root>
  <group>
    <child id="a">This is child "a".</child>
    <child id="b">This is child "b".</child>
  </group>
  <group>
    <child id="c">This is child "c".</child>
  </group>
</root>
''')

for key, value in sorted(id_map.items()):
    print '%s = %s' % (key, value)
XMLID() returns the parsed tree as an Element object,
along with a dictionary mapping the id attribute strings to the
individual nodes in the tree.
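The practical benefit of the dictionary is direct lookup: a node can be fetched by its id without searching the tree. A minimal Python 3 sketch, using a shortened, hypothetical version of the input above:

```python
from xml.etree.ElementTree import XMLID

root, id_map = XMLID('''
<root>
  <group>
    <child id="a">This is child "a".</child>
    <child id="b">This is child "b".</child>
  </group>
</root>
''')

# Fetch a node by its id instead of walking the tree.
node_b = id_map['b']
print(node_b.tag, node_b.text)

# Only elements that carry an id attribute appear in the map.
print(sorted(id_map))
```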
$ python ElementTree_XMLID.py
a = <Element child at d3eb8>
b = <Element child at d3d78>
c = <Element child at d9030>
See also
- Outline Processor Markup Language, OPML: Dave Winer’s OPML specification and documentation.
- XPath Support in ElementTree: Part of Fredrik Lundh’s original documentation for ElementTree.
- csv: Read and write comma-separated-value files.